How to perform data mining on massive data?

This is a note to myself on choosing an appropriate strategy for data mining.

Now that I have figured out the basic structure of data mining and learned some of the JDM API, I have started to doubt whether JDM is really a good approach. To get a well-designed model for prediction, we have to process a large amount of data. However, there is no way for JDM to run the computation with distributed computing. The only way to improve the computing speed is to apply asynchronous computing, so there is a limit to how much JDM can be sped up. In fact, there are many frameworks and APIs we can use for data mining that give a better speedup than JDM, such as Mahout, a Hadoop library.
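To make the limitation concrete, here is a minimal sketch of asynchronous computing inside a single JVM. It does not use the JDM API at all; `AsyncChunkSum` and its `sum` method are hypothetical names for this note, and the summation only stands in for a real pass over the data. The point is that the work is split across threads of one machine, so the speedup is bounded by that machine's cores and memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper (not part of JDM): runs one aggregate pass
// asynchronously on the threads of a single JVM.
class AsyncChunkSum {

    // Sums the values by handing fixed-size chunks to a thread pool.
    static double sum(double[] values, int chunkSize, int threads) {
        if (chunkSize <= 0 || threads <= 0) {
            throw new IllegalArgumentException("chunkSize and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<CompletableFuture<Double>> parts = new ArrayList<>();
            for (int start = 0; start < values.length; start += chunkSize) {
                final int from = start;
                final int to = Math.min(start + chunkSize, values.length);
                // Each chunk is summed asynchronously on a pool thread.
                parts.add(CompletableFuture.supplyAsync(() -> {
                    double partial = 0.0;
                    for (int i = from; i < to; i++) {
                        partial += values[i];
                    }
                    return partial;
                }, pool));
            }
            // Wait for every chunk and combine the partial results.
            double total = 0.0;
            for (CompletableFuture<Double> part : parts) {
                total += part.join();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 1.0);
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(sum(data, 100_000, cores));
    }
}
```

Adding more threads than cores gives no further gain here, which is the ceiling described above. A distributed framework removes that ceiling by spreading the chunks over several machines.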

Although Mahout looks like our best choice, it still has some problems, such as the hardware it needs and the technical skill required to use it. In conclusion, if we decide to take only a small part of the data for testing, JDM is sufficient for us. However, if we want to do real work on the full data, JDM can't be used.
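One possible way to take a small part of the data for testing is reservoir sampling, which draws a fixed-size uniform sample in a single pass without loading everything into memory. This is only a sketch under my own assumptions: `ReservoirSampler` is a hypothetical helper, it is not part of JDM or Mahout, and the records are assumed to arrive through a plain `Iterator`.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not part of JDM or Mahout): draws a uniform
// sample of at most k records from a stream of unknown length.
class ReservoirSampler {

    static <T> List<T> sample(Iterator<T> records, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must not be negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k records.
                reservoir.add(record);
            } else {
                // Pick a slot in [0, seen); the new record replaces an
                // old one only if the slot falls inside the reservoir,
                // which happens with probability k / seen.
                long slot = (long) (rng.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }
}
```

For example, `ReservoirSampler.sample(rows.iterator(), 1000, new Random())` would return up to 1000 rows that could then be fed to JDM for a test run, where `rows` is whatever collection or reader holds the full data.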
