USCDataScience / nutch-analytics

Nutch Crawl Analysis - Spark based project
Apache License 2.0

Nutch-Analytics

This is an Apache Spark based project to analyze crawls generated by Apache Nutch. The project is at an early stage and currently provides a single feature, the CDRv2 dump.

The vision is to continue developing analytical features for Nutch using Spark. This work will also intersect with awesome concepts like machine learning and natural language processing.

Build and Deploy

mvn clean install

Run Analytics

java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump -m 'local[*]' -s PATH_TO_SEGMENT_FOLDER -o OUTPUT_FILE -l PATH_TO_LINK_DB
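As a minimal sketch of the command above with the placeholders filled in: the three paths below are hypothetical and should be replaced with your own crawl locations, and the `run_dump` helper is only an illustrative wrapper, not part of the project. The value `local[*]` is Spark's master URL for running locally on all available cores; it is quoted here so that the shell does not try to interpret the brackets as a filename pattern. The reading of `-s`, `-o` and `-l` is taken from the placeholder names in the command above.

```shell
#!/bin/sh
# Hypothetical locations; replace with the paths of your own Nutch crawl.
SEGMENT_DIR="crawl/segments"     # assumed folder holding the Nutch segments (-s)
LINKDB_DIR="crawl/linkdb"        # assumed Nutch LinkDB folder (-l)
OUTPUT_FILE="cdrv2-dump.out"     # assumed name for the dump output (-o)
MASTER='local[*]'                # Spark master URL: local mode, all cores (-m)

# Illustrative wrapper around the documented command line.
run_dump() {
  java -cp analytics-1.0.jar gov.nasa.jpl.analytics.dump.Cdrv2Dump \
    -m "$MASTER" -s "$SEGMENT_DIR" -o "$OUTPUT_FILE" -l "$LINKDB_DIR"
}

# Run only when the jar is in the current directory; otherwise explain why not.
if [ -f analytics-1.0.jar ]; then
  run_dump
else
  echo "analytics-1.0.jar not found in $(pwd); build it or change to its folder first" >&2
fi
```

Keeping the values in variables makes it easy to point the same command at a different crawl without retyping the class name and flags.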

Contact Us

If you have any questions or suggestions, please send them to irds-l@mymaillists.usc.edu

Website: http://irds.usc.edu