YahooArchive / anthelion

Anthelion is a plugin for Apache Nutch to crawl semantic annotations within HTML pages.
https://labs.yahoo.com/publications/6702/focused-crawling-structured-data
Apache License 2.0
2.84k stars, 666 forks

It seems an HDFS file path containing a ':' (colon) throws an exception #4

Open stevegy opened 8 years ago

stevegy commented 8 years ago

I downloaded the whole source code and built it successfully. When I tried to run a crawl test: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/nutch 2 I ran into this URI path name issue. hadoop.log.zip

I have attached the log file. It seems the issue with special characters in HDFS file path names is still there?

2016-01-03 13:27:08,405 INFO fetcher.Fetcher - Fetcher: starting at 2016-01-03 13:27:08
2016-01-03 13:27:08,405 INFO fetcher.Fetcher - Fetcher: segment: TestCrawl/segments/drwxr-xr-xnn4nstevennstaffnn136nJannn3n13:24n20160103090925
2016-01-03 13:27:08,406 INFO fetcher.Fetcher - Fetcher Timelimit set for : 1451809628406
2016-01-03 13:27:08,631 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-01-03 13:27:08,677 ERROR fetcher.Fetcher - Fetcher: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: drwxr-xr-xnn4nstevennstaffnn136nJannn3n13:24n20160103090925
    at org.apache.hadoop.fs.Path.initialize(Path.java:148)
    at org.apache.hadoop.fs.Path.<init>(Path.java:126)
    at org.apache.hadoop.fs.Path.<init>(Path.java:50)
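For anyone trying to see why the colon matters, here is a minimal sketch that reproduces the same URISyntaxException reason using only java.net.URI, without Hadoop on the classpath. It assumes that the path string is split the way the trace suggests: when a colon appears before any slash, the text before it is treated as a URI scheme, and the remainder becomes a relative path. The class name ColonPathDemo and the helpers toUri and describe are hypothetical names for illustration, not part of Hadoop or Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class ColonPathDemo {

    // Simplified approximation (an assumption, not Hadoop's actual code):
    // if a colon occurs before any slash, treat the prefix as the scheme.
    static URI toUri(String pathString) throws URISyntaxException {
        String scheme = null;
        int start = 0;
        int colon = pathString.indexOf(':');
        int slash = pathString.indexOf('/');
        if (colon != -1 && (slash == -1 || colon < slash)) {
            scheme = pathString.substring(0, colon);
            start = colon + 1;
        }
        String path = pathString.substring(start);
        // java.net.URI rejects a non-empty path without a leading '/'
        // whenever a scheme is present.
        return new URI(scheme, null, path, null, null);
    }

    // Returns a short description instead of throwing, for easy comparison.
    static String describe(String pathString) {
        try {
            return "ok: " + toUri(pathString);
        } catch (URISyntaxException e) {
            return "error: " + e.getReason();
        }
    }

    public static void main(String[] args) {
        // The mangled segment name taken from the log above.
        System.out.println(describe(
            "drwxr-xr-xnn4nstevennstaffnn136nJannn3n13:24n20160103090925"));
        // A plain timestamp-style name, like the trailing part of that string.
        System.out.println(describe("20160103090925"));
    }
}
```

Under these assumptions, the first call fails because everything before "13:" is parsed as a scheme, while the plain timestamp name has no colon and is accepted as a relative path. This suggests the real problem is the malformed segment name rather than HDFS itself.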

petarR commented 8 years ago

Hi,

You could try the fix given in this thread.

Or simply use the following command to start the crawl: runtime/local/bin/nutch crawl urls/ -solr http://localhost:8983/solr/ -dir TestCrawl -depth 3 -topN 50