camilotejeiro / camilotejeiro.github.io

Camilo Tejeiro Blog
https://camilotejeiro.github.io/

Nutch1-Solr5 Integration, Searching the Web #2

Open camilotejeiro opened 7 years ago

camilotejeiro commented 7 years ago

I do not own these comments; they were copied verbatim from my old Wordpress.com blog, in case they help other readers.

Author: GM

Thank you for these tutorials. I had a hard time finding the info I needed when moving to Nutch 1.11. I’ve got everything running now, except my core has no documents in it. From the nutch directory, I run:

bin/nutch solrindex \
    http://localhost:8983/solr/nutch_solr_data_core \
    crawl/crawldb/ -linkdb crawl/linkdb/ $s1

It appears to work well, but no documents make their way into the nutch_solr_data_core according to Solr Admin’s Core Admin.

Searching on any term doesn’t bring back any results.

The only thing that looks like an error message to me is: java.io.IOException: No FileSystem for scheme: http

Can you point me in the right direction? I’m not looking for you to fix my problem. You’ve already been so helpful. Just need a nudge.
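For what it’s worth, the “No FileSystem for scheme: http” message usually means Hadoop tried to open the Solr URL as if it were a filesystem path, which can happen when the arguments get shifted, for example because $s1 is empty and the segment path is missing. A rough check, assuming the crawl directory layout used earlier in this tutorial; the -Dsolr.server.url form is the one used in a later comment in this thread:

# Make sure $s1 really points at a segment directory before indexing
s1=`ls -d crawl/segments/2* | tail -1`
echo "segment: $s1"

# On newer Nutch 1.x the Solr URL can be passed as a property instead of a
# positional argument, so it cannot be mistaken for a path
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch_solr_data_core \
    crawl/crawldb/ -linkdb crawl/linkdb/ $s1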

camilotejeiro commented 7 years ago


Author: steve2k2

GM, same thing with me – I don’t see any error, but no documents are returned. I ran echo $s1 and it appears to be unset. I think I had set it a couple of tutorials ago, but I guess I need to find out how to set it again.

Anyway – here’s what happened for me:

steve@quark:/usr/local/nutch/framework/apache-nutch-1.6$ sudo bin/nutch solrindex http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1
SolrIndexer: starting at 2016-07-08 00:15:45
SolrIndexer: deleting gone documents: false
SolrIndexer: URL filtering: false
SolrIndexer: URL normalizing: false
SolrIndexer: finished at 2016-07-08 00:15:55, elapsed: 00:00:10
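For anyone in the same spot, a minimal way to re-set $s1, assuming the crawl/segments layout from the earlier tutorials (segment directories are named by timestamp, so they start with the year):

# Point $s1 at the most recent segment directory
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1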

camilotejeiro commented 7 years ago


Author: mrcpuig

I had the same problem and I realized that a UI is available at:

http://localhost:8983/solr/#/nutch_solr_data_core/query

The query to get all documents is something like:

http://localhost:8983/solr/nutch_solr_data_core/select?indent=on&q=*:*&wt=json

Thanks for the post 🙂
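A command-line equivalent of that match-all query can be handy for checking whether any documents made it in at all; a sketch, assuming the default port and the core name used in this tutorial:

# Match everything; rows=0 returns only the count (numFound), not the documents
curl 'http://localhost:8983/solr/nutch_solr_data_core/select?q=*:*&rows=0&wt=json&indent=on'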

camilotejeiro commented 7 years ago


Author: suyash

Could not load conf for core nutch_solr_data_core: Error loading solr config from /opt/solr-5.2.1/server/solr/nutch_solr_data_core/conf/solrconfig.xml

Please help out.
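That error usually means Solr could not find or parse the core’s configuration files. A rough way to narrow it down, assuming the Solr 5.2.1 paths from the error message above and running bin/solr from the Solr install directory:

# The core's conf directory should contain solrconfig.xml plus a schema
# (schema.xml or managed-schema)
ls -l /opt/solr-5.2.1/server/solr/nutch_solr_data_core/conf/

# If files are missing or were overwritten with broken copies, restore the
# ones from the tutorial and restart Solr so the core is reloaded
bin/solr restart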

camilotejeiro commented 7 years ago


Author: Markus Moranz

Thank you Camilo,

for the best hint I could find anywhere on Nutch/Solr integration. Nutch worked before and Solr worked before, but your guide above is the glue that was needed, both in the files and in the process.

After first making Nutch 1.12 work with Solr 4.10.4 as documented elsewhere, I got it working with Solr 5.5.2 based on your tips, and today I was finally able to make it work with Solr 6.1.0 as well, on OS X El Capitan.

What I did: download and extract nutch 1.12

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_102.jdk/Contents/Home
echo $JAVA_HOME
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
bin/nutch fetch $s1
bin/nutch parse $s1
bin/nutch updatedb crawl/crawldb $s1
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch parse $s2
bin/nutch updatedb crawl/crawldb $s2
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch parse $s3
bin/nutch updatedb crawl/crawldb $s3
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
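The three rounds above all follow the same generate/fetch/parse/updatedb pattern, so here is the same sequence written as a loop; a sketch only, assuming the same crawl directory layout (the original first round was run without -topN):

# Three crawl rounds: generate a segment, fetch it, parse it, update the crawldb
for round in 1 2 3; do
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/2* | tail -1`
    echo "round $round, segment: $segment"
    bin/nutch fetch $segment
    bin/nutch parse $segment
    bin/nutch updatedb crawl/crawldb $segment
done

# Build the link database from all fetched segments
bin/nutch invertlinks crawl/linkdb -dir crawl/segments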

download and extract solr 6.1.0

bin/solr start
bin/solr create -c foo

--> this created “foo” in $solr_home/server/solr/ with conf inside <--

Following your process above, I replaced the files in foo/conf with the ones downloaded from your links above.

bin/solr restart
bin/nutch solrindex http://127.0.0.1:8983/solr/foo crawl/crawldb -linkdb crawl/linkdb $s3 -filter -normalize
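If the replaced files in foo/conf have a problem, the symptom is typically the “Could not load conf for core” error quoted earlier in this thread. A quick way to confirm the core still loads after swapping the files and restarting, assuming the default port and the core name foo from above:

# Ask Solr's core admin whether the foo core is loaded
curl 'http://localhost:8983/solr/admin/cores?action=STATUS&core=foo&wt=json'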

--> documents could be queried successfully within Solr, or simply in the browser: http://localhost:8983/solr/foo/select?q=$querystring

Thanks again 🙂

camilotejeiro commented 7 years ago


Author: Zachary

Got stuck at step 6 and ended up doing this to get everything working:

1. Copy the conf/schema.xml file from Nutch into your Solr core conf folder, e.g.:
   cp /opt/apache-nutch-1.13/conf/schema.xml /opt/solr-6.5.1/server/solr/nutch_solr_data_core/conf/schema.xml
2. Restart Solr. If it’s throwing some errors along the lines of “Plugin init failure for…”, edit the new schema.xml and remove all instances of enablePositionIncrements="true", then restart Solr again (a sed one-liner for this is sketched after this list).
3. From the main Apache Nutch directory, run the following command:
   bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/nutch_solr_data_core crawl/crawldb/ -linkdb crawl/linkdb/ $s1
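Removing every enablePositionIncrements="true" attribute by hand gets tedious in a large schema; a sketch of doing it with sed, assuming the Solr 6.5.1 paths from the example above (GNU sed shown; on macOS use sed -i '' instead):

cd /opt/solr-6.5.1/server/solr/nutch_solr_data_core/conf/
# Keep a backup, then strip the attribute everywhere it appears
cp schema.xml schema.xml.bak
sed -i 's/ enablePositionIncrements="true"//g' schema.xml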

Notes: