configure http collector to work with Shards

Norconex / crawlers

Norconex Crawlers (or spiders) are flexible web and filesystem crawlers for collecting, parsing, and manipulating data from the web or filesystem to various data repositories such as search engines.

https://opensource.norconex.com/crawlers

Apache License 2.0


configure http collector to work with Shards #444

Closed SolSearch closed 6 years ago

SolSearch commented 6 years ago

Hello,

I have downloaded the HTTP Collector and it works great with a core. I need to search more than one website, depending on the user's selection. I understand that I need to create multiple shards and then include those shards in my query. I have started Solr in cloud mode and followed the script to create the shards. How can I use the Norconex HTTP Collector to index website documents? In the Norconex config file I have specified

http://localhost:8983/solr/gettingstarted_shard1_replica2/

but I don't see any documents indexed under gettingstarted_shard1_replica2.

I am using the same schema and solrconfig xml files as the one I used where I am able to index documents in a core.

SolSearch commented 6 years ago

http://localhost:8983/solr/gettingstarted_shard1_replica2/

SolSearch commented 6 years ago

I have placed http://localhost:8983/solr/gettingstarted_shard1_replica2/ under the committer tag

essiembre commented 6 years ago

If you are using Solr Cloud, you should reference your collection. Based on the URL sample you provided, it probably is http://localhost:8983/solr/gettingstarted. See if that makes a difference. If not, check the HTTP Collector logs and the Solr logs for potential errors.
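As a minimal sketch of the suggestion above, the following Python helper derives a collection-level URL from a replica core URL. The function name `collection_url` is hypothetical, and the core-name pattern (`<collection>_shard<N>_replica<M>`) is an assumption based on the URL quoted in this thread; other Solr versions may name replica cores differently.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed core-name pattern: <collection>_shard<N>_replica<M>, as in the URL
# from this thread. The optional "_<letter>" part is a guess at newer naming.
_CORE_NAME = re.compile(r"^(?P<collection>.+)_shard\d+_replica(?:_[a-z])?\d+$")


def collection_url(core_url: str) -> str:
    """Return the collection-level URL for a replica core URL.

    If the last path segment does not look like a replica core name,
    the input is returned unchanged.
    """
    parts = urlsplit(core_url)
    segments = parts.path.rstrip("/").split("/")
    match = _CORE_NAME.match(segments[-1])
    if match is None:
        return core_url
    # Swap the replica core name for the bare collection name.
    segments[-1] = match.group("collection")
    return urlunsplit(
        (parts.scheme, parts.netloc, "/".join(segments), parts.query, parts.fragment)
    )


if __name__ == "__main__":
    print(collection_url("http://localhost:8983/solr/gettingstarted_shard1_replica2/"))
    # prints http://localhost:8983/solr/gettingstarted
```

The resulting URL is what would go under the committer tag in place of the replica core URL.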

joettt commented 6 years ago

I have tried with http://localhost:8983/solr/gettingstarted but still don't see any documents in the collection.

I know it is more of a general Solr question, but I want to make sure I am doing it right. Perhaps I don't need to have multiple shards. I want to be able to search the contents of website 1 OR website 2 OR (website 1 AND website 2).

With a collection that has two shards, how do I know which shard to search if I only want to search website 1 and not website 2? As I understand it, the contents of website 1 could be in either of the two shards. I am wondering whether the better approach is to index the website 1 and website 2 documents (about 1 million documents) in one core and retrieve the documents from the two sites using the fq parameter, e.g., fq=webtype:web1.

essiembre commented 6 years ago

With Solr Cloud, you reference "collections", which can be spread across one or several shards. That should be transparent to you when you query. If you want to isolate each site, it is probably easier to simply create a different Solr collection for each, or to add a field to your existing collection that tells you what the source of the document is (you can filter on that). You can use a ConstantTagger from the Importer module to help with that.
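The filtering approach can be sketched in Python as follows. This only builds the query URL; it assumes a source field named `webtype` with the value `web1` (taken from the `fq=webtype:web1` example earlier in the thread) and a second value `web2` by analogy. The helper name `select_url` is hypothetical. Since each document comes from one site, searching "both sites" is expressed as an OR over the field values.

```python
from urllib.parse import urlencode


def select_url(collection_url, sources, query="*:*", field="webtype"):
    """Build a Solr /select URL restricted to the given source values.

    `field` and the values in `sources` are assumptions; they must match
    whatever constant field was added to the documents at crawl time.
    """
    base = collection_url.rstrip("/")
    params = [("q", query)]
    if sources:
        # One filter query matching any of the selected sites.
        values = " OR ".join(sources)
        params.append(("fq", f"{field}:({values})"))
    return f"{base}/select?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr/gettingstarted"
    print(select_url(base, ["web1"]))          # website 1 only
    print(select_url(base, ["web1", "web2"]))  # either website
```

Passing an empty list of sources leaves the filter out, so the query runs over the whole collection.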

What query do you issue to find documents? And can you confirm you have no errors in the HTTP Collector and Solr logs? Please attach them if you can.

SolSearch commented 6 years ago

If I create separate Solr collections, how can I specify that a query should include collection 1 and not collection 2? This is why I am probably better off putting the documents in one collection and then filtering the query by a field that identifies the source of the document, as you also suggested. In any case, I don't see any errors in the attached logs, but if I use the approach of filtering by source, the issue is no longer relevant.
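For the separate-collections alternative, SolrCloud accepts a `collection` request parameter listing the collections a query should span. The sketch below builds such a URL in Python; the collection names `web1` and `web2` and the helper name are hypothetical, and this was not tested against the setup described in this thread.

```python
from urllib.parse import urlencode


def multi_collection_select_url(solr_base, collections, query="*:*"):
    """Build a /select URL that queries only the listed collections.

    The request is sent to the first collection, and the `collection`
    parameter names every collection to include in the search.
    """
    if not collections:
        raise ValueError("at least one collection is required")
    params = {"q": query, "collection": ",".join(collections)}
    return f"{solr_base.rstrip('/')}/{collections[0]}/select?{urlencode(params)}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    print(multi_collection_select_url(base, ["web1"]))          # collection 1 only
    print(multi_collection_select_url(base, ["web1", "web2"]))  # both collections
```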

solr.log Norconex_32_Minimum_32_Test_32_Page.log

essiembre commented 6 years ago

Are you still having issues committing documents or can we close?

joettt commented 6 years ago

I am able to commit documents when I create a core. You can close the case, thanks.

On Tuesday, January 16, 2018, 9:25:48 PM EST, Pascal Essiembre notifications@github.com wrote:

Are you still having issues committing documents or can we close?
