apache / incubator-stormcrawler

A scalable, mature and versatile web crawler based on Apache Storm
https://stormcrawler.apache.org/
Apache License 2.0

Add support for shards in SOLR #620

Open jnioche opened 6 years ago

jnioche commented 6 years ago

Just as is done for ES, we could route the documents in the StatusUpdaterBolt based on the host name or IP. The spouts would then check that the number of spout instances equals the number of shards and filter their queries per shard accordingly.

At the moment, we can have only one instance of a spout.
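A minimal sketch of the idea, not the StormCrawler or Solr implementation: every name below (`ShardRoutingSketch`, `routingKey`, `shardFor`, `checkParallelism`, `ownedBy`) is hypothetical, and the plain `hashCode`-based assignment is a stand-in for whatever hashing Solr actually applies. It only illustrates that URLs sharing a host land on the same shard and that each spout instance can restrict itself to one shard.

```java
import java.net.URI;
import java.util.Locale;

public class ShardRoutingSketch {

    /** Routing key for a URL; assumed here to be the lower-cased host name. */
    static String routingKey(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /** Maps a routing key to a shard index in [0, numShards). Illustrative hash only. */
    static int shardFor(String key, int numShards) {
        return Math.floorMod(key.hashCode(), numShards);
    }

    /** Fails fast unless there is exactly one spout instance per shard. */
    static void checkParallelism(int numSpoutInstances, int numShards) {
        if (numSpoutInstances != numShards) {
            throw new IllegalStateException(
                "Expected " + numShards + " spout instances (one per shard) but found "
                    + numSpoutInstances);
        }
    }

    /** True if the spout instance with the given index is responsible for the URL. */
    static boolean ownedBy(int spoutIndex, String url, int numShards) {
        return shardFor(routingKey(url), numShards) == spoutIndex;
    }

    public static void main(String[] args) {
        int numShards = 10;
        checkParallelism(10, numShards);
        String url = "https://example.com/page.html";
        System.out.println(routingKey(url) + " -> shard " + shardFor(routingKey(url), numShards));
    }
}
```

With one spout instance per shard, each instance would only ever query its own shard, which is what would allow more than one spout to run without two instances emitting the same URL.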

jnioche commented 2 years ago

https://solr.apache.org/guide/solr/latest/deployment-guide/solrcloud-shards-indexing.html#document-routing
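The page linked above describes Solr's `compositeId` router, where a document ID of the form `shardKey!docId` is routed by the prefix before the `!`, and a query can be limited to the matching shard with the `_route_` parameter. Below is a rough sketch of how a host-based prefix could be built. The helper names are hypothetical, and using the URL as the document ID is an assumption about the status schema rather than something stated in this thread.

```java
public class CompositeIdSketch {

    /** Separator between routing prefix and document ID in compositeId routing. */
    static final String SEPARATOR = "!";

    /** Builds a routed ID "host!url" so that all URLs of one host share a shard. */
    static String routedId(String host, String url) {
        return host + SEPARATOR + url;
    }

    /** Value for the _route_ query parameter that targets the shard of one host. */
    static String routeParam(String host) {
        return host + SEPARATOR;
    }

    public static void main(String[] args) {
        System.out.println(routedId("example.com", "https://example.com/a"));
        System.out.println("_route_=" + routeParam("example.com"));
    }
}
```

One caveat for this approach: URLs can themselves contain `!`, which the router may interpret as an additional routing level, so a real implementation would need to check how such IDs are handled.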

mvolikas commented 4 months ago

This one is really interesting. Will it be up to the user to correctly create the status core with the same number of shards as the parallelism: *** used in the crawler.flux?

jnioche commented 4 months ago

> This one is really interesting. Will it be up to the user to correctly create the status core with the same number of shards as the parallelism: *** used in the crawler.flux?

Yes. We should set it to a reasonable default value (10?), but after that it is up to the user to manage it.

mvolikas commented 3 months ago

@jnioche can we merge branch 851 into main at this point? This way we can update / add tests along with the new functionality.

jnioche commented 3 months ago

Hasn't that been done in https://github.com/apache/incubator-stormcrawler/pull/1240? Is there more in that branch that hasn't been merged? If so, would you mind creating a PR from that branch? Thanks!

mvolikas commented 2 months ago

> Hasn't that been done in #1240? Is there more in that branch that hasn't been merged? If so, would you mind creating a PR from that branch? Thanks!

The changes from #1240 were merged into branch apache:851 but not into main. Should I open an additional PR from apache:851 to main?

jnioche commented 2 months ago

Hi @mvolikas, sorry about the delayed response; I have just returned from holiday.

> Should I open an additional PR from apache:851 to main?

Yes please, that would be great.