apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

Support joining in a distributed environment. [LUCENE-3759] #4832

Open asfimport opened 12 years ago

asfimport commented 12 years ago

Add two more methods in JoinUtil to support joining in a distributed manner.

With these two methods distributed joining can be supported following these steps:

  1. Retrieve from values from each shard
  2. Merge the retrieved from values.
  3. Create a TermsQuery based on the merged from terms and send this query to all shards.

Migrated from LUCENE-3759 by Martijn van Groningen (@martijnvg), 22 votes, updated Aug 07 2018

asfimport commented 12 years ago

Jason Rutherglen (migrated from JIRA)

+1 Nice, distributed join will be super useful.

asfimport commented 12 years ago

Alex Liu (migrated from JIRA)

is there any performance concern?

asfimport commented 11 years ago

Colin Bartolome (migrated from JIRA)

This definitely affects Solr 4.1 and would be very helpful. I might not be able to run with shards without being able to use join queries.

asfimport commented 11 years ago

Colin Bartolome (migrated from JIRA)

Would implementing this as a TermsQuery open us up to TooManyClauses exceptions?

asfimport commented 10 years ago

Jerry Russell (migrated from JIRA)

Is there any progress on this? This seems like a very important feature that is missing from SOLR at this point.

asfimport commented 10 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

no patches == no progress.

asfimport commented 9 years ago

Joe Szymanski (migrated from JIRA)

Is anyone currently working on this? I want this feature bad enough that I plan on implementing this, but don't want to duplicate work.

asfimport commented 9 years ago

Scott Blum (@dragonsinth) (migrated from JIRA)

Joe Szymanski did you start on this?

Everyone else, I would love to work on this, but I'll need some high-level guidance. It's an area I haven't worked in before.

asfimport commented 9 years ago

Scott Blum (@dragonsinth) (migrated from JIRA)

Stupid question? But is this obseleted by https://issues.apache.org/jira/browse/SOLR-7584 or is this dealing with something different?

asfimport commented 9 years ago

Scott Blum (@dragonsinth) (migrated from JIRA)

Ping? Anyone still care or know anything about this issue?

asfimport commented 9 years ago

Jerry Russell (migrated from JIRA)

I am still waiting for it - but only if it can perform reasonably well..:)

asfimport commented 8 years ago

Scott Blum (@dragonsinth) (migrated from JIRA)

Request for feedback / comments: https://github.com/fullstorydev/lucene-solr/commits/scottb/fulljoin

Basically, it's a drop-in replacement for JoinQParserPlugin, except instead of curating the "from" terms from the local index, it does a collection-wide facet query to generate the term list.

asfimport commented 8 years ago

Ashish Datta (migrated from JIRA)

Hi guys, Is this issue being addressed in a future release etc. ? In order that Solr/Lucene be able to horizontally shard and yet give a unified view to queries that need to access joined data, I think this will be a BIG hit ! I saw a similar thing in the Mongo system where a 'queryrouter' did the same job of sending parallel query requests to multiple servers with individual shards and returned a consistent result. Though the two tools are entirely different, if the data/facets distribution and shard keying is known, this does not seem unsurmountable in Lucene. Would be really interested and eager to provide a use case in an actual production scenario where the lack of this feature is causing some grief ! and increasing the query coding to compensate for it.

asfimport commented 8 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Scott is asking a pertinent question I think. I really do wonder how much of the use-case here will be satisfied by both the Streaming Aggregations (5.x) and ParallelSQL (6.0).

I'd really like to have the use-case laid out and show that at least most of the use-cases are served by distributed joins and not the ParallelSQL capabilities before putting too much effort here.

asfimport commented 8 years ago

Ashish Datta (migrated from JIRA)

Hello Erick, I would be glad to present a case for this if it helps. Let me know if it helps. If it does not sound like a useful use-case, perhaps I could use some other tool. Here's a quick overview of the use-case: The requirement I have is in analytics. Search results need to be exact and we're basically 'counting' things precisely, not approximating. The no. of facets is not large but their combinations are large in number(hence the strong case for Solr). The number of distinct data containers(collections) is small but their sizes are large and denormalizing or keeping data in single servers are not feasible options. Therefore joins are becoming inevitable as data grows and starts to need many servers to store it due to size constraints and computing efficiency. Right now, the only option I have is to use a glue language to collect the 'from' terms from the many 'shards' across servers, send queries with these terms to the 'to' collection shards on several servers again, apply rules to aggregate them centrally, manage timeouts and other artificial issues created by this data division and sent the aggregated data for visualisations or other processing. As you can see, the charm and pull of Lucene's speed is getting dampened by the unnecessary data complexity and dependence on programming in a glue language , recording the number and types of shards on each server and making queries to the right targets. Redundancy/failover is another pain to handle besides managing increasing servers.

Everything I have written is already possible and avaliable in Solr except that it's not on a distributed manner ?

Solr is a beautiful tool that can easily do everything I need if my data were not needed to be distributed across machine as in my case ! If I denormalize this kind of data, I might end up making it 3-4x it's size, which obviously I dont want to do. If Solr managed to take away this pain, it would be the ideal scalable solution for all search applications and analytic applications which have multiple large, data sets with limits to denormalization. In my case, I know the data very well and have a good grip on the combinations of facets needed to configure a distributed system if it just allowed joins with true sharding.

I really think that adding this will bring in lots of distributed computing use-cases into the ambit of Solr. There's no telling the amount of efforts it will save for people like me, and not have everybody devising the own distributed computing management scheme when a common one could solve it for all.

Let me know if this sounds like a reasonable use-case. Besides my own use-case, I'm sure there would be a lot of people who probably dont use solr due to this missing feature.

PS : Sorry for getting carried away and the long mail ;-(

asfimport commented 6 years ago

jyoti Tiwari (migrated from JIRA)

Hello Ashish Dutta,

Please look into my issue which i am facing on solr4 while making join query across sharded collection on same node.

Solr4 on cloud joining across two sharded core i.e engineeringlogs_shard1_replica1 on machine 1 and micrologs_shard1_replica1 on machine 1

machine1 - engineeringlogs_shard1_replica1 (A), micrologs_shard1_replica1(B) machine 2- engineeringlogs_shard2_replica1(A1) , micrologs_shard2_replica1(B1)

query time join on engineeringlogs_shard1_replica1 (A):

fq: "{!join from=Log_type to=Log_type fromIndex=micrologs_shard1_replica1}SerialNumber:\"000123456789\""

want to perform join across A and A1 on same machine 1,but it is not working fine. error is:

"error": { "metadata": [ "error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException" ], "msg": "Cross-core join: no such core micrologs_shard1_replica1", "code": 400 }

Please guide me how should i proceed so that this join query will work fine for sharded collection for solr4 cloud.

asfimport commented 6 years ago

jyoti Tiwari (migrated from JIRA)

Hello Ashish Dutta,

Please look into my issue which i am facing on solr4 while making join query across sharded collection on same node.

Solr4 on cloud joining across two sharded core i.e engineeringlogs_shard1_replica1 on machine 1 and micrologs_shard1_replica1 on machine 1

machine1 - engineeringlogs_shard1_replica1 (A), micrologs_shard1_replica1(B) machine 2- engineeringlogs_shard2_replica1(A1) , micrologs_shard2_replica1(B1)

query time join on engineeringlogs_shard1_replica1 (A):

fq: "{!join from=Log_type to=Log_type fromIndex=micrologs_shard1_replica1}SerialNumber:\"000123456789\""

want to perform join across A and A1 on same machine 1,but it is not working fine. error is:

"error": { "metadata": [ "error-class", "org.apache.solr.common.SolrException", "root-error-class", "org.apache.solr.common.SolrException" ], "msg": "Cross-core join: no such core micrologs_shard1_replica1", "code": 400 }

Please guide me how should i proceed so that this join query will work fine for sharded collection for solr4 cloud.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Jhoti:

This is not an appropriate use of Solr's JIRA, the issue tracker is not a support portal. We try to reserve the JIRA system for code issues rather than usage questions.

Please ask the question here: solr-user@lucene.apache.org, see: http://lucene.apache.org/solr/community.html#mailing-lists-irc

It's extremely unlikely that any code changes will be considered for Solr 4, so the user's list is your best option.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

I removed jyoti Tiwari as a watcher (along with "boxbe-notifications at boxbe.com") because their email system was auto-posting here. I also deleted the posts. Hopefully this will stop the cascading email posts.

asfimport commented 6 years ago

boxbe-notifications@boxbe.com (migrated from JIRA)

Hello Steve Rowe (JIRA),

Your message about "[jira] Steve Rowe mentioned you on LUCENE-3759 (JIRA) (JIRA)" has been waitlisted.

Please add yourself to my Boxbe Guest List so your messages will go to my Inbox.

Click the link below to be added: https://www.boxbe.com/crs?tc_serial=41904072875&tc_rand=1482900733&utm_source=stf&utm_medium=email&utm_campaign=CN_STDW&utm_content=002

Thank you, jyotitiwari609@gmail.com

Powered by Boxbe – "End Email Overload" Visit http://www.boxbe.com/how-it-works?tc_serial=41904072875&tc_rand=1482900733&utm_source=stf&utm_medium=email&utm_campaign=CN_STDW&utm_content=002

Boxbe, Inc. | 65 Broadway, Suite 601 | New York, NY 10006 Privacy Policy: http://www.boxbe.com/privacy | Unsubscribe: http://www.boxbe.com/unsubscribe