SearchScale / dataimporthandler

Repository for DIH (Data Import Handler)
Apache License 2.0
68 stars, 48 forks

Serverless DIH #62

Closed mkhludnev closed 7 months ago

mkhludnev commented 11 months ago

Yeah, the title is just clickbait.

I just learned about the stateless coordinator node and decided that it might be useful for DIH.

TL;DR

Introduce SolrCloudWriter with a destination-collection parameter.

Context

We can't migrate away from DIH, so we are attempting to run it in cloud mode (ZooKeeper-distributed Solr).

Problem

DIH runs on one of the replicas, overloads it, and makes the cluster unstable.

Suggestion

  1. Provision a coordinator-only node as described in the guide.
  2. Hit it with a search request for the collection you need to index (let's call it A); this brings up a single-shard .sys.COORDINATOR-COLL-configset-A collection.
  3. Deploy DIH with this fix applied into this collection (see README.MD). Note that this also deploys DIH onto A, since the two collections share the configset.
  4. curl /.sys.COORDINATOR-COLL-configset-A/dataimport?command=full-import&writerImpl=SolrCloudWriter&destination-collection=A&..
  5. Voila! You have DIH running on a coordinator-only (stateless, I might even say serverless) node, streaming updates into the specified collection.
  6. You may even drop the coordinator node now (just to prove my point about serverlessness).
  7. I suppose it should even be fairly performant, since it uses SolrCmdDistributor, which streams updates in parallel.
  8. For now it relies heavily on schema sharing via ZK, since both collections use the same configset, and avoiding that won't be easy.
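Steps 2 and 4 can be sketched as two requests against the coordinator node. This is only an illustration: the base URL `http://localhost:8983/solr`, the `q=*:*` query, and the helper names `select_url` and `import_url` are assumptions, not part of the PR. The script prints the `curl` commands instead of running them. Each URL is single-quoted so the shell does not treat `&` as a background operator. The parameter is spelled `destination-collection` here, as in step 4; later in this thread it is renamed to `destinationCollection`.

```shell
#!/bin/sh
# Assumptions (not from the thread): the coordinator base URL below is a
# placeholder; the collection names are the ones used in the steps above.
COORDINATOR_URL="${COORDINATOR_URL:-http://localhost:8983/solr}"
DEST_COLLECTION="A"
SYS_COLLECTION=".sys.COORDINATOR-COLL-configset-A"

# Step 2: any search for A sent to the coordinator node.
# Arguments: base URL, collection name.
select_url() {
  printf '%s/%s/select?q=*:*' "$1" "$2"
}

# Step 4: full-import on the synthetic collection, writing into the destination.
# Arguments: base URL, synthetic collection name, destination collection name.
import_url() {
  printf '%s/%s/dataimport?command=full-import&writerImpl=SolrCloudWriter&destination-collection=%s' "$1" "$2" "$3"
}

# Print the commands rather than executing them, so the sketch is safe to run
# without a cluster.
echo "curl '$(select_url "$COORDINATOR_URL" "$DEST_COLLECTION")'"
echo "curl '$(import_url "$COORDINATOR_URL" "$SYS_COLLECTION" "$DEST_COLLECTION")'"
```

To try it against a real cluster, set `COORDINATOR_URL` to the coordinator node and paste the printed commands; any further DIH parameters (the elided `&..` in step 4) would need to be appended by hand.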

I'm leaving it as a draft PR until someone shares some thoughts about it.

epugh commented 11 months ago

I think this is super fascinating ;-). I wonder if having a .bats test to show all the steps would be interesting? Like I did for https://github.com/apache/solr/pull/1999? Also, if you used a CloudSolrClient instead of Http2SolrClient, would that simplify some of the logic around slices and getting leaders?

mkhludnev commented 11 months ago

Thanks, @epugh. I couldn't find a sample of constructing CloudSolrClient inside of Solr. It's used externally (not a surprise). Regarding bats, the build scripts in this project are far from mature.

epugh commented 11 months ago

So, here is another idea... Could you provide a demo.sh script that walks people through all the steps to use this? I think a small bit of the challenge is that this uses a number of both new AND cool features of Solr... I'd love to just do "demo.sh" and see all the stuff happening...

epugh commented 11 months ago

> Thanks, @epugh. I couldn't find a sample of constructing CloudSolrClient inside of Solr. It's used externally (not a surprise). Regarding bats, the build scripts in this project are far from mature.

Also, CloudSolrClient usage could totally be an optimization later.. What's key here is making DIH work better with Solr ;-).

noblepaul commented 8 months ago

@mkhludnev, I'm thinking of merging this. Are there any loose ends that need to be tied up?

mkhludnev commented 7 months ago

Oh, cool. I'd at least like to cover all methods with tests. I'll look at it next week.

mkhludnev commented 7 months ago

@noblepaul, let's settle the naming. For now we have /dataimport?command=full-import&writerImpl=SolrCloudWriter&destination-collection=data

  1. Are there any more elegant approaches than passing a dedicated SolrCloudWriter (added by this PR)?
  2. I'm bothered by destination-collection. Kebab-case or camelCase?

mkhludnev commented 7 months ago

@noblepaul I'm done with the code. I'm open to suggestions regarding naming.

mkhludnev commented 7 months ago

@noblepaul if you don't like using SolrCmdDistributor, I can add a SolrCloudSyncWriter that sends docs one by one using SolrClient, although I'm not sure how to pick a leader node.

noblepaul commented 7 months ago

We use camel case everywhere. For instance:

destinationCollection instead of destination-collection

mkhludnev commented 7 months ago

destinationCollection

done

noblepaul commented 7 months ago

Thanks Michael

mkhludnev commented 7 months ago

Oh cool. Thanks Noble. Looking forward to release.