EnquistLab / RTNRS

R package for the (plant) Taxonomic Name Resolution Service
https://bien.nceas.ucsb.edu/bien/tools/tnrs/

Best strategy for matching 100k names #8

Closed Rekyt closed 1 year ago

Rekyt commented 2 years ago

Hi @bmaitner 👋

Me again for a question that I couldn't answer myself.

I hadn't noticed that TNRS() proceeds in chunks of 1000 names (as seen in https://github.com/EnquistLab/RTNRS/blob/8958ab189bb433846967d6d64a2c905776cb0543/R/TNRS.R#L101); maybe this could be mentioned in the documentation/vignette.

Also, what would be the best strategy for matching 100k names with TNRS? Should I split the names myself into chunks of a few thousand beforehand and then make parallel calls to TNRS(), or is there a risk of flooding the server?
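For reference, the kind of manual chunking I have in mind looks roughly like this (a sketch only; the `TNRS()` call in the usage comment is illustrative, and `chunk_names()` is just a hypothetical helper):

```r
# Split a vector of names into chunks of at most `chunk_size` elements.
# Base R only; no package dependencies.
chunk_names <- function(x, chunk_size = 5000) {
  split(x, ceiling(seq_along(x) / chunk_size))
}

# Usage (not run): resolve each chunk in series and bind the results.
# results  <- lapply(chunk_names(my_100k_names), TNRS::TNRS)
# resolved <- do.call(rbind, results)
```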

I know it's probably not the most common use case, but these days I'm very often working with data on the order of 100k names, so a definitive answer to this question would be really helpful :)

bmaitner commented 2 years ago

Hey @Rekyt

Great point, I'll add that to the documentation/vignette. It's actually really useful to have someone like you putting the TNRS package to some serious challenges, so keep the questions/comments coming!

Re: best practices on such queries, that's a good question. Since this depends on how the API side of things handles queries, maybe @ojalaquellueva has some thoughts. Brad, for very large queries (i.e., hundreds of thousands of names), what is the best approach? Currently we break large queries into chunks of 1000 names and run the chunks one at a time. As Matthias suggests, we could also implement a parallel approach where we send multiple chunks at once, but I suppose this could be counterproductive if concurrent queries slow things down.

bmaitner commented 2 years ago

@ojalaquellueva alternatively, could we add a "batches" argument, as we do for the GNRS? Then it could be handled on the API side of things

ojalaquellueva commented 2 years ago

@Rekyt I'm not sure I understand the issue. The service already runs in parallel, with the number of batches set server-side. There is a record limit for access via the API, but that is set to 5000 records, not 1000. So you should be able to loop through 5000 records at a time. @bmaitner, is some additional setting on your end lowering the limit to 1000?

bmaitner commented 2 years ago

Hey @ojalaquellueva: I've currently got it set to loop through in batches of 1000. I think this was per some earlier discussion we had, but it sounds like you've modified the API setting since then, so I'll up this to 5000. The question is then whether it would be useful to parallelize on the R end of things (sending multiple batches of 5000 records at once), or whether it's better to run things in series on the R side.

ojalaquellueva commented 2 years ago

@bmaitner: I'm reluctant to fiddle with the record limit on the API end, at least in production. If you want to lift the burden of looping from the user, it would be better to do it on your end in R. That said, I'm willing to temporarily raise or even remove the record limit by request on the public development instance, if that would help @Rekyt. It would actually be interesting to see what sort of performance we get by hitting the API via R without the limit.

ojalaquellueva commented 2 years ago

@bmaitner: BTW, there already is an API parameter, `batches`. It takes a positive integer < the number of names submitted. I haven't done a lot of performance tests, but you generally don't want to go lower than 100 names per batch. Chunk up your total names accordingly. Be careful with this parameter; it is not subject to much error checking and could crash Makeflow under certain conditions (if people start using it I may need to filter it more carefully). Myself, I generally leave batch parameterization to the application.
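To illustrate the constraints above (a hypothetical helper, not part of the API): `batches` must be a positive integer less than the number of names submitted, and each batch should keep at least ~100 names. The largest value satisfying both can be sketched as:

```r
# Largest valid `batches` value for n_names submitted names, keeping
# at least `min_per_batch` names in each batch.
max_batches <- function(n_names, min_per_batch = 100) {
  max(1L, min(n_names - 1L, n_names %/% min_per_batch))
}
```

So, for example, 100k names could be split into at most 1000 batches before dropping below 100 names per batch.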

ojalaquellueva commented 2 years ago

@bmaitner: sorry I didn't answer your question about parallelizing on the R end. My initial response is "please don't". Makeflow already does an efficient job of distributing batches among the available cores. A large number of additional, concurrent requests will simply increase the queue of waiting batches. Plus, I worry about MySQL's ability to handle a large increase in concurrent requests without careful tuning on my part.

That said, I'd be willing to explore that option some time by running some performance tests. We'd need to plan carefully to get useful results. A lot of potential application and server settings are involved.

bmaitner commented 2 years ago

Thanks @ojalaquellueva. Sounds good. I'll increase the batch size on the R end of things from 1000 to 5000, which should speed up processing of large queries at least somewhat.

@Rekyt Let us know if you'd be interested in some of the options @ojalaquellueva discussed in the thread. If you're likely to be running such large queries fairly regularly, it may be worth trying to figure out a way to optimize performance.

Rekyt commented 2 years ago

Thank you both @ojalaquellueva and @bmaitner!

I shouldn't need to run these queries too regularly, but I'm happy to have learned more about the inner workings of the API. Though it's not the most common use case, I would definitely recommend writing a bit of documentation on how queries are processed (parallelized on the server in chunks of 5000) and noting that there is no need to parallelize in R.

If you'd be interested in having some data for performance testing, I'd be happy to provide it!

bmaitner commented 1 year ago

I've added some additional text following @Rekyt's suggestion to note that parallelization is handled on the API side of things and that there is no need to parallelize in R (doing so might actually slow things down...). Thanks as always, @Rekyt!