alexpiper / taxreturn

An R package for creating taxonomic reference databases for metabarcoding studies
GNU General Public License v3.0

Error: Service Unavailable (HTTP 503) when using a list of taxa and fetchSeqs with BOLD #20

Open morien opened 3 years ago

morien commented 3 years ago

This is more of a note for the documentation, or a suggestion for something to add to the vignette. Based on my experimentation with this command: fetchSeqs(clade_list[1:N], database="bold", out.dir="bold", marker="COI-5P", output = "gb-binom", compress=TRUE, force=TRUE, multithread = TRUE)

the HTTP 503 error pops up somewhere north of N=300. Because that's such an odd number, I wonder whether it is directly related to N or to the number of returned sequences. I'm not familiar enough with BOLD to know how this works, but this is an FYI for others trying to fetch BOLD sequences using large query lists.

alexpiper commented 3 years ago

Sorry I missed these previous issues. There is definitely a problem with overloading the BOLD servers when requesting a lot of data. However, it's not clear exactly how much is 'too much', as there doesn't seem to be any documentation on BOLD's side about this.

From the testing I've done, I believe this error is related to the total number of records requested per query. There's currently a check in the fetchSeqs function to see if the number of records for a query exceeds 100,000, and if so the query is split into smaller queries at lower taxonomic ranks (e.g. a search for "Insecta" will be split into "Diptera", "Coleoptera", etc).
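
To make the splitting idea concrete, here is a rough base R sketch of it. This is not the actual code inside fetchSeqs: the function name split_query, the toy taxonomy in children, and the record counts are all made up for illustration.

```r
# Toy taxonomy and record counts; all names and numbers are hypothetical.
children <- list(
  Insecta = c("Diptera", "Coleoptera"),
  Diptera = c("Culicidae", "Drosophilidae")
)
record_counts <- c(
  Insecta = 250000, Diptera = 150000, Coleoptera = 100000,
  Culicidae = 90000, Drosophilidae = 60000
)

# Recursively replace any taxon whose record count exceeds `chunksize`
# with its child taxa. Taxa with no listed children are returned as-is,
# even if they are still over the limit.
split_query <- function(taxon, chunksize = 100000) {
  kids <- children[[taxon]]
  if (record_counts[[taxon]] <= chunksize || is.null(kids)) {
    return(taxon)
  }
  unlist(lapply(kids, split_query, chunksize = chunksize), use.names = FALSE)
}

split_query("Insecta")
```

In this toy data "Insecta" and "Diptera" are both over the limit, so the single query ends up as three smaller ones ("Culicidae", "Drosophilidae" and "Coleoptera").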

While a limit of 100,000 records per query worked for me when downloading all Insecta sequences, it was a pretty arbitrary choice. So I've now made this maximum query size parameter editable using the chunksize argument to fetchSeqs.

You could try setting chunksize to a lower value (e.g. 50,000) and see if you still get the error. But I wouldn't go too low, as it adds a fair bit of additional runtime at the start to check the number of records per taxon, and downloading the actual data is also slower when it takes more queries.
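
For example, something along these lines should work. The wrapper name fetch_bold is just for illustration, the other arguments are copied from the command in the original report, and it assumes an installed version of taxreturn that already includes the chunksize argument.

```r
# Hypothetical convenience wrapper around the call from the original report,
# with the per-query record limit exposed so it is easy to experiment with.
fetch_bold <- function(taxa, chunksize = 50000) {
  taxreturn::fetchSeqs(
    taxa,
    database = "bold",
    out.dir = "bold",
    marker = "COI-5P",
    output = "gb-binom",
    compress = TRUE,
    force = TRUE,
    multithread = TRUE,
    chunksize = chunksize  # maximum number of records per BOLD query
  )
}
```

Calling fetch_bold(clade_list) would then run the same download with the lower limit, and fetch_bold(clade_list, chunksize = 100000) would match the previous default.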

morien commented 3 years ago

That's great, thanks for making that change. I'll test it out and see if it changes things on my end.