Closed cschu closed 7 months ago
Maybe the download step could be split by data source, so that e.g. the nucleotide
sequences will not have to be redownloaded if a genebank
download crashes.
That's a new one for sure. That can only happen if there are "na" entries in the "ftp_path" column from the Genbank summary table which is odd. If that is indeed the problem it would be better to filter them directly from the summary table before the matching. Let me check if that is the problem and deploy a fix.
Should be fixed now. Let me know if it works.
Regarding separating the downloads. We could do that, but would have to ensure that the nucleotide and genbank downloads never run in parallel because of the NCBI download rate limits. If you have an API key the nucleotide downloads are pretty fast, so I never really bothered before.
Should be fixed now. Let me know if it works.
Unfortunately not. This leads to match_taxids
crashing with
Error in if (matches[, max(score)] > 1) { :
missing value where TRUE/FALSE needed
Calls: ordered_match -> [ -> [.data.table -> find_taxon
Execution halted
My bad, try now. Sorry for the back and forth. I don't have access to a server until March so I can't test fixes myself. I will get back to building an updated DB then and will also fix the pipeline again.
Yea, that did the trick, cheers!
re updated DB/pipeline fixes - I'm having issues with the kraken build process, but for those I'll open another ticket.
Nice!
Hi,
in some cases, the
matches.csv
generated bymatch_taxids
contains genebank records with the URL column set tona
, e.g.4513,Eukaryota,Streptophyta,Magnoliopsida,Poales,Poaceae,Hordeum,Hordeum vulgare,GCA_949782835.1,genbank,4513,na,species
6550,Eukaryota,Mollusca,Bivalvia,Mytilida,Mytilidae,Mytilus,Mytilus edulis,GCA_963676595.2,genbank,6550,na,species
This can happen with different records at different times, e.g. the
H. vulgare
record appeared in my most recent run only, whereas theM. edulis
record was already set tona
in a previous run from last night.Now, these
na
URLs will causedownload_sequences
to crash with the following messages:I also noticed that the
Error in rbindlist(dls) :
message will appear when a sequences fails to download otherwise, e.g. whenrsync
is not available.A solution to suppress the crash for the URL=
na
records is to add a check to line 125 indownload.R
.This will then ignore such records. However, I am wondering whether this is a good solution as this will change the composition and thus the integrity of the database and make the results less reproducible.
Thanks, Christian