Gibbons-Lab / medi

Metagenomic Estimation of Dietary Intake and Content.
Apache License 2.0
8 stars 4 forks source link

download_sequences crashes with "na" URLs #3

Closed cschu closed 7 months ago

cschu commented 7 months ago

Hi,

in some cases, the matches.csv generated by match_taxids contains genebank records with the URL column set to na, e.g.

4513,Eukaryota,Streptophyta,Magnoliopsida,Poales,Poaceae,Hordeum,Hordeum vulgare,GCA_949782835.1,genbank,4513,na,species 6550,Eukaryota,Mollusca,Bivalvia,Mytilida,Mytilidae,Mytilus,Mytilus edulis,GCA_963676595.2,genbank,6550,na,species

This can happen with different records at different times, e.g. the H. vulgare record appeared in my most recent run only, whereas the M. edulis record was already set to na in a previous run from last night.

Now, these na URLs will cause download_sequences to crash with the following messages:

Warning message:
In parallel::mclapply(gb[, unique(id)], function(i) download_genome(gb[id ==  :
  scheduled core 7 encountered error in user code, all values of the job will be affected
Error in rbindlist(dls) :
  Item 7 of input is not a data.frame, data.table or list
Execution halted

I also noticed that the Error in rbindlist(dls) : message will appear when a sequences fails to download otherwise, e.g. when rsync is not available.

A solution to suppress the crash for the URL=na records is to add a check to line 125 in download.R.

gb <- matches[db == "genbank"][url != "na"]

This will then ignore such records. However, I am wondering whether this is a good solution as this will change the composition and thus the integrity of the database and make the results less reproducible.

Thanks, Christian

cschu commented 7 months ago

Maybe the download step could be split by data source, so that e.g. the nucleotide sequences will not have to be redownloaded if a genebank download crashes.

cdiener commented 7 months ago

That's a new one for sure. That can only happen if there are "na" entries in the "ftp_path" column from the Genbank summary table which is odd. If that is indeed the problem it would be better to filter them directly from the summary table before the matching. Let me check if that is the problem and deploy a fix.

cdiener commented 7 months ago

Should be fixed now. Let me know if it works.

Regarding separating the downloads. We could do that, but would have to ensure that the nucleotide and genbank downloads never run in parallel because of the NCBI download rate limits. If you have an API key the nucleotide downloads are pretty fast, so I never really bothered before.

cschu commented 7 months ago

Should be fixed now. Let me know if it works.

Unfortunately not. This leads to match_taxids crashing with

Error in if (matches[, max(score)] > 1) { :
  missing value where TRUE/FALSE needed
Calls: ordered_match -> [ -> [.data.table -> find_taxon
Execution halted
cdiener commented 7 months ago

My bad, try now. Sorry for the back and forth. I don't have access to a server until March so I can't test fixes myself. I will get back to building an updated DB then and will also fix the pipeline again.

cschu commented 7 months ago

Yea, that did the trick, cheers!

re updated DB/pipeline fixes - I'm having issues with the kraken build process, but for those I'll open another ticket.

cdiener commented 7 months ago

Nice!