DiltheyLab / MetaMaps

Long-read metagenomic analysis

Error when creating database #41

Open LankyCyril opened 4 years ago

LankyCyril commented 4 years ago

Hi. I ran downloadRefSeq.pl with the following command: downloadRefSeq.pl --seqencesOutDirectory data/metamaps-db/refseq --taxonomyOutDirectory data/metamaps-db/taxonomy. After about two days of downloading data and printing progress output, it failed with "Cannot change working directory into assembly path na na: No such file or directory" and no other explanation. It had successfully processed all bacterial genomes but only got through 5 out of 323 fungal genomes. Looking into the data/metamaps-db/refseq/fungi directory, I see only six subdirectories for six species, while assembly_summary.txt lists many more. I have about 20TB of free disk space, so running out of space can't be the cause.

Does it mean that some previous data retrieval steps failed? Is there a way to safeguard against this? Or fix it and resume from where it left off?
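One way to check whether the "na" in the error message comes from the summary file itself is to count the rows whose FTP path field is "na". This is only a rough sketch: count_na_paths is a made-up helper name, and it assumes assembly_summary.txt is tab-separated with ftp_path in column 20, which you should confirm against the header line of your own copy.

```shell
#!/usr/bin/env bash
# Count data rows of an assembly_summary.txt whose ftp_path field is "na".
# Assumptions: tab-separated file, comment/header lines start with '#',
# and ftp_path is column 20 (verify against the header of your copy).
count_na_paths() {
    awk -F '\t' '!/^#/ && $20 == "na" { n++ } END { print n + 0 }' "$1"
}

# Example call; the path is hypothetical, adjust it to where your file lives:
# count_na_paths data/metamaps-db/refseq/fungi/assembly_summary.txt
```

A non-zero count would suggest that some entries have no download path at all, rather than an earlier retrieval step having failed.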

JanMoat commented 2 years ago

I fixed the error by changing ftp to https in one line of downloadRefSeq.pl.

Original: (my $assembly_path_FTP = $assembly_path_fullURL) =~ s/ftp:\/\/ftp.ncbi.nlm.nih.gov//g;

New: (my $assembly_path_FTP = $assembly_path_fullURL) =~ s/https:\/\/ftp.ncbi.nlm.nih.gov//g;
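To illustrate why the scheme matters: the substitution only strips the host prefix when the URL starts with the scheme it names, so a URL with the other scheme passes through unchanged. The shell sketch below strips the prefix for either scheme. strip_ncbi_prefix is a hypothetical helper used only for demonstration and is not part of MetaMaps, and the example URL path is made up.

```shell
#!/usr/bin/env bash
# Strip the NCBI host prefix from an assembly URL, accepting ftp:// or https://.
# strip_ncbi_prefix is a hypothetical name for illustration, not MetaMaps code.
strip_ncbi_prefix() {
    printf '%s\n' "$1" | sed -E 's#^(ftp|https)://ftp\.ncbi\.nlm\.nih\.gov##'
}

# Both forms reduce to the same server-side path:
strip_ncbi_prefix 'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/EXAMPLE'
strip_ncbi_prefix 'https://ftp.ncbi.nlm.nih.gov/genomes/all/EXAMPLE'
```

The same alternation could probably be used in the Perl substitution itself, so that the script keeps working whichever scheme the summary file uses, but I have not tested that against downloadRefSeq.pl.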

There's a similar known problem and fix in Kraken2.

srusher commented 5 months ago

I added a conditional statement that skips to the next species if $assembly_path_fullURL eq "na"; that value is what causes the error to be thrown. I used the following sed command to insert the logic:

sed -i 's|# last SPECIES if($downloaded_assemblies > 100);|if($assembly_path_fullURL eq "na"){\n\t\t\t\tnext SPECIES; \n\t\t\t}\n|g' ./downloadRefSeq.pl

This will replace this comment line # last SPECIES if($downloaded_assemblies > 100); with the following if statement:

if($assembly_path_fullURL eq "na"){ next SPECIES; }

Keep in mind that if a future MetaMaps update removes the # last SPECIES if($downloaded_assemblies > 100); comment, this sed command will no longer match and the fix won't be applied.
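Since sed exits successfully even when nothing matches, a small wrapper can make that case visible. The sketch below checks for the anchor comment first and only then runs the same sed expression. patch_skip_na is a hypothetical helper name, and sed -i without a backup suffix assumes GNU sed.

```shell
#!/usr/bin/env bash
# Apply the "skip na" patch only if the anchor comment is still in the script.
# patch_skip_na is a hypothetical helper name; the sed expression is the one
# quoted earlier in this thread. Assumes GNU sed (for -i, \n and \t).
patch_skip_na() {
    local script="$1"
    local anchor='# last SPECIES if($downloaded_assemblies > 100);'
    if ! grep -qF "$anchor" "$script"; then
        echo "anchor comment not found in $script; nothing patched" >&2
        return 1
    fi
    sed -i 's|# last SPECIES if($downloaded_assemblies > 100);|if($assembly_path_fullURL eq "na"){\n\t\t\t\tnext SPECIES; \n\t\t\t}\n|g' "$script"
}

# Example call:
# patch_skip_na ./downloadRefSeq.pl
```

Because the patch replaces the anchor comment, running the wrapper a second time reports that the anchor is missing instead of silently doing nothing.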