DiltheyLab / MetaMaps

Long-read metagenomic analysis

Error downloading RefSeq - provide prebuilt db for RefSeq archaea, bacteria, fungi? #59

Open dportik opened 3 years ago

dportik commented 3 years ago

Hello, I have been trying to create a database based on RefSeq archaea, bacteria, and fungi using the command below:

downloadRefSeq.pl --DB refseq --seqencesOutDirectory seqs --taxonomyOutDirectory taxonomy --targetBranches archaea,bacteria,fungi --skipIncompleteGenomes 1

I have encountered two issues so far. The first error occurred sporadically while unpacking taxdump.tar.gz. The error message included tar: Unexpected EOF in archive. I am running this on an HPC cluster and suspect the cause was file system latency. I worked around it by adding sleep (20); at line 78 of downloadRefSeq.pl, which may be helpful to others.
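For readers who prefer not to rely on a fixed delay, the following is a minimal sketch of an alternative check, run separately from downloadRefSeq.pl: it polls a gzip archive until the whole stream decompresses cleanly. The function names, the number of attempts, and the delay are hypothetical, and it assumes the truncation is transient (as suspected above) rather than the result of a failed download.

```python
import gzip
import time
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the whole gzip stream at `path` decompresses cleanly."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so that truncation and CRC errors surface.
            while handle.read(chunk_size):
                pass
        return True
    except (EOFError, OSError, zlib.error):
        # EOFError: truncated stream; OSError: missing file or bad gzip data.
        return False


def wait_until_intact(path, attempts=10, delay_seconds=2.0):
    """Poll `path` until it passes the integrity check or attempts run out."""
    for _ in range(attempts):
        if gzip_is_intact(path):
            return True
        time.sleep(delay_seconds)
    return False
```

Calling wait_until_intact on the downloaded taxdump.tar.gz (wherever it was written) before unpacking would distinguish a file that is still being flushed from one that is permanently truncated and needs to be downloaded again.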

The second issue is more problematic and occurs when archaea finishes and bacteria begins:

Processing 35767 entries
Field number mismatch in file seqs/bacteria/assembly_summary.txt - 21 / 2 at downloadRefSeq.pl line 178, <ASMSUM> line 204072.

Note that my script has an additional line inserted, so this corresponds to line 177 of the original script. I am unsure what this is related to, and cannot figure out a solution.
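One way to narrow this down is to look at the offending line directly. The sketch below is a hypothetical diagnostic, not part of MetaMaps: it assumes the file is tab-separated, that lines starting with # are comments or headers, and that the first data line has the correct number of fields. It reports every data line whose field count differs from that reference.

```python
def find_field_mismatches(path, delimiter="\t"):
    """Return (expected_count, [(line_number, field_count), ...]).

    The expected count is taken from the first data line, which is an
    assumption; comment lines (starting with '#') and blank lines are skipped.
    """
    expected = None
    mismatches = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line_number, line in enumerate(handle, start=1):
            if line.startswith("#") or not line.strip():
                continue
            count = len(line.rstrip("\n").split(delimiter))
            if expected is None:
                expected = count
            elif count != expected:
                mismatches.append((line_number, count))
    return expected, mismatches


if __name__ == "__main__":
    import sys

    expected, bad_lines = find_field_mismatches(sys.argv[1])
    print(f"reference field count: {expected}")
    for line_number, count in bad_lines:
        print(f"line {line_number}: {count} fields")
```

Run against seqs/bacteria/assembly_summary.txt, this would show whether line 204072 is the only short line. A single short line at the end of the file would point to an incomplete download of that file, but that is only one possible explanation.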

More importantly, I am not confident that, even if this problem is solved, I can successfully build this database through the other required steps. There appear to be other unsolved database issues posted here by other users (particularly #49). As the database building steps are time-consuming, I am reluctant to continue the effort unless the process becomes more robust.

Is it possible to host a prebuilt metamaps database for RefSeq archaea, bacteria, and fungi? I imagine this is the database most of your users will be interested in using for their analyses. Using the mini database is not particularly helpful, as most other taxonomic profilers offer access to very large databases (NCBI nt/nr, multiple RefSeq branches). This could solve at least some of the ongoing database issues, and would be a valuable resource for your user-base.

tim488 commented 3 years ago

@AlexanderDilthey

wchow commented 3 years ago

I am bumping this, as a prebuilt RefSeq database would be excellent to have.

molbio7 commented 2 years ago

I encountered a similar issue and have been unsuccessful so far in generating a RefSeq database. A prebuilt RefSeq database would be very helpful to many MetaMaps users.