AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

KeyError with "NODE_X_length_X" contig names #49

Open alexmsalmeida opened 3 years ago

alexmsalmeida commented 3 years ago

Hi,

I have a set of assemblies generated from metaSPAdes, and am getting a "KeyError" for only some of them when running VIBRANT. The contig names do not have any special characters or spaces. These are being run as part of a snakemake pipeline, so each assembly is processed in a separate folder. I am using the latest version (1.2.1).

You can find an example of an assembly that is failing here: http://ftp.ebi.ac.uk/pub/databases/metagenomics/genome_sets/viral_test.fa

For this assembly I get the following error:

Traceback (most recent call last):
  File "/hps/nobackup2/production/metagenomics/aalmeida/scripts/EMBL-EBI/snakemake_wfs/virsearch/.snakemake/conda/e7b417ea76be6dfa5c14e269d79f0482/bin/VIBRANT_run.py", line 599, in <module>
    genbank.write('//\nLOCUS       ' + str(genbank_database[n]) + '              ' + str(length_database[genbank_database[n]]) + ' bp   DNA  ' + str(form) + '   VRL ' + str(date.today()) + '\nDEFINITION  ' + str(genbank_database[n]) + '.\nFEATURES           Location/Qualifiers\n   source       /organism="' + str(genbank_database[n]) + '"\n')
KeyError: 'NODE_48_length_87362_cov_21.002703'

I am running VIBRANT with (-l 10000). Any ideas on what could be the issue?

Thanks in advance for the help,
Alex

KrisKieft commented 3 years ago

Hi,

I usually do my best to answer these queries as fast as possible, but I'm actually getting married in a couple of days so I'm a bit behind. Feel free to send me a reminder next week; unfortunately I cannot get to it right now. In the meantime, my first guess would be to check for tabs in your sequence names (see the most recent GitHub Issue), though this shouldn't happen with metaSPAdes output.

Kris
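The tab check suggested above can be sketched in a few lines of Python. This is only an illustrative helper, not part of VIBRANT: the function name `find_tab_headers` is hypothetical, and it assumes a plain FASTA file where header lines start with `>`.

```python
def find_tab_headers(fasta_lines):
    """Return (line_number, header) pairs for FASTA headers that contain a tab.

    `fasta_lines` is any iterable of lines, e.g. an open file handle.
    Hypothetical helper for illustration; not part of VIBRANT.
    """
    suspect = []
    for number, line in enumerate(fasta_lines, start=1):
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].rstrip("\r\n")
        if "\t" in header:
            suspect.append((number, header))
    return suspect


# Usage (the file name is the example assembly from this thread):
# with open("viral_test.fa") as handle:
#     for number, header in find_tab_headers(handle):
#         print(f"line {number}: tab in header {header!r}")
```

An empty result means no header contains a tab, which is what would be expected for metaSPAdes contig names.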

KrisKieft commented 3 years ago

I actually just checked your file quickly and didn't see anything odd with that sequence. I'll try to look in detail later. My only other quick idea is that the conda install of VIBRANT can behave oddly, and switching to a GitHub download sometimes works better.

alexmsalmeida commented 3 years ago

Hi Kris,

Thanks for doing a quick check (and congrats on the marriage). I dug a bit further and made two other observations:

1) Whenever this error happens, the contigs in question are only present in the lysogenic/lytic fna files, but not in the combined fna file.
2) The error does not seem to be reproducible for every file - I am getting it somewhat randomly for a small subset of these metaSPAdes assemblies.
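The first observation can be checked with a small comparison of FASTA headers. The sketch below is an assumption-laden illustration, not VIBRANT code: the function names are hypothetical, and it assumes contig names are compared as the full header text after `>`.

```python
def fasta_headers(lines):
    """Collect sequence names (the text after '>') from FASTA lines."""
    return {line[1:].strip() for line in lines if line.startswith(">")}


def missing_from_combined(partial_files, combined_lines):
    """Return sorted names found in any partial file but not in the combined one.

    `partial_files` is a list of line iterables (e.g. the lytic and
    lysogenic fna files); `combined_lines` is the combined fna file.
    Hypothetical helper for illustration; not part of VIBRANT.
    """
    combined = fasta_headers(combined_lines)
    missing = set()
    for lines in partial_files:
        missing |= fasta_headers(lines) - combined
    return sorted(missing)


# Usage (file names are placeholders, not VIBRANT's actual output names):
# with open("lytic.fna") as a, open("lysogenic.fna") as b, open("combined.fna") as c:
#     print(missing_from_combined([a, b], c))
```

Any name this returns is a candidate for the `KeyError`, since it exists in a partial file but was never seen in the combined file.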

We've had some issues with slow speeds in the filesystem of our HPC cluster. Is it possible that those contig names are being checked in the combined.fna file before they have been written there from their original lytic/lysogenic fna files? Is there anything that could be done to wait for the fna files to be properly written before moving on to the next step?

Best wishes,
Alex

KrisKieft commented 3 years ago

If the issue is a slow computing cluster then telling the script to wait for several seconds intermittently may help. I have no idea if this will work but I attached a version of the run script that waits for a little while at various places. I had to zip the file so this Issues page would accept it as an attachment. Simply replace your existing VIBRANT_run.py script with this one (don't forget to unzip first so it overwrites what you have).

VIBRANT_run.py.zip
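The attached script is not reproduced here. As a rough sketch of the general idea only, a fixed sleep can be replaced by polling until the expected contig names are actually visible in the file, with a timeout. Everything below is an assumption for illustration: the function names, the default timings, and the idea of checking headers are not taken from VIBRANT.

```python
import time


def wait_until(predicate, timeout=60.0, interval=2.0):
    """Call predicate() repeatedly until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def fasta_has_headers(path, expected):
    """True if every name in `expected` appears as a header in the FASTA at `path`.

    A missing file counts as "not ready yet" rather than an error.
    """
    try:
        with open(path) as handle:
            found = {line[1:].strip() for line in handle if line.startswith(">")}
    except FileNotFoundError:
        return False
    return set(expected) <= found


# Usage (path and contig name are placeholders):
# ready = wait_until(lambda: fasta_has_headers("combined.fna", ["NODE_48_length_87362_cov_21.002703"]))
# if not ready:
#     raise RuntimeError("combined.fna is still incomplete after waiting")
```

Polling on the condition that matters avoids guessing how long a slow filesystem needs, and the timeout keeps a genuinely missing contig from hanging the pipeline. Whether this would address the error in this thread is not established.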

alexmsalmeida commented 3 years ago

Hi Kris,

Thanks for that, I will give it a go to see if I get the error again. In the meantime I managed to successfully analyse the first batch of assemblies after a few retries.

Thanks again,
Alex