AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

No data in any of the *.phages_*.fna files #48

Closed yue-clare-lou closed 3 years ago

yue-clare-lou commented 3 years ago

Hello,

I ran VIBRANT on ~142k fasta files that I know for sure contain a sufficient number of phage scaffolds. When the run finished, I noticed a couple of issues:

  1. None of the outputs from VIBRANT were put into the designated folders (i.e., VIBRANTphages, VIBRANTresults). Instead, they were all written outside of those folders.
  2. While there are data in the *.phages_*.faa and *.phages_*.ffn files, there is no data in any of the *.phages_*.fna files.

I wonder if these two issues have something to do with the error that appeared in the VIBRANT_log_run file:

Traceback (most recent call last):
  File "VIBRANT_run.py", line 599, in <module>
    genbank.write('//\nLOCUS ' + str(genbank_database[n]) + ' ' + str(length_database[genbank_database[n]]) + ' bp DNA ' + str(form) + ' VRL ' + str(date.today()) + '\nDEFINITION ' + str(genbank_database[n]) + '.\nCOMMENT Annotated using VIBRANT v1.2.1\nFEATURES Location/Qualifiers\n source /organism="' + str(genbank_database[n]) + '"\n')
KeyError: 'scaffold_X'

VIBRANT rated scaffold_X as a medium quality draft and classified it as lysogenic. CheckV scored the same scaffold as high quality.

Thanks, Clare

KrisKieft commented 3 years ago

Hi Clare,

I'm actually not sure why it would have broken here. This step comes after virus identification, and the information for all viruses is used to build a genbank file. If scaffold_X was given a quality then it should have been processed correctly. Did this only happen to one of the input files? Since virus identification did finish and it broke at post-processing, you can use the information in the .faa/.ffn files, or the list of virus names .txt file, to grab the virus sequences. That of course isn't the best solution if the error occurred for many input files.
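As a rough illustration of the "grab the virus sequences" workaround, here is a minimal Python sketch that copies records from a FASTA file when their definition line appears in a names list. It is not part of VIBRANT. The function names and file layout (one name per line, matched against the full definition line without the leading ">") are assumptions, so check them against what your names file actually contains.

```python
def read_names(path):
    """Return the set of non-empty lines in a names file (one name per line)."""
    with open(path) as handle:
        names = (line.rstrip("\r\n") for line in handle)
        return {name for name in names if name}


def extract_sequences(fasta_path, names, out_path):
    """Copy records whose full definition line (without '>') is in `names`.

    Assumption: names are compared against the whole definition line,
    not just its first whitespace-delimited token.
    """
    written = 0
    keep = False
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        for line in fasta:
            if line.startswith(">"):
                # Decide once per record whether to keep it.
                keep = line[1:].rstrip("\r\n") in names
                if keep:
                    written += 1
            if keep:
                out.write(line)
    return written
```

The return value is the number of records written, which can be compared with the number of names to spot entries that did not match.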

Here are some questions that may help me figure it out: Which version of VIBRANT are you running? Are there any special characters in the sequence names, such as a pipe symbol ( | )? Are there multiple sequences called scaffold_X in the same file? Did any files with the same name, within the same folder, run at the same time?

Kris
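Two of the questions above (special characters and duplicate names) can be checked mechanically. The following is a minimal Python sketch, not a VIBRANT utility. The function name, the default set of "suspicious" characters (pipe and tab), and the choice to compare full definition lines when looking for duplicates are all assumptions made for illustration.

```python
from collections import Counter


def check_fasta_headers(lines, suspicious="|\t"):
    """Return (duplicates, flagged) for the definition lines in `lines`.

    duplicates: sorted definition lines that occur more than once.
    flagged: definition lines containing any character in `suspicious`.
    """
    counts = Counter()
    flagged = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a definition line
        header = line[1:].rstrip("\r\n")
        counts[header] += 1
        if any(ch in header for ch in suspicious):
            flagged.append(header)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    return duplicates, flagged
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, for example `check_fasta_headers(open("GPD_sequences.fa"))`.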

yue-clare-lou commented 3 years ago

Hey Kris,

Which version of VIBRANT are you running?
I am currently using version 1.2.1.

Are there any special characters in the sequence names, such as a pipe symbol ( | ).
No. This is an example seq name of the scaffold that triggered the error message: >uvig_367585 SRR1761699_976 length_23967_VirSorter_cat_2

Are there multiple sequences called scaffold_X in the same file?
No

Did any files with the same name, within the same folder, run at the same time?
No

The database I am running VIBRANT on is the Gut Phage Database (https://doi.org/10.1016/j.cell.2021.01.029). I have run VIBRANT twice on this dataset, and each time it stopped with the same error, except that the scaffold that triggered it was different each time (uvig_367585, uvig_456365). In both runs, these two scaffolds were rated by VIBRANT.

I want to extract all provirus sequences, so I'd like to use the *.phages_lysogenic.fna file specifically. Both *.phages_lysogenic.faa and *.phages_lysogenic.ffn contain only gene sequences, so they are not ideal in my case.

KrisKieft commented 3 years ago

I'll download the database and look into it.

yue-clare-lou commented 3 years ago

Thanks a lot!

This is how I run VIBRANT:

VIBRANT_run.py -i GPD_sequences.fa -folder VIBRANT -t 48

I am currently running VIBRANT on the GPD database using version v1.0.1. I wonder whether the error that I ran into has something to do with a specific version.

KrisKieft commented 3 years ago

v1.0.1 of VIBRANT? If so then that is likely the cause of the error. The initial releases (v1.0.0 and v1.0.1) had a few bugs. To my knowledge v1.2.1 is fully stable.

yue-clare-lou commented 3 years ago

I ran VIBRANT twice using v1.2.1 on the Gut Phage Database and I received the same type of error that I pointed out earlier.

I therefore just switched to v1.0.1 of VIBRANT to see if I will run into the same error. This is currently running.

KrisKieft commented 3 years ago

In all honesty v1.0.1 has a couple major issues and likely is not worth running. But if you're curious you can leave it running to see if you get the same error.

yue-clare-lou commented 3 years ago

I see, thanks for letting me know. I am just curious so I will let it run. I will let you know whether I run into the same error or not.

yue-clare-lou commented 3 years ago

Hey fyi - when using v1.0.1, I also ran into the same issue but it was triggered by a different scaffold ('uvig_338582'). I wonder if it is because the sequence names from the GPD don't get recognized by VIBRANT when it is trying to build the genbank file?

Update -- I think I know why. The sequence names from the GPD contain both \t and spaces, and it is the \t that is causing VIBRANT to crash.
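To show how a tab in a sequence name could plausibly end in a KeyError, here is a small Python sketch. It is not VIBRANT's code. It only assumes a generic pattern in which names are written into tab-separated records and later read back by splitting on tabs; the function name and the record layout are hypothetical.

```python
def name_survives_roundtrip(header):
    """Return True if `header` can still be looked up after a tab-delimited round trip."""
    # A table keyed by the full definition line, as read from the FASTA file.
    lengths = {header: 100}

    # A hypothetical intermediate record: name, then a quality label.
    record = header + "\t" + "medium quality draft"

    # Splitting on tabs assumes the name itself contains no tab,
    # so a name with a tab is cut short at its first tab.
    name = record.split("\t")[0]

    try:
        lengths[name]
    except KeyError:
        return False
    return True
```

Under these assumptions a header with only spaces survives the round trip, while a header containing a tab is truncated and the lookup raises KeyError.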

KrisKieft commented 3 years ago

Looks like it could be tabs? VIBRANT separates some information by tabs and assumes there will not be tabs in the definition lines of sequences. It can handle spaces and most things, but tabs may be an issue (I could be wrong here). Using grep it looks like every sequence has a tab. Try replacing tabs in the file and running again. An option to do this is with sed:

cat GPD_sequences.fa | sed 's/\t/~/g' > GPD_sequences.no-tabs.fa

Replace ~ with whatever you want to replace the tabs with. If this solves the issue then I'll update the README.
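For anyone who prefers not to rely on sed (some sed implementations do not interpret \t as a tab), the same cleanup can be sketched in Python. This is an illustrative helper, not something shipped with VIBRANT. The function name and the default replacement (a single space) are assumptions, and unlike the sed command it only touches definition lines.

```python
def replace_header_tabs(in_path, out_path, replacement=" "):
    """Write a copy of a FASTA file with tabs in definition lines replaced.

    Returns the number of definition lines that were changed.
    Sequence lines are copied through untouched.
    """
    changed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">") and "\t" in line:
                line = line.replace("\t", replacement)
                changed += 1
            dst.write(line)
    return changed
```

With the file names from this thread, a call would look like `replace_header_tabs("GPD_sequences.fa", "GPD_sequences.no-tabs.fa")`. Pass `replacement="~"` to mirror the sed command exactly.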

yue-clare-lou commented 3 years ago

Hey yah it is a tab issue. I replaced tab with space and no more errors. Thanks!