AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

Which VIBRANT file(s) to generate gene-to-genome file for vConTACT2? #29

Open satkinson0115 opened 3 years ago

satkinson0115 commented 3 years ago

Hi @KrisKieft,

I'm trying to run some VIBRANT output through vConTACT2 and I'm curious which VIBRANT output to use to generate the gene-to-genome file. I thought I saw a file with the associations with the HMMs, but now that I'm going to look for it I don't see anything like that. Could you clarify which output to use for this step?

Thanks, Samantha

P.S. Sorry for posting in the wrong place the first time. I thought it was related to the other thread!

KrisKieft commented 3 years ago

Hi Samantha,

Not a problem on the posting location. I've had this question before and just wanted to make it easier to find in the future. vConTACT2 takes in protein (.faa, .fasta) sequences of phages/viruses. They don't need to be annotated prior to analysis. A common input is protein in Prodigal format, which is simply the name of the scaffold followed by an underscore (_) and then the protein number. VIBRANT outputs this information plus extra in the viral proteins output file. You can use VIBRANT's output in addition to a little bit of post-processing to create a Prodigal-like input for vConTACT2.

Steps: Take the combined VIBRANT viral protein sequences in VIBRANT_phages*/*combined.faa. The asterisk (*) stands for the full name of your specific file(s). I'll call this file example_combined.faa. In the most recent update of VIBRANT I added a script called simplify_faa-ffn.py in the scripts folder. If you don't currently have this script, you can download it from GitHub separately because it requires no dependencies and is independent of VIBRANT. This very simple script trims off the extra information in example_combined.faa (or .ffn) and converts it directly to a Prodigal format file usable by vConTACT2. To do so, run python3 simplify_faa-ffn.py example_combined.faa. This will generate a file called example_combined.simple.faa. This output file is in Prodigal format and can be used to create the gene-to-genome.csv file required for vConTACT2.
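Before building the gene-to-genome file, it may help to confirm that every header in the simplified file really ends in an underscore followed by a protein number. The following is a minimal sketch of such a check, not part of VIBRANT or vConTACT2; the function name `find_non_prodigal_headers` is hypothetical, and the pattern assumes only the "scaffold name, underscore, protein number" convention described above.

```python
import re

# Prodigal-style ID as described in this thread:
# scaffold name, then "_", then the protein number (e.g. scaffold_12_3).
PRODIGAL_ID = re.compile(r"^(?P<scaffold>.+)_(?P<number>\d+)$")


def find_non_prodigal_headers(lines):
    """Return the FASTA headers that do not end in '_<number>'.

    `lines` is any iterable of text lines, such as an open .faa file.
    """
    unexpected = []
    for line in lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        header = line[1:].strip()
        if not PRODIGAL_ID.match(header):
            unexpected.append(header)
    return unexpected
```

To check a real file, pass an open handle, for example `with open("example_combined.simple.faa") as fh: print(find_non_prodigal_headers(fh))`. An empty list suggests the headers follow the expected naming.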

Please let me know if more details or help is required.

Kris

satkinson0115 commented 3 years ago

I was able to successfully use the auxiliary script you provided to get the simplified .faa file. Am I supposed to run that file through Blast or some other annotation software to get the genome associated with it? The example gene-to-genome file on vConTACT2 page has a phage identity:

protein_id,contig_id,keywords
ref|NP_039777.1|,Sulfolobus spindle-shaped virus 1,ORF B-251
ref|NP_039778.1|,Sulfolobus spindle-shaped virus 1,ORF D-335

Or would my contig id be the k_ header; i.e.

k141_93196 flag=1 multi=6.0000 len=2379_1 in the simplified file becomes

protein_id,contig_id,keywords
93193,k_141,keyword

Would having a number as the protein ID break things, and should I add a prefix before the Prodigal number?

Thanks for your help! Samantha

KrisKieft commented 3 years ago

It depends on how your scaffold (contig) names are set up. The example on vConTACT2's site looks like it's straight from NCBI which is not necessary in your case. What you have appears to be right off an assembler which will also work just fine. Below is an example image of a gene-to-genome.csv file from one of my analyses. The first column is the simplified protein names. The second column is that simplified name but only the scaffold (contig) name, which is the protein name excluding the last underscore and number. The way I personally use vConTACT2 is to leave the keyword blank.

For your example it would be k141_93196 flag=1 multi=6.0000 len=2379_1, k141_93196 flag=1 multi=6.0000 len=2379, keyword. I think you may need to remove the spaces in the protein/scaffold names.

[Image: example gene-to-genome.csv file]
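The column layout Kris describes can be sketched in Python as follows. This is an illustration under stated assumptions, not a script shipped with VIBRANT or vConTACT2: the function names are hypothetical, spaces are replaced with underscores (one possible way to "remove the spaces"), and the keywords column is left empty to mirror Kris's usage. Whether vConTACT2 accepts an empty keywords field is not confirmed in this thread.

```python
import csv


def gene_to_genome_rows(faa_lines, space_replacement="_"):
    """Yield (protein_id, contig_id, keywords) for each FASTA header.

    Assumes Prodigal-style headers: scaffold name + "_" + protein number.
    """
    for line in faa_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        # Assumption: spaces in names are replaced rather than dropped.
        protein_id = line[1:].strip().replace(" ", space_replacement)
        # Contig name = protein name minus the last underscore and number.
        contig_id, _, number = protein_id.rpartition("_")
        if not contig_id or not number.isdigit():
            raise ValueError(f"not a Prodigal-style header: {protein_id!r}")
        yield protein_id, contig_id, ""  # keywords left blank


def write_gene_to_genome(faa_lines, out_handle):
    """Write the header row and one row per protein to an open text handle."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["protein_id", "contig_id", "keywords"])
    writer.writerows(gene_to_genome_rows(faa_lines))
```

For Samantha's example header, this yields the protein ID `k141_93196_flag=1_multi=6.0000_len=2379_1` and the contig ID `k141_93196_flag=1_multi=6.0000_len=2379`. If names are changed this way, the protein IDs in the CSV presumably need to match the headers in the .faa file given to vConTACT2, so the same renaming would have to be applied to that file as well.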

paulaist commented 3 years ago

@satkinson0115 did you get this working?

satkinson0115 commented 3 years ago

@paulaist yes I did. I had it match Kris's format and I think I did remove the spaces as suggested, and it worked.