edgardomortiz / vcf2phylip

Convert SNPs in VCF format to PHYLIP, NEXUS, binary NEXUS, or FASTA alignments for phylogenetic analysis
GNU General Public License v3.0

Default output type of vcf2phylip.py: too many ambiguous nucleotide sequences? #51

Open LiuCanidk opened 1 month ago

LiuCanidk commented 1 month ago

Hi, thanks for developing this tool

I ran vcf2phylip.py successfully, but the output seems to be amino acid sequences. My command and a screenshot of my output file are as follows:

python /work/share/acuwbf4fll/liucan/software/phylip/vcf2phylip-2.8/vcf2phylip.py -i /work/share/acuwbf4fll/liucan/HND_project/Bulk_RNA_variant_calling/06.GVCF_filter/output/HND.SNV.recode.vcf --output-folder /work/share/acuwbf4fll/liucan/HND_project/Bulk_RNA_variant_calling/09.phylotree --output-prefix HND_RNA_SNV

[image: screenshot of the output file]

I did not find any parameter that sets the output type, but I would prefer a nucleotide sequence alignment as output. How can I do this?

Any suggestions would be greatly appreciated!

edgardomortiz commented 1 month ago

They are nucleotides; VCF doesn't support amino acids as far as I know. Your heterozygous genotypes are represented with ambiguity codes, see here https://www.promega.com/resources/guides/nucleic-acid-analysis/restriction-enzyme-resource/restriction-enzyme-resource-tables/iupac-ambiguity-codes-for-nucleotide-degeneracy/

Edgardo
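
For illustration, here is a minimal Python sketch of how a diploid genotype can be collapsed into a single IUPAC character. The table follows the standard two-base nucleotide ambiguity codes; the function name `genotype_to_iupac` is hypothetical and the code is not taken from the vcf2phylip source.

```python
# Standard IUPAC codes for unordered pairs of nucleotides.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
}


def genotype_to_iupac(allele1, allele2):
    """Collapse two alleles into one character; anything unrecognized becomes 'N'."""
    key = frozenset((allele1.upper(), allele2.upper()))
    return IUPAC.get(key, "N")


print(genotype_to_iupac("A", "G"))  # heterozygous A/G
print(genotype_to_iupac("C", "C"))  # homozygous C/C
```

Letters such as R, Y, S, W, K and M are also one-letter amino acid codes, which is probably why an alignment rich in heterozygous sites can look like a protein sequence at first glance.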

LiuCanidk commented 1 month ago

@edgardomortiz I see. Thanks for your reply! I then wonder whether it is normal that my translated phylip file is filled with ambiguity codes, and whether this would affect tree construction. If so, should I enable the --resolve-IUPAC parameter to force the choice of a single nucleotide?

edgardomortiz commented 1 month ago

I don't think it is a good idea to translate SNPs; they are not contiguous in the genome. Besides that, degenerate nucleotides will create degenerate amino acids as well during translation. The option --resolve-IUPAC will choose one nucleotide at random when you have an ambiguity. You may try that, but I think it won't fix your issue of trying to translate SNPs (unless I am missing something about your specific VCF).

I hope this makes sense,

Edgardo
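
As a rough sketch of what "choose one nucleotide at random" means, the snippet below resolves two-base ambiguity codes in a sequence string. It is an assumption-laden illustration, not the script's actual implementation: the function name `resolve_iupac` is hypothetical, and N and the three-base codes (B, D, H, V) are deliberately left untouched.

```python
import random

# Two-base IUPAC codes and the nucleotides each one stands for.
AMBIGUITY = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def resolve_iupac(seq, rng=None):
    """Replace each two-base ambiguity code with one of its bases, picked at random."""
    rng = rng or random.Random()
    return "".join(
        rng.choice(AMBIGUITY[base]) if base in AMBIGUITY else base
        for base in seq
    )


# Passing a seeded generator makes the random choice reproducible.
print(resolve_iupac("ARYN", random.Random(0)))
```

Resolving at random discards the information that a site was heterozygous, so two runs with different seeds can yield different alignments.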

LiuCanidk commented 1 month ago

Thanks for your reply. By "translating SNPs", I meant converting the VCF format to an alignment format, e.g. PHYLIP, for tree construction. My wording was misleading: I did not mean translating nucleotides into amino acid sequences. Sorry about that.

I agree that SNPs are discontinuous in the genome. I am just wondering why I got so many ambiguous sequences from the VCF and whether I should add the --resolve-IUPAC parameter to avoid this. That is, would too many ambiguous sequences hamper the downstream tree construction?

Thanks in advance

edgardomortiz commented 1 month ago

Ah I see, you meant converting VCF to another format (sorry for being pedantic, but translating has a biological meaning and I got confused). As I said above, you have heterozygous genotypes because I assume your organism is at least diploid. For phylogenetics it is common to use a single sequence per sample, and the way to achieve this is to represent both possible nucleotides with a single ambiguity code. As for the consequences of these ambiguities on your data, I can't predict them because I am not familiar with the organisms you are analyzing, but in general the more ambiguities there are, the less resolved the tree ends up.

Maybe your SNP calling settings were set up incorrectly? Maybe your reference genome is too distant? I don't know, I am just speculating here...

Edgardo
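
To get a feel for how many ambiguities each sample carries, one option is to measure the fraction of two-base ambiguity codes per sequence in the converted alignment. The sketch below assumes a sequential PHYLIP layout (a header line followed by one "name sequence" line per sample); the function name `ambiguity_fraction` is hypothetical.

```python
def ambiguity_fraction(phylip_text):
    """Return {sample: fraction of sites that are two-base IUPAC ambiguity codes}."""
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    fractions = {}
    for line in lines[1:]:  # skip the "ntax nchar" header line
        name, seq = line.split(None, 1)
        seq = seq.replace(" ", "").upper()
        ambiguous = sum(base in "RYSWKM" for base in seq)
        fractions[name] = ambiguous / len(seq) if seq else 0.0
    return fractions


example = "2 4\ns1 ARYG\ns2 ACGT\n"
print(ambiguity_fraction(example))
```

A high fraction in every sample would point towards the calling or filtering settings rather than towards biology, but that is only a heuristic.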

LiuCanidk commented 1 month ago

Oh, sorry for the missing information. The organism is human; more specifically, the material is a cancer cell line, with some treatments applied.

I checked the VCF file and did find something odd: some genotypes are missing. That may be the cause, and it may be due to hard genotype filtering. [image]

However, I wonder whether what you said about representing both possible nucleotides with a single ambiguity code could work. How can I achieve this?

edgardomortiz commented 1 month ago

This is what the script does by default; it is the reason you have the ambiguity codes in the first place. No need to do anything additional.
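
For the missing genotypes noted earlier in the thread, a quick per-sample count can show how much of the matrix is affected before conversion. This is a minimal sketch that reads uncompressed VCF text and relies on GT being the first FORMAT key, as the VCF specification requires when GT is present; the function name `missing_genotype_counts` is hypothetical.

```python
def missing_genotype_counts(vcf_text):
    """Count, per sample, the genotypes with at least one missing allele ('.')."""
    samples, counts = [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            counts = {sample: 0 for sample in samples}
            continue
        for sample, entry in zip(samples, fields[9:]):
            genotype = entry.split(":")[0]  # GT is the first FORMAT key
            if "." in genotype:
                counts[sample] += 1
    return counts
```

Calling it on the contents of a VCF file, for example `missing_genotype_counts(open(path).read())` with `path` pointing at your own file, returns a dictionary keyed by sample name.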