Open LiuCanidk opened 1 month ago
They are nucleotides; VCF doesn't support amino acids as far as I know. Your heterozygous genotypes are represented with IUPAC ambiguity codes, see here https://www.promega.com/resources/guides/nucleic-acid-analysis/restriction-enzyme-resource/restriction-enzyme-resource-tables/iupac-ambiguity-codes-for-nucleotide-degeneracy/
Edgardo
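To illustrate the point about ambiguity codes, here is a minimal Python sketch of how the alleles of one SNP genotype can be collapsed into a single IUPAC character. The table is the standard IUPAC nucleotide code; the function name `genotype_to_iupac` is made up for this example, and this is not the actual code of vcf2phylip.py. It only handles single-base alleles.

```python
# Standard IUPAC nucleotide codes, keyed by the unordered set of bases they stand for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def genotype_to_iupac(alleles):
    """Collapse the single-base alleles of one genotype into one IUPAC character.

    Hypothetical helper for illustration only.
    """
    return IUPAC[frozenset(a.upper() for a in alleles)]


print(genotype_to_iupac(["A", "G"]))  # heterozygous A/G -> R
print(genotype_to_iupac(["C", "C"]))  # homozygous C/C -> C
```

So a letter such as R or Y in the alignment is still a nucleotide symbol, even though it can look like an amino acid code at first glance.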
@edgardomortiz I see. Thanks for your reply! I then wonder whether it is normal that my translated phylip file is filled with ambiguity codes, and whether this would affect tree construction. If so, should I enable the --resolve-IUPAC parameter to force the choice of a single nucleotide?
I don't think it is a good idea to translate SNPs; they are not contiguous in the genome. Besides that, degenerate nucleotides will create degenerate amino acids as well during translation. The option --resolve-IUPAC will choose one nucleotide at random when you have an ambiguity. You may try that, but I think it won't fix your issue of trying to translate SNPs (unless I am missing something about your specific VCF).
I hope this makes sense,
Edgardo
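As a rough picture of what "choose one nucleotide at random" means, the sketch below replaces each two-base ambiguity code with one of its two bases. It is an illustration under stated assumptions, not the script's implementation: the function name `resolve_ambiguities` is hypothetical, only the six two-base codes are resolved, and N and the three-base codes are left untouched.

```python
import random

# Two-base IUPAC codes and the nucleotides each one stands for.
IUPAC_TO_BASES = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


def resolve_ambiguities(seq, rng):
    """Replace every two-base ambiguity code with one of its bases, picked at random.

    Hypothetical helper; other characters (A, C, G, T, N, gaps) pass through unchanged.
    """
    return "".join(
        rng.choice(IUPAC_TO_BASES[c]) if c in IUPAC_TO_BASES else c
        for c in seq
    )


rng = random.Random(0)  # fixed seed so repeated runs give the same result
print(resolve_ambiguities("ACRTYN", rng))
```

Random resolution removes the ambiguity codes but also discards the information that the site was heterozygous, so different seeds can yield different alignments.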
Thanks for your reply. By "translating SNPs", I meant converting the VCF format into an alignment format, e.g., phylip, for tree construction. My wording was misleading; I did not mean translating nucleotides into amino acid sequences. Sorry about that.
I agree that SNPs are discontinuous in the genome. I am just wondering why I got so many ambiguous sequences from the VCF and whether I should add the --resolve-IUPAC parameter to avoid this. That is, would too many ambiguous sequences hamper the downstream tree construction?
Thanks in advance
Ah I see, you meant converting VCF to another format (sorry for being pedantic, but translating has a biological meaning and I got confused). As I said above, you have heterozygous genotypes because I assume your organism is at least diploid. For phylogenetics it is common to use a single sequence per sample, and the way to achieve this is by representing both possible nucleotides with a single ambiguity code. As for the consequences of these ambiguities on your data, I can't predict them because I am obviously not familiar with the organisms you are analyzing, but in general I could say that the more ambiguities there are, the less resolved a tree ends up.
Maybe your SNP calling settings were set up incorrectly? Maybe your reference genome is too distant? I don't know, I am just speculating here...
Edgardo
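One way to judge how many ambiguities an alignment really contains is to compute the fraction of ambiguity codes per sample. The sketch below assumes a simple sequential PHYLIP layout (a header line, then one `name sequence` line per sample); the function name `ambiguity_fractions` and the example data are invented for illustration. N is not counted here, since it usually marks missing data rather than a heterozygous call.

```python
# IUPAC codes that stand for two or three possible bases (N is deliberately excluded).
AMBIGUOUS = set("RYSWKMBDHV")


def ambiguity_fractions(phylip_text):
    """Return {sample name: fraction of ambiguity codes} for sequential PHYLIP text.

    Hypothetical helper; assumes one 'name sequence' line per sample after the header.
    """
    lines = [ln for ln in phylip_text.strip().splitlines() if ln.strip()]
    fractions = {}
    for line in lines[1:]:  # skip the 'ntax nchar' header line
        name, seq = line.split(None, 1)
        seq = seq.replace(" ", "").upper()
        fractions[name] = sum(c in AMBIGUOUS for c in seq) / len(seq)
    return fractions


# Made-up example alignment with two samples and eight sites.
example = """2 8
sample1 ACGTRYAC
sample2 ACGTACGT
"""
print(ambiguity_fractions(example))  # {'sample1': 0.25, 'sample2': 0.0}
```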
Oh, sorry about the missing information. The organism is human; more specifically, the material is a cancer cell line that has received some treatments.
I checked the VCF file and did find something weird: some genotypes are missing. That may be the cause, and it may be due to hard genotype filtering.
However, I wonder whether what you said about representing both possible nucleotides with a single ambiguity code could work. How can I achieve this?
This is what the script does by default, the reason you have the ambiguity codes in the first place. No need to do anything additional...
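The missing genotypes noted earlier in the thread can be quantified directly from the VCF. The following sketch counts, per sample, the calls whose GT field contains a `.` (for example `./.`). It assumes an uncompressed, tab-separated VCF in which GT is the first FORMAT subfield, as the VCF specification requires when GT is present; the function name `missing_genotype_counts` and the example records are made up.

```python
def missing_genotype_counts(vcf_text):
    """Count genotype calls with a missing allele ('.') per sample.

    Hypothetical helper; expects plain-text VCF content with a #CHROM header line.
    """
    samples, counts = [], {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            counts = {s: 0 for s in samples}
            continue
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":")[0]  # GT is the first subfield of each call
            if "." in gt:
                counts[sample] += 1
    return counts


# Made-up two-sample VCF with one missing call for s2.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                    "FILTER", "INFO", "FORMAT", "s1", "s2"])
rows = [
    "\t".join(["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/1", "./."]),
    "\t".join(["chr1", "200", ".", "C", "T", ".", "PASS", ".", "GT", "1/1", "0/0"]),
]
vcf = "\n".join(["##fileformat=VCFv4.2", header] + rows)
print(missing_genotype_counts(vcf))  # {'s1': 0, 's2': 1}
```

Samples with a high count may be worth checking against the filtering settings before building a tree.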
Hi, thanks for developing this tool.
I ran the vcf2phylip.py script successfully but found that the output seems to be amino acid sequences. My code and a screenshot of my output file are as follows:
I did not find any parameter to set the output type, but I would prefer a nucleotide sequence alignment as output. How can I do this?
Any suggestions would be greatly appreciated!