IDEELResearch / vcfdo

Some tools for processing and annotating VCF files, with focus on Plasmodium spp
MIT License
6 stars 4 forks source link

About the fasta file in polarize command #5

Closed smallfishcui closed 5 years ago

smallfishcui commented 5 years ago

Hi Andrew,

I would like to add Ancestral allele info to my vcf file, so I would be able to get the derived allele for my site frequency spectrum. I was running a command: python /homeappl/home/cuiwang/.local/bin/vcfdo polarize -i SNP_NOmissing_withOG.recode.vcf.gz -f SNP_NOmissing_withOG.recode.min4.fasta -n 10 INFO:vcfutils:Task: polarize INFO:polarize:Connecting to VCF file INFO:polarize:Ancestral genome: <Y56.Nomissing.snps.fasta>

fileformat=VCFv4.2

FILTER=

fileDate=20190605

source="Stacks v2.3e"

INFO=

INFO=

INFO=

INFO=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

FORMAT=

INFO=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

vcfdo_command=polarize -i SNP_NOmissing_withOG.recode.vcf.gz -f Y56.Nomissing.snps.fasta -n 10; pwd=/wrk/cuiwang/DONOTREMOVE/Phrag_RAD/denovo/ustacks/M4m4/popu_K6; timestamp=2019-10-16 03:29:11

INFO=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Y1 Y3 Y4 Y6 Y7 Y9 Y10 Y11 Y12 Y13 Y14 Y15 Y16 Y17_1 Y18 Y19 Y20 Y21 Y22 Y23 Y24 Y25 Y26 Y27 Y28 Y30 Y31 Y32 Y33 Y34 Y35 Y36 Y37 Y38 Y39 Y40 Y41 Y43 Y44 Y45 Y46 Y47 Y48 Y49 Y51_1 Y53 Y55 Y56 Y57

However I get a error message: INFO:polarize:Processing VCF file ... Traceback (most recent call last): File "/homeappl/home/cuiwang/.local/lib/python3.6/site-packages/pyfaidx/init.py", line 1014, in getitem return self.records[rname] KeyError: '36240'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/homeappl/home/cuiwang/.local/bin/vcfdo", line 2152, in args.func(vcf = vcf, **vars(args)) File "/homeappl/home/cuiwang/.local/bin/vcfdo", line 810, in polarize_anc_vs_der og = str(genome[site.CHROM][start:end]) File "/homeappl/home/cuiwang/.local/lib/python3.6/site-packages/pyfaidx/init.py", line 1016, in getitem raise KeyError("{0} not in {1}.".format(rname, self.filename)) KeyError: '36240 not in Y56.Nomissing.snps.fasta.'

I set the fasta file to be the outgroup sequence, and 36240 was a name of a cotig, why the package is complaining the contig is not in the fasta file? Could you please give a hint how should I format the fasta file here?

Thanks, Cui

andrewparkermorgan commented 5 years ago

This error is being raised by the pyfaidx package, which does not have any special restrictions on the formatting of contig names as far as I know. Can you confirm that there is a contig named '36240' (exactly) in your fasta file? Should look like

>36240
ACGCTAGATCGA ...
smallfishcui commented 5 years ago

Hi Andrew,

No the contig name is not in my fasta file. Here is the fasta file: >Y56 TCAGCGGCGTGCATACCAACGCGACTTGTCCGCCGGGMATTAACGTGCTCAGAGACGCCCTATCCCGCCTTATCGCTATGCGAGCGCATTCCTTGGTCGCTACCGCTGGYGCTACGAGTTCGCCAGTCTAKTAAAGATKCGTTCCAATCTGGCCCAGGCACAAGCTCGTATGAAGCGATAGACATCCCTGCAATACGCCTACGCATGACGACTCCTACCCGACGTACGAAGTGCCTTCTTACGGYCCCACGATTAAATATATTGATCGCGACTGTTCACCCTTCAGGCATACAACGACAKCGACATGCCTCCKYTGAGTMTAGTAGAAGTGACACGACKKSGCGGATYTAYSCACACGAGGCAGGAGTCSKMCACYAYGWTCCYCTYGYCTYRCRCAGARCTAMYAGGTTTGACACAGCCCTYGGGAGACCGAGGCAGCGCTTGCCGGGCCTTTTCTACCGACYCAGGCTACTTGGCGCGCGWGTCCAGAAGATGCGTCGCCCGCCT

That's it. I just included the outgroup sequence here as an ancestral state of my target populations. The contig name appeared in the index file (*.tbi) of my vcf file. Again, when I run an input un compressed vcf file, the program was complaing: [W::vcf_parse] Contig '36240' is not defined in the header. (Quick workaround: index the file with tabix.) Traceback (most recent call last): File "/homeappl/home/cuiwang/.local/lib/python3.6/site-packages/pyfaidx/init.py", line 1014, in getitem return self.records[rname] KeyError: '36240'

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/homeappl/home/cuiwang/.local/bin/vcfdo", line 2152, in args.func(vcf = vcf, **vars(args)) File "/homeappl/home/cuiwang/.local/bin/vcfdo", line 810, in polarize_anc_vs_der og = str(genome[site.CHROM][start:end]) File "/homeappl/home/cuiwang/.local/lib/python3.6/site-packages/pyfaidx/init.py", line 1016, in getitem raise KeyError("{0} not in {1}.".format(rname, self.filename)) KeyError: '36240 not in Y56.Nomissing.snps.fasta.'

So I compressed the .vcf and tabix it then run it, then I got the error mentioned above.

Cui

smallfishcui commented 5 years ago

Hi again,

Is this fasta file wrong? Basically it is just a concatenation of all the SNPs I have in the vcf file. So my question is how could the programe know where each nucleotide acid belongs to? Am I supposed to have a fasta file with genome coordination as each sequence name?

thanks, Cui

andrewparkermorgan commented 5 years ago

The fasta file to be used with vcfdo polarize is expected to have the same contig names and lengths as the reference sequence used to make the VCF, but (possibly) different nucleotide sequences. Furthermore, IUPAC ambiguity codes (R, W, C, Y, ...) will be ignored and treated as missing bases, I think. vcfdo is designed with whole-genome sequencing in mind and it looks like you are using RADseq or other reduced-representation data -- that's a use case I haven't thought much about.

If you want to use the polarize command you'll need to create a fasta with the sequences of your RADseq contigs, and then modify it with the variant calls for an outgroup sample.

andrewparkermorgan commented 5 years ago

To make it more clear with an example, suppose the reference sequence for contig 'chr1' is as shown:

>chr1
TAA G TCA

and that you have a VCF entry like this, with (diploid) genotype for an outgroup individual you can treat as an approximation to the ancestral sequence:

#CHROM      POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  outgroup_sample
chr1    4   .   G   T   .   PASS    .   GT  T/T

The 'ancestral genome' fasta file should look like

>chr1
TAA T TCA

This all assumes that you have a reference sequence -- which is an assumption that is sort of built into the way variants are reported in the VCF format anyway. For RADseq, the "reference" sequence for each contig might have to be some kind of consensus across all your samples.

Indels have inherent ambiguity in the setting of pairwise alignment, so polarize should ignore them.

smallfishcui commented 5 years ago

Hi Andrew,

Thank you for the detailed explanation! Yes, my data is RADseq, and the reference genome is the consensus across all the samples. I have an outgroup sample which I would like to use as ancestral sequence. I have made a fasta format like this:

36240:17 T 36240:20 C 36240:34 A 36240:36 G 36240:44 C

36240 is the contig name. I will try it out to see if it works!

br, Cui

smallfishcui commented 5 years ago

So I reformed the fasta file and run the program, it still complains about the packages: Traceback (most recent call last): File "/homeappl/home/cuiwang/.local/bin/vcfdo", line 16, in import pybedtools as pbt ImportError: No module named 'pybedtools'

Although the pybedtools is exactly in the same folder with vcfdo. Is it because I don't have the acess to the root for the program installation?

Cui

andrewparkermorgan commented 5 years ago

This sounds like a problem with your Python environment rather than with vcfdo. I recommend installing vcfdo and all the dependencies in a virtual environment (either with a tool like conda or with virtualenv) unless you have root/admin access.