churchill-lab / g2gtools

Personal diploid genome creation and coordinate conversion
http://churchill-lab.github.io/g2gtools
21 stars 9 forks source link

Patched 0 SNPs -> an old .tbi index file seems to be culprit. #24

Open mikelove opened 2 years ago

mikelove commented 2 years ago

Similar to #19, but in this case, the same chromosome names are used, so that shouldn't be the issue. I'm using the genome that was used to create the VCF files, as provided by Sanger.

Using g2gtools 0.2.7 installed via conda install -c kbchoi g2gtools=0.2.7=py36_0.

If I download these VCF and FASTA for CAST:

ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa

And then start the first two steps to create a transcriptome:

g2gtools vcf2vci -p 12 -i snps/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz \
  -i snps/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz \
  -o snps/CAST_EiJ.vci -s CAST_EiJ -f fasta/GRCm38_68.fa --pass --quality

g2gtools patch -p 12 -i fasta/GRCm38_68.fa -c snps/CAST_EiJ.vci.gz \
  -o fasta/CAST_EiJ.patched.fa 2> fasta/CAST_EiJ.patched.log

The log output says:

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 0 SNPs total
[g2gtools] Patch complete: 00:00:14.33

And the output FASTA is equivalent to reference.

If I exclude the indels file, I get:

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 552,805 SNPs total
[g2gtools] Patch complete: 00:00:18.53

I've tried re-ordering the SNP and indels file, and I've tried with and without the pass or quality filters. I've also tried merging the two VCF into one file (removing the header of the second one).

mattjvincent commented 2 years ago

Can we see the output from the patch command with the debug flag turned on?

Try just looking at chromosome 17:

# create just a Fasta file for chromosome 17
g2gtools extract -i fasta/GRCm38_68.fa -id 17 > fasta/GRCm38_68.17.fa

# run patch on just that Fasta file with debug (-d) turned on
g2gtools patch -p 12 -i fasta/GRCm38_68.17.fa -c snps/CAST_EiJ.vci.gz \
  -o fasta/CAST_EiJ.17.patched.fa -d 2> fasta/CAST_EiJ.17.patched.log

I'd like to see the contents of fasta/CAST_EiJ.17.patched.log

mikelove commented 2 years ago

While testing chr 17, I think I may have found the issue. I think an out-of-date .tbi file sitting alongside CAST_EiJ.vci.gz is the culprit. Not sure why vcf2vci isn't replace the index. I'm going to test now if adding tabix -p vcf <strain>.vci.gz after vcf2vci and before patch alleviates the issue.

mikelove commented 2 years ago

This appears to have been the issue.

Either deleting the stale .tbi before running vcf2vci, or forcing a new index to be created with tabix -p vcf <strain>.vci.gz, now I get non-zero SNPs.

So it appears vcf2vci followed by patch won't overwrite an old .tbi file, although it will overwrite the old vci.gz.

==> fasta/129S1_SvImJ.patched.log <==
[g2gtools] Patched 5,173,412 SNPs total
[g2gtools] Patch complete: 00:00:44.89

==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 20,668,274 SNPs total
[g2gtools] Patch complete: 00:00:50.04