Open mikelove opened 2 years ago
Can we see the output from the patch command with the debug flag turned on?
Try just looking at chromosome 17:
# create just a Fasta file for chromosome 17
g2gtools extract -i fasta/GRCm38_68.fa -id 17 > fasta/GRCm38_68.17.fa
# run patch on just that Fasta file with debug (-d) turned on
g2gtools patch -p 12 -i fasta/GRCm38_68.17.fa -c snps/CAST_EiJ.vci.gz \
-o fasta/CAST_EiJ.17.patched.fa -d 2> fasta/CAST_EiJ.17.patched.log
I'd like to see the contents of fasta/CAST_EiJ.17.patched.log
While testing chr 17, I think I may have found the issue. I think an out-of-date .tbi
file sitting alongside CAST_EiJ.vci.gz
is the culprit. Not sure why vcf2vci isn't replace the index. I'm going to test now if adding tabix -p vcf <strain>.vci.gz
after vcf2vci and before patch alleviates the issue.
This appears to have been the issue.
Either deleting the stale .tbi
before running vcf2vci, or forcing a new index to be created with tabix -p vcf <strain>.vci.gz
, now I get non-zero SNPs.
So it appears vcf2vci followed by patch won't overwrite an old .tbi
file, although it will overwrite the old vci.gz
.
==> fasta/129S1_SvImJ.patched.log <==
[g2gtools] Patched 5,173,412 SNPs total
[g2gtools] Patch complete: 00:00:44.89
==> fasta/CAST_EiJ.patched.log <==
[g2gtools] Patched 20,668,274 SNPs total
[g2gtools] Patch complete: 00:00:50.04
Similar to #19, but in this case, the same chromosome names are used, so that shouldn't be the issue. I'm using the genome that was used to create the VCF files, as provided by Sanger.
Using g2gtools 0.2.7 installed via
conda install -c kbchoi g2gtools=0.2.7=py36_0
.If I download these VCF and FASTA for CAST:
ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.snps.dbSNP142.vcf.gz ftp://ftp-mouse.sanger.ac.uk/REL-1505-SNPs_Indels/strain_specific_vcfs/CAST_EiJ.mgp.v5.indels.dbSNP142.normed.vcf.gz ftp://ftp-mouse.sanger.ac.uk/ref/GRCm38_68.fa
And then start the first two steps to create a transcriptome:
The log output says:
And the output FASTA is equivalent to reference.
If I exclude the indels file, I get:
I've tried re-ordering the SNP and indels file, and I've tried with and without the pass or quality filters. I've also tried merging the two VCF into one file (removing the header of the second one).