Atkinson-Lab / Tractor

Scripts for implementing the Tractor pipeline
MIT License

Index error using imputed VCF when extracting tracts #31

Closed silviaadiz closed 3 months ago

silviaadiz commented 6 months ago

Hi! I am running into an error that a few earlier issues have already reported, but I have been troubleshooting it and still haven't worked it out. I am using imputed files (from TOPMed) that have been filtered (by MAF and INFO) using PLINK. RFMix handled these VCFs without problems, but when I run ExtractTracts.py, I get this message:

File "/mnt/lustre/scratch/nlsas/home/usc/gb/sdd/lat23/TRACTOR/Tractor/scripts/ExtractTracts.py", line 126, in extract_tracts
    geno_b = str(geno[1])

This is the VCF header:

##fileformat=VCFv4.3
##fileDate=20231123
##source=PLINKv2.00
##filedate=2023.3.13
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##INFO=
##pipeline=michigan-imputationserver-1.7.1
##imputation=minimac4-1.0.2
##phasing=eagle-2.4
##panel=apps@topmed-r2@1.0.0
##r2Filter=0.3
##contig=
##FORMAT=

Sample genotypes are split into columns by "\t", and the alleles within each genotype call are separated by "|". Everything works fine with the raw, unfiltered files from imputation, but running just chr22 takes a long time (and the output files are very large). I tried modifying line 87 of the script in case the "\t" separator between samples was the problem, but it still throws the error. I would much appreciate your help here!

Thank you! :)

nirav572 commented 6 months ago

Hi @silviaadiz,

There must be more to that error. Can you share the complete error output you get when running the script? Also, would it be possible to share a snapshot of the VCF file?

silviaadiz commented 6 months ago

Hi! Thanks for the quick reply.

This is the full error:

INFO (main 42): Creating output files for 3 ancestries
INFO (main 48): Opening input and output files for reading and writing
INFO (main 117): VCF position, 13014 is not in an msp window, skipping site
INFO (main 117): VCF position, 13104 is not in an msp window, skipping site
INFO (main 117): VCF position, 13105 is not in an msp window, skipping site
INFO (main 117): VCF position, 13119 is not in an msp window, skipping site
INFO (main 117): VCF position, 13150 is not in an msp window, skipping site
INFO (main 117): VCF position, 13167 is not in an msp window, skipping site
INFO (main 117): VCF position, 13192 is not in an msp window, skipping site
INFO (main 117): VCF position, 13222 is not in an msp window, skipping site
INFO (main 117): VCF position, 13293 is not in an msp window, skipping site
INFO (main 117): VCF position, 13301 is not in an msp window, skipping site
INFO (main 117): VCF position, 13311 is not in an msp window, skipping site
Traceback (most recent call last):
  File "/mnt/lustre/scratch/nlsas/home/usc/gb/sdd/lat23/TRACTOR/Tractor/scripts/ExtractTracts.py", line 184, in <module>
    extract_tracts(**vars(args))
  File "/mnt/lustre/scratch/nlsas/home/usc/gb/sdd/lat23/TRACTOR/Tractor/scripts/ExtractTracts.py", line 126, in extract_tracts
    geno_b = str(geno[1])
IndexError: list index out of range

FYI: so far I haven't seen the "skipping site" message when using the full VCF. This is what the filtered VCF looks like. If this screenshot is not enough, I can share more with you by email:

[image: screenshot of the filtered VCF]

nirav572 commented 6 months ago

Hi @silviaadiz,

I was unable to replicate the error. However, we have recently updated the scripts. Can you test again with the updated scripts? If the error persists, please email me a small snippet of your VCF file at nirav.shah@bcm.edu so that I can replicate it.

silviaadiz commented 5 months ago

Hi! Sorry for the late reply, I haven't been able to work on this until recently. Thank you for your help. I have run the new scripts but I still got the error, so I'm going to prepare a chunk of my VCF and send it to you. It might be related to how PLINK does the conversion to VCF, so I will also filter them with bcftools and check how that goes.

Thank you again, Silvia

nirav572 commented 4 months ago

Any updates @silviaadiz?

nirav572 commented 3 months ago

The issue was resolved via email. The error was not caused by the imputed data, but rather by occasional unphased genotypes present in a file that appeared to contain phased genotypes.
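For anyone who lands here with the same IndexError, here is a rough sketch of how one might scan a VCF for unphased calls before running ExtractTracts.py. It assumes, based only on the traceback line `geno_b = str(geno[1])` and the description above, that the script splits each genotype on "|"; under that assumption an unphased call such as "0/1" splits into a single element, so index 1 is out of range. The helper names `find_unphased` and `open_vcf` are hypothetical and not part of Tractor.

```python
import gzip
from typing import Iterable, Iterator, Tuple


def open_vcf(path: str):
    """Open a plain or gzip/bgzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def find_unphased(lines: Iterable[str]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (chrom, pos, sample, genotype) for every GT that is not 'a|b'.

    Note: this also flags haploid or missing calls such as '0' or './.',
    since they do not split into two alleles on '|' either.
    """
    samples = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs start at the 10th column
            continue
        for sample, call in zip(samples, fields[9:]):
            gt = call.split(":", 1)[0]  # GT is the first FORMAT key when present
            if len(gt.split("|")) != 2:
                yield fields[0], fields[1], sample, gt


# Tiny in-memory example: one unphased call hidden among phased ones.
demo_lines = [
    "##fileformat=VCFv4.3",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr22\t100\t.\tA\tG\t.\t.\t.\tGT\t0|1\t1|1",
    "chr22\t200\t.\tC\tT\t.\t.\t.\tGT\t0/1\t0|0",
]
print(list(find_unphased(demo_lines)))  # [('chr22', '200', 'S1', '0/1')]
```

On a real file, something like `for hit in find_unphased(open_vcf("filtered.vcf.gz")): print(hit)` would list the offending sites; the file name is a placeholder. Sites that show up here would need to be re-phased or dropped before extracting tracts.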