Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0

call rate filtering #12

Closed stephenturner closed 6 years ago

stephenturner commented 6 years ago

Hi Ryan.

We've run into a use case where we're generating lots of VCFs from a directory full of GTCs then doing some merging and postprocessing from there. Except we're getting derailed by samples that have a low call rate, and presumably bad genotypes at sites that are called (when I say low, I mean like <90%, <70%, etc., really low). I know this could probably be addressed upstream somehow, and certainly downstream - I could check stats on call rate and implement some sample selection with bcftools or something else. But as I'm guessing the call rate is built into the GTC file somewhere itself, could the script be modified with a flag that will skip over any GTC files having a call rate below a threshold defined as an option? Are there better ways of handling this that you could envision?

Thanks,

Stephen

KelleyRyanM commented 6 years ago

Hi Stephen, it should be possible to add call-rate filtering; however, it might make more sense for that functionality to live in a separate script outside of the GTCtoVCF conversion. Since the GTC format is internally indexed and the call rate is pre-computed, reading it is a very fast operation, so pulling the filter out into a separate step shouldn't introduce any computational inefficiency.

The "GenotypeCalls" object in the https://github.com/Illumina/BeadArrayFiles package takes the path to a *.gtc file in its constructor and has a "get_call_rate" method to query the call rate. For simplicity, the GTCtoVCF script has its own copy of the BeadArrayFiles library (https://github.com/Illumina/GTCtoVCF/blob/develop/IlluminaBeadArrayFiles.py), so you can just use that version. I've included a rough example of what this might look like in the attached file, filter_gtc.zip.
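
For readers without the attachment, here is a minimal sketch of the approach described above. It is not the contents of filter_gtc.zip. It assumes IlluminaBeadArrayFiles.py from the GTCtoVCF repository is importable, and that "get_call_rate" returns a fraction between 0 and 1 (worth confirming on your own files before picking a threshold). The function names "read_call_rate" and "split_by_call_rate", and the command-line arguments, are hypothetical.

```python
import glob
import os
import sys


def read_call_rate(gtc_path):
    """Return the pre-computed call rate stored in a GTC file."""
    # Imported lazily so the filtering logic below can be exercised
    # without the Illumina module on the path.
    from IlluminaBeadArrayFiles import GenotypeCalls
    return GenotypeCalls(gtc_path).get_call_rate()


def split_by_call_rate(gtc_paths, min_call_rate, reader=read_call_rate):
    """Split paths into (passed, failed) lists of (path, call_rate) tuples.

    A sample passes when its call rate is >= min_call_rate.
    `reader` is any callable mapping a path to a call rate.
    """
    passed, failed = [], []
    for path in gtc_paths:
        rate = reader(path)
        if rate >= min_call_rate:
            passed.append((path, rate))
        else:
            failed.append((path, rate))
    return passed, failed


def main(gtc_dir, threshold):
    """Print passing GTC paths to stdout and report skipped ones on stderr."""
    paths = sorted(glob.glob(os.path.join(gtc_dir, "*.gtc")))
    passed, failed = split_by_call_rate(paths, threshold)
    for path, rate in failed:
        sys.stderr.write("skipping %s (call rate %.4f)\n" % (path, rate))
    for path, _rate in passed:
        print(path)


# Hypothetical usage: python filter_gtc_sketch.py <gtc_dir> <min_call_rate>
if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv[1], float(sys.argv[2]))
```

The list of passing paths printed to stdout can then be used to decide which GTC files to hand to the conversion step, which keeps the filter separate from GTCtoVCF itself.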

stephenturner commented 6 years ago

Thanks Ryan, will take a look.