Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0
41 stars 30 forks source link

reduce manifest to extract subset? #1

Closed stephenturner closed 6 years ago

stephenturner commented 6 years ago

Thanks for making this publicly available!

I'm testing this on some Omni 5 data. Taking about 20 minutes per sample. If I only want to extract a subset of SNPs (e.g., just SNPs with rs numbers), if I reduce the manafest CSV to contain only the SNPs I want in the final VCF, will this run at all (without error), and will this reduce processing/IO time?

KelleyRyanM commented 6 years ago

Hi Stephen, If you are running on GTC (and not generating VCF representation of the manifest alone), the GTC files need to be created with the same manifest that is used as input to the converter. There are a few potential options

stephenturner commented 6 years ago

Thanks for the quick response.

Cut down the manifest content and then regenerate the associated GTC files. A potential issue is that you would need to use the BPM format manifest as input to AutoConvert to generate the GTC files. Creating the BPM format manifest requires Illumina support, but is straightforward.

This sounds attractive on one hand, as it would require less storage space and faster processing time. However, the GTC files that are generated would be missing info that I might need to come back to later on. Necessitating regenerating new GTCs with a different manifest, different versions, etc. This could get ugly.

Alternatively, I could explore an additional option to operate on a subset of the content. Before going down this route, I might want to explore some additional performance profiling on the Omni5 to ensure this would have the desired impact on runtime performance.

This would be superb if there's a performance boost to be had. I'm using https://github.com/Illumina/akt for some analyses, and I don't need to bring in all the variants on the platform for some of the analyses I'm doing. Now I'm using bcftools merge on all the GTCtoVCF created VCF files followed by a bcftools view -T to extract SNPs of interest. But perhaps there's a quicker way.

KelleyRyanM commented 6 years ago

The feature branch here (https://github.com/Illumina/GTCtoVCF/tree/feature/loci_filter) contains a new command-line option to filter out loci from the result. The names are expected to exactly match a name from the manifest. Take a look and let me know if it is helpful.

The performance improvement is not overwhelming. I imagine the biggest performance bottle neck is reading the manifest into memory. This could be alleviated by allowing the software to handle multiple GTC files in one invocation, which would avoid repeatedly reading the manifest from disk. This would be a straightforward change so I may take a look at that.

stephenturner commented 6 years ago

Thanks @KelleyRyanM -- I did see this, and I am going to test this asap. Got caught up in another firefight. Thanks so much for implementing this.

stephenturner commented 6 years ago

Perhaps I'm not invoking this correctly.

I took the original lines from the log file:

...
2018-01-24 16:36:05,698 - GTC converter - WARNING - Reference allele is not queried for locus: rs35918394
2018-01-24 16:36:07,043 - GTC converter - WARNING - Reference allele is not queried for locus: rs2234998
2018-01-24 16:36:15,964 - GTC converter - WARNING - Reference allele is not queried for locus: rs3212608
2018-01-24 16:36:16,315 - GTC converter - WARNING - Reference allele is not queried for locus: rs15878
2018-01-24 16:36:19,036 - GTC converter - WARNING - Reference allele is not queried for locus: rs2308765
2018-01-24 16:36:19,528 - GTC converter - WARNING - Reference allele is not queried for locus: kgp22729018
...

... awked out the last field, put that into a text file, then added --filter-loci bad-ref-alleles.txt to the end of my command. I.e.,

time gtc_to_vcf.py --gtc-file chip_position.gtc --manifest-file mymanifest.bpm --genome-fasta-file /path/to/hs37d5.fa --output-vcf-file chip_position.gtc.vcf --skip-indels --log-file chip_position.gtc.log.txt --filter-loci bad-ref-alleles.txt

But I still see all the same WARNINGs in the log files.

stephenturner commented 6 years ago

Also, back to

I imagine the biggest performance bottle neck is reading the manifest into memory. This could be alleviated by allowing the software to handle multiple GTC files in one invocation, which would avoid repeatedly reading the manifest from disk. This would be a straightforward change so I may take a look at that.

This might be nice. Right now I'm running a script that finds GTC files, prints out the command, and I pipe that through GNU parallel.

#!/bin/bash

REF="/path/to/hs37d5.fa"
MANIFEST="/path/to/manifest.bpm"
BADSNPS="/path/to/badsnps"

for G in "$@"
do
    if [ ${G: -4} != ".gtc" ]; then continue; fi
    echo "time gtc_to_vcf.py --gtc-file $G --manifest-file $MANIFEST --genome-fasta-file $REF --output-vcf-file $G.vcf --skip-indels --log-file $G.log.txt"
done

Then thescript.sh *.gtc | parallel

KelleyRyanM commented 6 years ago

Can you give this another shot? There was an issue such that this would only work if the loci were taken from the beginning of the manifest.

stephenturner commented 6 years ago

beautiful, that did the trick.