Closed abhi18av closed 4 years ago
For some more options see Kouchaki2020 (materials & methods - 2.6 feature spaces).
@jamesbaye , thanks for the pointer!
Yesterday, I managed to complete the merge process. Here's a sample,
https://github.com/abhi18av/drug-resistance-prediction-cambiohack/pull/8
Today we can take this opportunity to try out some feature engineering approaches and complete a couple iterations of model development.
@jamesbaye , this is a good resource for understanding the important fields from the VCF
file.
https://samtools.github.io/hts-specs/VCFv4.2.pdf
INFO
on page-5 is a good point to start interpreting the data and to understand whether we could create features out of some of these fields like ERRXXX_INFO_GT
etc.
Here's a sample on how we can build upon the cohort SNP
analysis
# PyVCF package
import vcf
vcf_reader = vcf.Reader(open("./cohort.bqsr.filter.snps.vcf"), 'r')
for record in vcf_reader:
print(record.POS, " : ", record.alleles, " : ", record.samples[0].gt_alleles)
And here's the output
POS : ALLELES : ERRxxxx
1977 : ['A', G] : ['1', '1']
4013 : ['T', C] : ['1', '1']
5780 : ['G', C] : ['1', '1']
7362 : ['G', C] : ['1', '1']
7585 : ['G', C] : ['1', '1']
Now, we can think about how to really transform this into a suitable vector per genome sample.
So, I think I've managed to come up with an initial approach for iteration-1 of machine learning stage.
The above data could be transformed in the following format
SAMPLE_ID : 1977_REF_A : 1977_ALT_A : 1977_REF_G : 1977_ALT_G : 1977_REF_T : 1977_ALT_T : 1977_REF_C : 1977_REF_T : 4033...
ERRxxxx : 1 : 0 : 1 : 0 : 0 : 0 : 0 : ...
SRRxxxx : 0 : 1 : 0 : 0 : 0 : 0 : 0 : ...
.
.
.
For each one of these rows, there's a corresponding status of Drug Resistance such as Susceptible (S) | Mono drug-resistant (MR) | Multi drug-resistant (MDR) | Extremely drug-resistant (XDR)
which forms our target label. Therefore, this is essentially a multi-class/label classification problem.
Please note that, I've ignored the complications in the ALLELES
columns and removed the rows with NONE
values for a sample. I'm guessing this is a naive approach, however for iteration-1
, I feel that this is sufficient.
@jamesbaye , now that you have some background on the feature engineering and the choices we need to make, I'm confident that you can build upon this process in a more informed manner. Do keep me posted 👍
@abhi18av Nice work. Can you PR your current code?
I'm not sure I quite understand the purpose of the current merging step with gatk. Can we not just produce a script to reduce each sample file into a feature vector and combine all vectors into a matrix? Or does the gatk merging do something else, such as extra QC or filtration?
Sure @jamesbaye , the GATK script is over here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/78f45ec164943fd0f535c6f314081a038db53a14/_scratch/nyu_gatk.sh
I decided to rely on GATK
since its algorithm is based on the GVCF
format which is an enhancement for the VCF
standard with some metadata. Because of this metadata, it is able to do accurate mapping for a particular position POS
.
Later on, in the process, I think we could also do loci
specific optimizations as well such as higher weightage to mutations in POS belonging to a specific range
.
Can we not just produce a script to reduce each sample file into a feature vector and combine all vectors into a matrix?
Yes, this is possible and I'm curious to see how @dkiberu did that. See here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/pull/4#issue-490388996
I'm thinking that we can make the feature space more compact if each dimension is the number of ALT
in that sample at a given POS
(0 if no snp, 1 if het, 2 if homozygous). We could group by POS
only, or by POS
and ALT
.
Taking the first lines of the vcf as an example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERR3148226 ERR3148228
NC000962_3 1977 . A G 3401.13 PASS AC=4;AF=1.00;AN=4;DP=93;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=1.046 GT:AD:DP:GQ:PL 1/1:0,71:71:99:2688,214,0 1/1:0,19:19:57:729,57,0
NC000962_3 4013 . T C 5721.13 PASS AC=4;AF=1.00;AN=4;DP=170;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.46;SOR=0.742 GT:AD:DP:GQ:PL 1/1:0,112:112:99:3903,334,0 1/1:0,54:54:99:1834,162,0
NC000962_3 5780 . G C 3006.48 PASS AC=2;AF=0.500;AN=4;BaseQRankSum=-7.940e-01;DP=124;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=34.16;ReadPosRankSum=-8.480e-01;SOR=0.319 GT:AD:DP:GQ:PL 1/1:1,87:88:99:3022,222,0 0/0:35,0:35:99:0,102,1530
NC000962_3 7362 . G C 5977.13 PASS AC=4;AF=1.00;AN=4;DP=183;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=33.39;SOR=0.774 GT:AD:DP:GQ:PL 1/1:0,123:123:99:4252,368,0 1/1:0,56:56:99:1741,168,0
NC000962_3 7585 . G C 6891.13 PASS AC=4;AF=1.00;AN=4;DP=210;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.80;SOR=0.734 GT:AD:DP:GQ:PL 1/1:0,156:156:99:5480,468,0 1/1:0,42:42:99:1427,126,0
NC000962_3 9304 . G A 6843.13 PASS AC=4;AF=1.00;AN=4;DP=204;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.39;SOR=0.877 GT:AD:DP:GQ:PL 1/1:0,167:167:99:5766,497,0 1/1:0,32:32:96:1093,96,0
NC000962_3 11370 . C T 5701.13 PASS AC=4;AF=1.00;AN=4;DP=174;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=33.34;SOR=0.996 GT:AD:DP:GQ:PL 1/1:0,133:133:99:4309,390,0 1/1:0,38:38:99:1408,114,0
NC000962_3 11879 . A G 6404.13 PASS AC=4;AF=1.00;AN=4;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=28.73;SOR=1.759 GT:AD:DP:GQ:PL 1/1:0,140:140:99:4958,418,0 1/1:0,40:40:99:1462,120,0
NC000962_3 13636 . T C 1627.48 PASS AC=2;AF=0.500;AN=4;DP=135;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=32.55;SOR=1.270 GT:AD:DP:GQ:PL 0/0:83,0:83:99:0,120,1800 1/1:0,50:50:99:1643,149,0
NC000962_3 14079 . A G 1004.48 PASS AC=2;AF=0.500;AN=4;DP=110;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=30.97;SOR=1.101 GT:AD:DP:GQ:PL 0/0:83,0:83:99:0,120,1800 1/1:0,27:27:81:1020,81,0
NC000962_3 14785 . T C 8998.13 PASS AC=4;AF=1.00;AN=4;DP=268;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.74;SOR=0.960 GT:AD:DP:GQ:PL 1/1:0,219:219:99:7722,657,0 1/1:0,40:40:99:1292,120,0
Both samples have homozygous snps for all positions except ERR3148226 at positions 13636 and 14079. So the table would look like, grouping by POS only:
SAMPLEID / POS 1977 4013 ... 11879 13636 14079 14785
ERR3148226 2 2 ... 2 0 0 2
ERR3148228 2 2 ... 2 2 2 2
And if we want to preserve the information of which ALT it is, grouping by POS and ALT:
SAMPLEID / POS_ALT 1977_G 4013_C ... 11879_G 13636_C 14079_G 14785_C
ERR3148226 2 2 ... 2 0 0 2
ERR3148228 2 2 ... 2 2 2 2
@jamesbaye , great insight!
Yes, let's do that!
I think we might need to transpose the matrix and then add the corresponding Resistance
tags from tbprofiler json
and then we're good to go.
The other alternatives would be to simply rely on labels like Homozygous
and Heterozygous
, which would give us binary variables.
@dkiberu simply used the following to concatenate. https://github.com/abhi18av/drug-resistance-prediction-cambiohack/pull/4#issuecomment-696904959
$ grep -v '#' gatkVcfs/* > comb.vcf
Marking this as done - the current generalized approached is modelled here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/master/src/notebooks/feature_engineering.ipynb
There's a lot which we could do in feature engineering, however since the scope of the hackathon is limited - let's focus on completing the process first and then iterate.
Here's the general direction
SNP VCF
files into a matrix and continue for iteration-1