abhi18av / drug-resistance-prediction-cambiohack

Predicting drug resistance using H20.ai
https://cambiohack.uk/drug-resistance-prediction-using-wgs-data
MIT License
2 stars 4 forks source link

Feature Engineering for SNP/ VCF #2

Closed abhi18av closed 4 years ago

abhi18av commented 4 years ago

There's a lot which we could do in feature engineering, however since the scope of the hackathon is limited - let's focus on completing the process first and then iterate.

Here's the general direction

jamesbaye commented 4 years ago

For some more options see Kouchaki2020 (materials & methods - 2.6 feature spaces).

abhi18av commented 4 years ago

@jamesbaye , thanks for the pointer!

Yesterday, I managed to complete the merge process. Here's a sample,

https://github.com/abhi18av/drug-resistance-prediction-cambiohack/pull/8

Today we can take this opportunity to try out some feature engineering approaches and complete a couple iterations of model development.

abhi18av commented 4 years ago

@jamesbaye , this is a good resource for understanding the important fields from the VCF file.

https://samtools.github.io/hts-specs/VCFv4.2.pdf

INFO on page-5 is a good point to start interpreting the data and to understand whether we could create features out of some of these fields like ERRXXX_INFO_GT etc.

abhi18av commented 4 years ago

Here's a sample on how we can build upon the cohort SNP analysis

# PyVCF package
import vcf

vcf_reader = vcf.Reader(open("./cohort.bqsr.filter.snps.vcf"), 'r')

for record in vcf_reader:
    print(record.POS, " : ", record.alleles, " : ", record.samples[0].gt_alleles)

And here's the output

POS   :  ALLELES   :  ERRxxxx
1977  :  ['A', G]  :  ['1', '1']  
4013  :  ['T', C]  :  ['1', '1']  
5780  :  ['G', C]  :  ['1', '1']
7362  :  ['G', C]  :  ['1', '1'] 
7585  :  ['G', C]  :  ['1', '1'] 

Now, we can think about how to really transform this into a suitable vector per genome sample.

abhi18av commented 4 years ago

So, I think I've managed to come up with an initial approach for iteration-1 of machine learning stage.

The above data could be transformed in the following format

SAMPLE_ID : 1977_REF_A : 1977_ALT_A : 1977_REF_G : 1977_ALT_G : 1977_REF_T : 1977_ALT_T : 1977_REF_C : 1977_REF_T : 4033...

ERRxxxx     :    1     :    0    :        1     :          0  :         0  :        0  :          0    : ...

SRRxxxx     :    0     :    1    :        0     :          0  :         0  :        0  :          0    : ...
.
.
.

For each one of these rows, there's a corresponding status of Drug Resistance such as Susceptible (S) | Mono drug-resistant (MR) | Multi drug-resistant (MDR) | Extremely drug-resistant (XDR) which forms our target label. Therefore, this is essentially a multi-class/label classification problem.

Please note that, I've ignored the complications in the ALLELES columns and removed the rows with NONE values for a sample. I'm guessing this is a naive approach, however for iteration-1, I feel that this is sufficient.

@jamesbaye , now that you have some background on the feature engineering and the choices we need to make, I'm confident that you can build upon this process in a more informed manner. Do keep me posted 👍

jamesbaye commented 4 years ago

@abhi18av Nice work. Can you PR your current code?

I'm not sure I quite understand the purpose of the current merging step with gatk. Can we not just produce a script to reduce each sample file into a feature vector and combine all vectors into a matrix? Or does the gatk merging do something else, such as extra QC or filtration?

abhi18av commented 4 years ago

Sure @jamesbaye , the GATK script is over here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/78f45ec164943fd0f535c6f314081a038db53a14/_scratch/nyu_gatk.sh

I decided to rely on GATK since its algorithm is based on the GVCF format which is an enhancement for the VCF standard with some metadata. Because of this metadata, it is able to do accurate mapping for a particular position POS.

Later on, in the process, I think we could also do loci specific optimizations as well such as higher weightage to mutations in POS belonging to a specific range.

Can we not just produce a script to reduce each sample file into a feature vector and combine all vectors into a matrix?

Yes, this is possible and I'm curious to see how @dkiberu did that. See here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/pull/4#issue-490388996

jamesbaye commented 4 years ago

I'm thinking that we can make the feature space more compact if each dimension is the number of ALT in that sample at a given POS (0 if no snp, 1 if het, 2 if homozygous). We could group by POS only, or by POS and ALT.

Taking the first lines of the vcf as an example:

#CHROM      POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ERR3148226  ERR3148228
NC000962_3  1977    .   A   G   3401.13 PASS    AC=4;AF=1.00;AN=4;DP=93;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=1.046    GT:AD:DP:GQ:PL  1/1:0,71:71:99:2688,214,0   1/1:0,19:19:57:729,57,0
NC000962_3  4013    .   T   C   5721.13 PASS    AC=4;AF=1.00;AN=4;DP=170;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.46;SOR=0.742   GT:AD:DP:GQ:PL  1/1:0,112:112:99:3903,334,0 1/1:0,54:54:99:1834,162,0
NC000962_3  5780    .   G   C   3006.48 PASS    AC=2;AF=0.500;AN=4;BaseQRankSum=-7.940e-01;DP=124;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;MQRankSum=0.00;QD=34.16;ReadPosRankSum=-8.480e-01;SOR=0.319    GT:AD:DP:GQ:PL  1/1:1,87:88:99:3022,222,0   0/0:35,0:35:99:0,102,1530
NC000962_3  7362    .   G   C   5977.13 PASS    AC=4;AF=1.00;AN=4;DP=183;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=33.39;SOR=0.774   GT:AD:DP:GQ:PL  1/1:0,123:123:99:4252,368,0 1/1:0,56:56:99:1741,168,0
NC000962_3  7585    .   G   C   6891.13 PASS    AC=4;AF=1.00;AN=4;DP=210;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.80;SOR=0.734   GT:AD:DP:GQ:PL  1/1:0,156:156:99:5480,468,0 1/1:0,42:42:99:1427,126,0
NC000962_3  9304    .   G   A   6843.13 PASS    AC=4;AF=1.00;AN=4;DP=204;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.39;SOR=0.877   GT:AD:DP:GQ:PL  1/1:0,167:167:99:5766,497,0 1/1:0,32:32:96:1093,96,0
NC000962_3  11370   .   C   T   5701.13 PASS    AC=4;AF=1.00;AN=4;DP=174;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=33.34;SOR=0.996   GT:AD:DP:GQ:PL  1/1:0,133:133:99:4309,390,0 1/1:0,38:38:99:1408,114,0
NC000962_3  11879   .   A   G   6404.13 PASS    AC=4;AF=1.00;AN=4;DP=195;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=28.73;SOR=1.759   GT:AD:DP:GQ:PL  1/1:0,140:140:99:4958,418,0 1/1:0,40:40:99:1462,120,0
NC000962_3  13636   .   T   C   1627.48 PASS    AC=2;AF=0.500;AN=4;DP=135;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=32.55;SOR=1.270 GT:AD:DP:GQ:PL  0/0:83,0:83:99:0,120,1800   1/1:0,50:50:99:1643,149,0
NC000962_3  14079   .   A   G   1004.48 PASS    AC=2;AF=0.500;AN=4;DP=110;ExcessHet=0.7918;FS=0.000;MLEAC=2;MLEAF=0.500;MQ=60.00;QD=30.97;SOR=1.101 GT:AD:DP:GQ:PL  0/0:83,0:83:99:0,120,1800   1/1:0,27:27:81:1020,81,0
NC000962_3  14785   .   T   C   8998.13 PASS    AC=4;AF=1.00;AN=4;DP=268;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=34.74;SOR=0.960   GT:AD:DP:GQ:PL  1/1:0,219:219:99:7722,657,0 1/1:0,40:40:99:1292,120,0

Both samples have homozygous snps for all positions except ERR3148226 at positions 13636 and 14079. So the table would look like, grouping by POS only:

SAMPLEID / POS    1977    4013     ...     11879    13636    14079    14785
ERR3148226    2    2    ...    2    0     0     2
ERR3148228    2    2    ...    2    2     2     2

And if we want to preserve the information of which ALT it is, grouping by POS and ALT:

SAMPLEID / POS_ALT    1977_G    4013_C     ...     11879_G    13636_C    14079_G    14785_C
ERR3148226    2    2    ...    2    0     0     2
ERR3148228    2    2    ...    2    2     2     2
abhi18av commented 4 years ago

@jamesbaye , great insight!

Yes, let's do that!

abhi18av commented 4 years ago

10 addresses the idea put forward here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/issues/2#issuecomment-696654010

I think we might need to transpose the matrix and then add the corresponding Resistance tags from tbprofiler json and then we're good to go.

The other alternatives would be to simply rely on labels like Homozygous and Heterozygous, which would give us binary variables.

abhi18av commented 4 years ago

@dkiberu simply used the following to concatenate. https://github.com/abhi18av/drug-resistance-prediction-cambiohack/pull/4#issuecomment-696904959

$ grep -v '#' gatkVcfs/* > comb.vcf
abhi18av commented 4 years ago

Marking this as done - the current generalized approached is modelled here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/master/src/notebooks/feature_engineering.ipynb