Predicting drug resistance using Machine Learning
I've documented some ideas here
https://github.com/abhi18av/drug-resistance-prediction-cambiohack/projects/1
Download and prepare the variant calling and drug resistance results - DONE
Download all VCF files for samples - DONE
Download results of tb-profiler for these samples - DONE
Synchronize these files by their common genome IDs - DONE
Extract the SNPs from the synced VCFs - DONE
Extract the resistance- and lineage-related fields from the synced tb-profiler results - DONE
Merge the filtered SNPs from the VCF files - DONE
You can get the results of this stage through this link
https://1drv.ms/u/s!AtDyzJXLzSCVgaBRAOeffZf3Zi6QtA?e=bwp8P5
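
The synchronization and SNP filtering steps of this stage could look roughly like the sketch below. It is only an illustration, not the code used in the project: the function names, the sample IDs, and the example VCF records are all hypothetical, and a SNP is assumed to be a record whose REF and every ALT allele are single nucleotides.

```python
def common_genome_ids(vcf_ids, profiler_ids):
    """Return the sorted genome IDs present in both result sets."""
    return sorted(set(vcf_ids) & set(profiler_ids))


def is_snp(ref, alt):
    """True when REF and every ALT allele are single nucleotides."""
    bases = {"A", "C", "G", "T"}
    return ref.upper() in bases and all(a.upper() in bases for a in alt.split(","))


def filter_snps(vcf_lines):
    """Keep header lines and the records whose variant is a SNP."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        # Standard VCF column order: CHROM, POS, ID, REF, ALT, ...
        if len(fields) >= 5 and is_snp(fields[3], fields[4]):
            kept.append(line)
    return kept


# Hypothetical IDs and records, for illustration only.
vcf_ids = ["ERR001", "ERR002", "ERR003"]
profiler_ids = ["ERR002", "ERR003", "ERR004"]
shared = common_genome_ids(vcf_ids, profiler_ids)

example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "Chromosome\t1234\t.\tC\tT\t50\tPASS\t.",   # single-base substitution, kept
    "Chromosome\t5678\t.\tAT\tA\t50\tPASS\t.",  # deletion, dropped
]
snps_only = filter_snps(example_vcf)
```

Only the genomes in `shared` would be carried forward, so that every VCF has a matching tb-profiler result.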
Do feature engineering to obtain a format suitable for machine learning
Split the final dataset into training and test sets (70/30 split)
Train the Random Forest algorithm on the training dataset
Evaluate the model's performance using the AUC metric
Iterate on the feature engineering, splitting, training and evaluation steps until satisfactory results are achieved
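
As a rough sketch of the remaining steps, the snippet below shows one possible feature encoding (a 0/1 matrix marking which SNPs each sample carries), a 70/30 split, and a plain AUC calculation. Everything here is an assumption made for illustration: the presence/absence encoding, the function names, and the sample data are hypothetical, and the Random Forest itself is not implemented.

```python
import random


def build_feature_matrix(snps_by_sample):
    """Encode each sample as a 0/1 vector over all observed SNPs."""
    all_snps = sorted({snp for snps in snps_by_sample.values() for snp in snps})
    sample_ids = sorted(snps_by_sample)
    matrix = [
        [1 if snp in snps_by_sample[sample] else 0 for snp in all_snps]
        for sample in sample_ids
    ]
    return sample_ids, all_snps, matrix


def train_test_split_ids(sample_ids, test_fraction=0.3, seed=0):
    """Shuffle the IDs reproducibly and hold out test_fraction of them."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def roc_auc(labels, scores):
    """AUC as the chance that a resistant sample outscores a susceptible one.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical SNP calls keyed by sample ID: (position, ref, alt).
snps_by_sample = {
    "ERR001": {(1234, "C", "T")},
    "ERR002": {(1234, "C", "T"), (5678, "G", "A")},
    "ERR003": set(),
}
sample_ids, all_snps, matrix = build_feature_matrix(snps_by_sample)

# Hypothetical cohort of ten samples, split 70/30.
train_ids, test_ids = train_test_split_ids([f"S{i}" for i in range(10)])

# Made-up labels (1 = resistant) and scores, only to exercise roc_auc.
example_auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1])
```

In practice the scores passed to `roc_auc` would be the predicted resistance probabilities from the trained model on the test set, for example from scikit-learn's `RandomForestClassifier.predict_proba`, and the split and metric would more likely come from `train_test_split` and `roc_auc_score` in the same library.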