abhi18av / drug-resistance-prediction-cambiohack

Predicting drug resistance using H2O.ai
https://cambiohack.uk/drug-resistance-prediction-using-wgs-data
MIT License

Data visualization for the VCF files. #4

Closed. dkiberu closed this 3 years ago

dkiberu commented 3 years ago

This is a notebook with Python code for visualizing the distributions of three VCF file features: QUAL, POS, and DP (depth). We need to relate these to the REF and ALT features if possible. P.S. The VCF files were combined into one (comb.vcf) using bash.

abhi18av commented 3 years ago

Thanks @dkiberu - let me have a look and then we can have a discussion for the next steps 👍

abhi18av commented 3 years ago

Hi @dkiberu,

I have seen the notebook now and this is a good start on the univariate analysis!

I'm curious, could you please share the script you used to concatenate the VCF files? I have also managed to do this via the GATK tool, by relying on the GVCF format it produces. See here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/78f45ec164943fd0f535c6f314081a038db53a14/_scratch/nyu_gatk.sh#L38

In case the merge algorithm differs, could you please run the analysis on this file today? (It was generated using GATK-based automated merging.)

https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/9276df76646c74100cbbdb6344f4a0cb9b53dcea/_resources/synced/snpFiles/cohort.bqsr.filter.snps.vcf

Perhaps now we could also explore the INFO and the FORMAT fields in the VCF file and separate out the values corresponding to each genome. For example, the following chunk of the VCF table

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ERR3148226  ERR3148228
NC000962_3  1977    .   A   G   3401.13 PASS    AC=4;AF=1.00;AN=4;DP=93;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=1.046    GT:AD:DP:GQ:PL  1/1:0,71:71:99:2688,214,0   1/1:0,19:19:57:729,57,0

... could be processed to be like

ERR3148226_INFO_AC  ERR3148226_INFO_AF  ERR3148226_FORMAT_GT
4                   1                   1/1

Moreover, I think that after the above transformation we can go beyond the univariate approach and explore bivariate or multivariate visualizations.
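The transformation described above could be sketched roughly as follows. This is only a minimal illustration, not code from the repository: the function name `flatten_vcf_record` is hypothetical, and it assumes fields can be split on whitespace (real VCF files are tab-separated). Note that INFO values describe the whole site rather than one genome, so repeating them under each sample name, as in the proposed column layout, duplicates the same value per sample.

```python
def flatten_vcf_record(header_line, data_line):
    """Flatten one VCF data line into '<sample>_INFO_<key>' / '<sample>_FORMAT_<key>' columns.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    columns = header_line.lstrip("#").split()
    record = dict(zip(columns, data_line.split()))
    samples = columns[9:]  # sample columns follow the nine fixed VCF columns

    # INFO is a ';'-separated list of KEY=VALUE pairs; flags carry no '='.
    info = {}
    for item in record["INFO"].split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True

    # FORMAT names the ':'-separated sub-fields found in every sample column.
    format_keys = record["FORMAT"].split(":")

    flat = {name: record[name] for name in ("CHROM", "POS", "REF", "ALT", "QUAL")}
    for sample in samples:
        for key, value in info.items():
            flat[f"{sample}_INFO_{key}"] = value
        for key, value in zip(format_keys, record[sample].split(":")):
            flat[f"{sample}_FORMAT_{key}"] = value
    return flat


# The header and record quoted in this thread.
HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERR3148226 ERR3148228"
DATA = (
    "NC000962_3 1977 . A G 3401.13 PASS "
    "AC=4;AF=1.00;AN=4;DP=93;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;"
    "MQ=60.00;QD=25.36;SOR=1.046 "
    "GT:AD:DP:GQ:PL 1/1:0,71:71:99:2688,214,0 1/1:0,19:19:57:729,57,0"
)

flat = flatten_vcf_record(HEADER, DATA)
for column in ("ERR3148226_INFO_AC", "ERR3148226_INFO_AF", "ERR3148226_FORMAT_GT"):
    print(column, flat[column])
```

All values are kept as strings here; converting numeric fields (for example with `float`) would be a natural next step before plotting.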

dkiberu commented 3 years ago

Hi Sharma, sorry I couldn't get to this today; hopefully I will tomorrow. As for the script, I just removed the headers and merged all the files:

$ grep -v '#' gatkVcfs/* > comb.vcf

However, I think your approach is more sound.
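One thing to watch with the `grep` approach: when given several files, `grep` prefixes every output line with the file name unless `-h` is passed, and the pattern `'#'` drops any line containing `#`, not only header lines. A minimal Python sketch of the same idea is below. The function names and the extra source column are assumptions for illustration, not part of the original workflow, and plain concatenation is still no substitute for a proper GATK merge.

```python
from pathlib import Path


def iter_vcf_data_lines(vcf_dir, pattern="*.vcf"):
    """Yield (file_name, data_line) for every non-header line in each VCF file."""
    for path in sorted(Path(vcf_dir).glob(pattern)):
        with path.open() as handle:
            for line in handle:
                # Header and metadata lines start with '#'; also skip blank lines.
                if line.startswith("#") or not line.strip():
                    continue
                yield path.name, line.rstrip("\n")


def write_combined(vcf_dir, out_path):
    """Write all data lines to out_path, prefixed by a source-file column.

    Keep out_path outside vcf_dir so the output is not picked up as an input.
    Returns the number of data lines written.
    """
    count = 0
    with open(out_path, "w") as out:
        for source, line in iter_vcf_data_lines(vcf_dir):
            out.write(f"{source}\t{line}\n")
            count += 1
    return count


# Example call (hypothetical output name):
# write_combined("gatkVcfs", "comb.tsv")
```

Keeping the file name as an explicit first column preserves which genome each record came from, which the header-stripped concatenation otherwise loses.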

On Tue, Sep 22, 2020, 8:16 AM Abhinav Sharma notifications@github.com wrote:


abhi18av commented 3 years ago

I understand, @dkiberu, it's okay.

Let's make a final push today :)