Closed dkiberu closed 3 years ago
Thanks @dkiberu - let me have a look and then we can have a discussion for the next steps 👍
Hi @dkiberu ,
I have seen the notebook now and this is a good start of the univariate analysis!
I'm curious, could you please share the script you used to concatenate the VCF files? I have also managed to do this via the GATK tool, by relying on the GVCF format it produces. See here https://github.com/abhi18av/drug-resistance-prediction-cambiohack/blob/78f45ec164943fd0f535c6f314081a038db53a14/_scratch/nyu_gatk.sh#L38
In case the merge algorithm differs, could you please explore the analysis on this file today? (This was generated using GATK-based automated merging.)
Perhaps now we could also explore the INFO and the FORMAT fields in the VCF file and separate out the values corresponding to each genome. For example, the following chunk of the VCF table
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERR3148226 ERR3148228
NC000962_3 1977 . A G 3401.13 PASS AC=4;AF=1.00;AN=4;DP=93;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;MQ=60.00;QD=25.36;SOR=1.046 GT:AD:DP:GQ:PL 1/1:0,71:71:99:2688,214,0 1/1:0,19:19:57:729,57,0
... could be processed to be like
ERR3148226_INFO_AC ERR3148226_INFO_AF ERR3148226_FORMAT_GT
4 1 1/1
Moreover, I think that, beyond the univariate approach, we can explore other data visualization approaches, such as bivariate or multivariate ones, after the above transformation.
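The transformation described above could be sketched roughly as follows. This is only a minimal illustration, not the project's code: the function name `flatten_vcf_record` and its parameters are hypothetical, and it assumes whitespace-separated columns with no spaces inside any field. Note that INFO values describe the whole site rather than one sample, so the sketch simply repeats them under each sample's name to match the column layout proposed above. Values are kept as strings (for example `AF` stays `"1.00"`).

```python
# Header and record copied from the example in this thread.
HEADER = "#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ERR3148226 ERR3148228"
RECORD = (
    "NC000962_3 1977 . A G 3401.13 PASS "
    "AC=4;AF=1.00;AN=4;DP=93;ExcessHet=3.0103;FS=0.000;MLEAC=4;MLEAF=1.00;"
    "MQ=60.00;QD=25.36;SOR=1.046 "
    "GT:AD:DP:GQ:PL 1/1:0,71:71:99:2688,214,0 1/1:0,19:19:57:729,57,0"
)


def flatten_vcf_record(header_line, record_line,
                       info_keys=("AC", "AF"), format_keys=("GT",)):
    """Flatten one VCF record into <sample>_INFO_<key> / <sample>_FORMAT_<key> columns.

    Hypothetical helper: assumes whitespace-separated columns and a FORMAT column.
    """
    header = header_line.lstrip("#").split()
    row = dict(zip(header, record_line.split()))

    # INFO is a semicolon-separated list of KEY=VALUE pairs (flags have no '=').
    info = {}
    for item in row["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value

    # Every column after FORMAT is a sample; its values follow the FORMAT order.
    format_names = row["FORMAT"].split(":")
    samples = header[header.index("FORMAT") + 1:]

    flat = {}
    for sample in samples:
        sample_values = dict(zip(format_names, row[sample].split(":")))
        for key in info_keys:
            flat[f"{sample}_INFO_{key}"] = info.get(key)
        for key in format_keys:
            flat[f"{sample}_FORMAT_{key}"] = sample_values.get(key)
    return flat


if __name__ == "__main__":
    for column, value in flatten_vcf_record(HEADER, RECORD).items():
        print(column, value)
```

Collecting one such dictionary per record would give a table that is easier to feed into bivariate or multivariate plots.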
Hi Sharma, sorry I couldn't get to this today; hopefully I will tomorrow. As for the script, I just removed the headers and merged all the files:
$ grep -v '#' gatkVcfs/* > comb.vcf
However, I think your approach is more sound.
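Two details of the `grep` command are worth keeping in mind: when given several files, `grep` prefixes each output line with the file name unless `-h` is passed, and `-v '#'` drops any line containing `#` anywhere, not only header lines. A rough Python equivalent that keeps only record lines might look like the sketch below. The function name `concat_vcf_records` is hypothetical. Like the `grep` approach, it simply stacks records from different files and does not merge samples into per-sample columns the way the GATK-based merge does.

```python
from pathlib import Path


def concat_vcf_records(vcf_paths, out_path):
    """Write the non-header lines of each VCF to out_path; return the record count.

    Hypothetical helper: header lines are those starting with '#'.
    """
    count = 0
    with open(out_path, "w") as out:
        for path in vcf_paths:
            with open(path) as handle:
                for line in handle:
                    # Skip meta-information/header lines and blank lines.
                    if line.startswith("#") or not line.strip():
                        continue
                    # Make sure every record ends with a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    count += 1
    return count


# Hypothetical usage, mirroring the names in the command above:
# concat_vcf_records(sorted(Path("gatkVcfs").glob("*")), "comb.vcf")
```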
(Sent by email in reply to Abhinav Sharma's comment of Tue, Sep 22, 2020, 8:16 AM.)
I understand @dkiberu , it's okay.
Let's make a final push today :)
This is a notebook with Python code for visualizing the distribution of VCF file features: QUAL, POS, and DP (depth). We need to relate these to the REF and ALT features if possible. P.S. The VCF files were combined into one (comb.vcf) using bash.
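One possible way to relate QUAL and DP to REF and ALT is to group the records by substitution (for example `A>G`) and summarize each group. The sketch below is only an illustration under stated assumptions, not the notebook's code: the function name `summarize_by_substitution` is hypothetical, it assumes headerless, whitespace-separated records in the standard column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO), and it reads depth from the `DP` key of the INFO field.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_substitution(record_lines):
    """Group records by 'REF>ALT' and report count, mean QUAL, and mean INFO/DP.

    Hypothetical helper: missing QUAL ('.') or DP values are skipped.
    """
    groups = defaultdict(lambda: {"n": 0, "QUAL": [], "DP": []})
    for line in record_lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 8:
            continue  # not a data record
        ref, alt, qual, info = fields[3], fields[4], fields[5], fields[7]

        # INFO is a semicolon-separated list of KEY=VALUE pairs.
        info_map = {}
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_map[key] = value

        group = groups[f"{ref}>{alt}"]
        group["n"] += 1
        if qual != ".":
            group["QUAL"].append(float(qual))
        if "DP" in info_map:
            group["DP"].append(int(info_map["DP"]))

    return {
        substitution: {
            "n": group["n"],
            "mean_QUAL": mean(group["QUAL"]) if group["QUAL"] else None,
            "mean_DP": mean(group["DP"]) if group["DP"] else None,
        }
        for substitution, group in groups.items()
    }


# Hypothetical usage with the combined file mentioned above:
# with open("comb.vcf") as handle:
#     summary = summarize_by_substitution(handle)
```

The per-substitution lists could equally be passed to a plotting library to draw one QUAL or DP distribution per substitution type.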