Open hsienchao opened 1 year ago
Had a meeting with Anney:
The MAF file she provided only includes the variants in the khanlab database, but we need the MAF for all variants so we can use it for future data. She is evaluating the feasibility of this.
My colleague Keyur Talsania showed me a tool called Echtvar (https://github.com/brentp/echtvar) that can rapidly filter on allele frequency. To test this tool, I used 11 VCF files from sample CL0049. The filter I used is gnomad_popmax < 0.05. Note: the kept variants are those with gnomad_popmax < 0.05.
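For reference, a minimal sketch of how such a filter run could be assembled. It assumes the `echtvar anno -e <library.zip> -i <expression> <input> <output>` form shown in the Echtvar README. The input and output file names are hypothetical placeholders, and the field name in the expression depends on how the library was encoded.

```python
import shlex

def build_echtvar_cmd(library_zip, include_expr, in_vcf, out_bcf):
    """Return the argument list for an `echtvar anno` run that writes
    only the variants matching include_expr."""
    return ["echtvar", "anno", "-e", library_zip, "-i", include_expr, in_vcf, out_bcf]

cmd = build_echtvar_cmd(
    "gnomad.genomes.r2.1.1.sites.norm.zip",  # library file name taken from this thread
    "gnomad_popmax < 0.05",                  # filter used in this test
    "input.vcf",                             # hypothetical input file name
    "filtered.bcf",                          # hypothetical output file name
)

# Print a copy-pasteable shell command; run it with subprocess.run(cmd, check=True)
# on a machine where echtvar is installed.
print(shlex.join(cmd))
```

Since Echtvar reports a missing AF as -1, an expression of this form also keeps variants that are absent from gnomAD.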
The following are the filter results:
==> CL0049_N1D_E2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 36533 variants (2562 / second). wrote 2165 variants.
==> CL0049_N1D_PS2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 1622 variants (679 / second). wrote 122 variants.
==> CL0049_T1D_E2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 45251 variants (3040 / second). wrote 9468 variants.
==> CL0049_T1D_E2_HGM3YBGXY.MuTect.raw.snpEff.log <== [echtvar] evaluated 12593 variants (984 / second). wrote 12552 variants.
==> CL0049_T1D_E2_HGM3YBGXY.strelka.indels.raw.snpEff.log <== [echtvar] evaluated 114 variants (172 / second). wrote 102 variants.
==> CL0049_T1D_E2_HGM3YBGXY.strelka.snvs.raw.snpEff.log <== [echtvar] evaluated 13442 variants (1039 / second). wrote 13420 variants.
==> CL0049_T1D_PS2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 2122 variants (780 / second). wrote 570 variants.
==> CL0049_T1D_PS2_HGM3YBGXY.MuTect.raw.snpEff.log <== [echtvar] evaluated 964 variants (444 / second). wrote 958 variants.
==> CL0049_T1D_PS2_HGM3YBGXY.strelka.indels.raw.snpEff.log <== [echtvar] evaluated 10 variants (175 / second). wrote 10 variants.
==> CL0049_T1D_PS2_HGM3YBGXY.strelka.snvs.raw.snpEff.log <== [echtvar] evaluated 1235 variants (529 / second). wrote 1234 variants.
==> CL0049_T1R_T3_HGLVNBGXY.HC_RNASeq.raw.snpEff.log <== [echtvar] evaluated 38134 variants (2713 / second). wrote 9646 variants.
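To compare the files more easily, the fraction of variants kept can be computed from each log line. The sketch below assumes the log format shown above; the function name `parse_echtvar_log` is mine, not part of Echtvar.

```python
import re

# Matches lines like:
# [echtvar] evaluated 36533 variants (2562 / second). wrote 2165 variants.
LOG_RE = re.compile(r"evaluated (\d+) variants.*?wrote (\d+) variants")

def parse_echtvar_log(line):
    """Return (evaluated, wrote, fraction_kept), or None if the line does not match."""
    m = LOG_RE.search(line)
    if m is None:
        return None
    evaluated, wrote = int(m.group(1)), int(m.group(2))
    fraction = wrote / evaluated if evaluated else 0.0
    return evaluated, wrote, fraction

# Two of the log lines from the results above.
logs = {
    "CL0049_N1D_E2_HGM3YBGXY.HC_DNASeq": "[echtvar] evaluated 36533 variants (2562 / second). wrote 2165 variants.",
    "CL0049_T1D_E2_HGM3YBGXY.MuTect": "[echtvar] evaluated 12593 variants (984 / second). wrote 12552 variants.",
}

for name, line in logs.items():
    evaluated, wrote, fraction = parse_echtvar_log(line)
    print(f"{name}\t{evaluated}\t{wrote}\t{fraction:.1%}")
```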
Most of the germline variants (the HC files) got filtered out, while almost all of the MuTect and strelka calls were kept.
Out of 922 variants in CL0049_OM16-008-FFPE.final.rare.not_found.tsv.xlsx, 22 variants either have gnomad_popmax < 0.05 or are missing an AF (-1) in gnomAD. The AFs of these 22 rare variants are as follows (columns: chromosome, position, ref, alt, gnomad_popmax):
chr1 145021150 T C 0.0449555
chr2 133075809 G A -1
chr2 241696840 ATCC ATCCTCC -1
chr3 51422766 G GGAGGAGGAT 0.0022795
chr3 51422766 G GGAGGAGGAT 0.0022795
chr3 73111503 T TTGG 0.0430405
chr8 101730036 T TC 6.65e-05
chr11 6411935 TGCTGGC CGCTGGC -1
chr11 7717219 A T 0.004142
chr11 66512290 G GGGCGGC -1
chr11 66512290 G GGGCGGC -1
chr12 53207583 C CCACCAAAGCCACCAGTGCCGAAACCAGCTCCGAAGCCGCCGG -1
chr14 20665538 C T 0.0461855
chr14 77493794 T TTGC 0.0023415
chr14 92537385 CTTT C -1
chr15 100252709 CCAGCAG CCAG -1
chr19 53792954 G A 0.0479185
chr19 53792955 C T 0.047912
chr19 53792955 C T 0.047912
chr22 37906308 G GCTCCTT 0.045227
chr22 50312548 T TTTC 0.0210935
chrY 21154466 T A -1
826 common variants with gnomad_popmax > 0.05, for example:
chr1 1647893 C CTTTCTT 0.498716
chr1 1666251 G A 0.710347
chr1 1684347 C CCCT 0.557789
chr1 5935162 A T 0.891839
chr1 12175729 C T 1
chr1 12820870 T C 0.896068
chr1 12835868 T C 0.803066
chr1 12854530 C G 0.912057
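The split into rare, missing, and common variants used above can be sketched as follows. This assumes rows in the same five-column layout (chromosome, position, ref, alt, AF) with -1 marking a missing gnomAD AF. Treating an AF of exactly 0.05 as common is my assumption, since the notes only state < 0.05 and > 0.05.

```python
MISSING_AF = -1.0  # value reported when the variant is not found in gnomAD

def classify(af, threshold=0.05):
    """Label an allele frequency as 'missing', 'rare', or 'common'."""
    if af == MISSING_AF:
        return "missing"
    # Assumption: an AF exactly equal to the threshold counts as common.
    return "rare" if af < threshold else "common"

# A few rows copied from the lists above.
rows = [
    "chr1 145021150 T C 0.0449555",
    "chr2 133075809 G A -1",
    "chr8 101730036 T TC 6.65e-05",
    "chr1 1647893 C CTTTCTT 0.498716",
    "chr1 12175729 C T 1",
]

counts = {"missing": 0, "rare": 0, "common": 0}
for row in rows:
    chrom, pos, ref, alt, af = row.split()
    label = classify(float(af))
    counts[label] += 1
    print(chrom, pos, ref, alt, af, label)

print(counts)
```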
We have already built an hg19 version of the gnomAD genome library to use with Echtvar.
Sent to Vineela on 5/12 and reminded her on 9/22:
Vineela, Just transferred the library files to
/data/khanlab/projects/vineela/echtvar
[chea@helix echtvar]$ ll
total 4078623
-rwxr-xr-x+ 1 chea chea 1520229285 May 12 09:57 gnomad.genomes.r2.1.1.sites.norm.zip
-rwxr-xr-x+ 1 chea chea 2656279146 May 12 09:55 gnomad.v3.1.2.echtvar.popmax.v2.zip
-rwxr-xr-x+ 1 chea chea 731 May 12 09:52 README
The README file contains the download site and steps for echtvar.
Let me know if you have any questions
Added an AVIA_not_found column. It seems AVIA does not have a complete MAF list.