CCRGeneticsBranch / Oncogenomics_v2

Oncogenomics portal version 2

Compare the MAF file that Anney shared on biowulf #16

Open hsienchao opened 1 year ago

hsienchao commented 1 year ago

Added an AVIA_not_found column. It seems AVIA does not have the complete MAF list.

| Case | Khanlab | AVIA | Both | Khanlab Only | AVIA Only | AVIA Only Not Found | Khanlab Only (%) | AVIA Only (%) | AVIA Only Not Found (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CL0049_OM16-008-FFPE | 18776 | 18779 | 17844 | 932 | 935 | 922 | 5.0% | 5.0% | 4.91% |
| CL0052_OM16-012-FFPE | 2553 | 3093 | 2177 | 376 | 916 | 905 | 14.7% | 29.6% | 29.26% |
| CL0082_OM16-043 | 1233 | 1517 | 943 | 290 | 574 | 570 | 23.5% | 37.8% | 37.57% |
| CL0086_OM16-046 | 6531 | 6751 | 5834 | 697 | 917 | 903 | 10.7% | 13.6% | 13.38% |
| CL0263_OM19-055 | 5236 | 5348 | 4410 | 826 | 938 | 923 | 15.8% | 17.5% | 17.26% |
| CL0301_OM19-131 | 8659 | 8837 | 7950 | 709 | 887 | 872 | 8.2% | 10.0% | 9.87% |
| MNH332_OM19-021 | 2823 | 2807 | 1797 | 1026 | 1010 | 993 | 36.3% | 36.0% | 35.38% |
| NCI0243_NCI0243 | 3681 | 3924 | 2895 | 786 | 1029 | 1013 | 21.4% | 26.2% | 25.82% |
| NCI0263_Tumor5 | 3989 | 4187 | 3280 | 709 | 907 | 894 | 17.8% | 21.7% | 21.35% |
| NCI0439_OM19-061 | 2411 | 2700 | 1765 | 646 | 935 | 923 | 26.8% | 34.6% | 34.19% |
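
For anyone wanting to reproduce counts like these, here is a minimal sketch of the set arithmetic the table implies. It assumes each variant is keyed by a (chrom, pos, ref, alt) tuple, which is not stated in this thread, and the function name `compare_variant_sets` is hypothetical. The percentages use each source's own total as the denominator, which matches the table (for example 376 / 2553 ≈ 14.7%). The "AVIA Only Not Found" column is left out because its exact definition is not given here.

```python
def compare_variant_sets(khanlab, avia):
    """Summarise the overlap between two collections of variant keys.

    Assumption: a variant key is a (chrom, pos, ref, alt) tuple.
    """
    khanlab, avia = set(khanlab), set(avia)
    both = khanlab & avia
    khanlab_only = khanlab - avia
    avia_only = avia - khanlab

    def pct(part, whole):
        # Percentage of `part` within `whole`; 0.0 when `whole` is empty.
        return 100.0 * part / whole if whole else 0.0

    return {
        "khanlab": len(khanlab),
        "avia": len(avia),
        "both": len(both),
        "khanlab_only": len(khanlab_only),
        "avia_only": len(avia_only),
        "khanlab_only_pct": pct(len(khanlab_only), len(khanlab)),
        "avia_only_pct": pct(len(avia_only), len(avia)),
    }


# Toy example with made-up variant keys (not taken from the real data).
khanlab = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 300, "T", "TA")}
avia = {
    ("chr1", 100, "A", "T"),
    ("chr2", 300, "T", "TA"),
    ("chr3", 400, "C", "G"),
    ("chr3", 500, "G", "A"),
}
summary = compare_variant_sets(khanlab, avia)
print(summary)
```
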
hsienchao commented 1 year ago

Had a meeting with Anney:

The MAF file she provided only includes the variants in the Khanlab database, but we need the MAF for all variants so that we can use it for future data. She is evaluating the feasibility of this.

cheanney commented 1 year ago

My colleague Keyur Talsania showed me a tool called Echtvar (https://github.com/brentp/echtvar) that can rapidly filter on allele frequency. To test this tool, I used 11 VCF files from sample CL0049. The filter I used is gnomad_popmax < 0.05, so the kept variants are those with gnomad_popmax < 0.05.

The following are the filter results:

==> CL0049_N1D_E2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 36533 variants (2562 / second). wrote 2165 variants.

==> CL0049_N1D_PS2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 1622 variants (679 / second). wrote 122 variants.

==> CL0049_T1D_E2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 45251 variants (3040 / second). wrote 9468 variants.

==> CL0049_T1D_E2_HGM3YBGXY.MuTect.raw.snpEff.log <== [echtvar] evaluated 12593 variants (984 / second). wrote 12552 variants.

==> CL0049_T1D_E2_HGM3YBGXY.strelka.indels.raw.snpEff.log <== [echtvar] evaluated 114 variants (172 / second). wrote 102 variants.

==> CL0049_T1D_E2_HGM3YBGXY.strelka.snvs.raw.snpEff.log <== [echtvar] evaluated 13442 variants (1039 / second). wrote 13420 variants.

==> CL0049_T1D_PS2_HGM3YBGXY.HC_DNASeq.raw.snpEff.log <== [echtvar] evaluated 2122 variants (780 / second). wrote 570 variants.

==> CL0049_T1D_PS2_HGM3YBGXY.MuTect.raw.snpEff.log <== [echtvar] evaluated 964 variants (444 / second). wrote 958 variants.

==> CL0049_T1D_PS2_HGM3YBGXY.strelka.indels.raw.snpEff.log <== [echtvar] evaluated 10 variants (175 / second). wrote 10 variants.

==> CL0049_T1D_PS2_HGM3YBGXY.strelka.snvs.raw.snpEff.log <== [echtvar] evaluated 1235 variants (529 / second). wrote 1234 variants.

==> CL0049_T1R_T3_HGLVNBGXY.HC_RNASeq.raw.snpEff.log <== [echtvar] evaluated 38134 variants (2713 / second). wrote 9646 variants.

Most of the germline variants got filtered out.
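
To put a number on that, here is a small sketch that parses the `[echtvar]` log lines quoted above and computes the fraction of variants kept per file. The regular expression is written against the log text as it appears in this thread, not against any documented Echtvar log format, and `kept_fraction` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# "[echtvar] evaluated 36533 variants (2562 / second). wrote 2165 variants."
LOG_LINE = re.compile(r"\[echtvar\] evaluated (\d+) variants .*?wrote (\d+) variants")


def kept_fraction(log_line):
    """Return written / evaluated for one echtvar log line."""
    match = LOG_LINE.search(log_line)
    if match is None:
        raise ValueError(f"not an echtvar summary line: {log_line!r}")
    evaluated, wrote = int(match.group(1)), int(match.group(2))
    return wrote / evaluated if evaluated else 0.0


# Two of the log lines from this comment.
logs = {
    "CL0049_N1D_E2_HGM3YBGXY.HC_DNASeq": (
        "[echtvar] evaluated 36533 variants (2562 / second). wrote 2165 variants."
    ),
    "CL0049_T1D_E2_HGM3YBGXY.MuTect": (
        "[echtvar] evaluated 12593 variants (984 / second). wrote 12552 variants."
    ),
}
for name, line in logs.items():
    print(f"{name}: kept {kept_fraction(line):.1%}")
```

On these two lines the HC_DNASeq file keeps only a small share of its variants, while the MuTect file keeps nearly all of them.
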

Out of the 922 variants in CL0049_OM16-008-FFPE.final.rare.not_found.tsv.xlsx, 22 either have gnomad_popmax < 0.05 or are missing an AF (-1) in gnomAD. The AFs of these 22 rare variants are:

chr1 145021150 T C 0.0449555
chr2 133075809 G A -1
chr2 241696840 ATCC ATCCTCC -1
chr3 51422766 G GGAGGAGGAT 0.0022795
chr3 51422766 G GGAGGAGGAT 0.0022795
chr3 73111503 T TTGG 0.0430405
chr8 101730036 T TC 6.65e-05
chr11 6411935 TGCTGGC CGCTGGC -1
chr11 7717219 A T 0.004142
chr11 66512290 G GGGCGGC -1
chr11 66512290 G GGGCGGC -1
chr12 53207583 C CCACCAAAGCCACCAGTGCCGAAACCAGCTCCGAAGCCGCCGG -1
chr14 20665538 C T 0.0461855
chr14 77493794 T TTGC 0.0023415
chr14 92537385 CTTT C -1
chr15 100252709 CCAGCAG CCAG -1
chr19 53792954 G A 0.0479185
chr19 53792955 C T 0.047912
chr19 53792955 C T 0.047912
chr22 37906308 G GCTCCTT 0.045227
chr22 50312548 T TTTC 0.0210935
chrY 21154466 T A -1

826 are common variants with gnomad_popmax > 0.05, for example:

chr1 1647893 C CTTTCTT 0.498716
chr1 1666251 G A 0.710347
chr1 1684347 C CCCT 0.557789
chr1 5935162 A T 0.891839
chr1 12175729 C T 1
chr1 12820870 T C 0.896068
chr1 12835868 T C 0.803066
chr1 12854530 C G 0.912057
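
The rare / missing / common split described above can be sketched as follows. This is only an illustration of the thresholding logic, not how Echtvar evaluates its filter internally. It assumes -1 is the placeholder for variants absent from gnomAD, as reported above, and it treats an AF of exactly 0.05 as common, which the thread does not specify. The name `classify` is hypothetical.

```python
MISSING_AF = -1.0  # placeholder reported above for variants without a gnomAD AF
THRESHOLD = 0.05   # gnomad_popmax cutoff used in this thread


def classify(af, threshold=THRESHOLD):
    """Label one allele frequency as 'missing', 'rare' or 'common'."""
    if af == MISSING_AF:
        return "missing"
    # Assumption: an AF exactly equal to the threshold counts as common.
    return "rare" if af < threshold else "common"


# A few of the variants listed above: (chrom, pos, ref, alt, gnomad_popmax).
variants = [
    ("chr1", 145021150, "T", "C", 0.0449555),
    ("chr2", 133075809, "G", "A", -1),
    ("chr8", 101730036, "T", "TC", 6.65e-05),
    ("chr1", 1647893, "C", "CTTTCTT", 0.498716),
    ("chr1", 12175729, "C", "T", 1),
]
labels = [classify(v[4]) for v in variants]
for variant, label in zip(variants, labels):
    print(*variant, label)
```

Note that a plain numeric comparison `af < 0.05` is also true for -1, which is consistent with the missing-AF variants appearing among the kept ones.
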

We have already built an hg19 version of the gnomAD genome library to use with Echtvar.

cheanney commented 1 year ago

Sent to Vineela on 5/12 and reminded her on 9/22:

Vineela, I just transferred the library files to

/data/khanlab/projects/vineela/echtvar

[chea@helix echtvar]$ ll
total 4078623
-rwxr-xr-x+ 1 chea chea 1520229285 May 12 09:57 gnomad.genomes.r2.1.1.sites.norm.zip
-rwxr-xr-x+ 1 chea chea 2656279146 May 12 09:55 gnomad.v3.1.2.echtvar.popmax.v2.zip
-rwxr-xr-x+ 1 chea chea 731 May 12 09:52 README

The README file contains the download site and steps for echtvar.

Let me know if you have any questions.