d3b-center / ticket-tracker-OPC

A repo to generate and track tickets for ped OT

Update DGD fusion file to match GENCODE v39 symbols + run FusionAnnotator on it #490

Closed by jharenza 1 year ago

jharenza commented 1 year ago

What data file(s) does this issue pertain to?

dgd fusion file

What release are you using?

v12

Put your question or report your issue here.

We will need to harmonize this fusion file to GENCODE v39. @migbro any ideas on how we will do this with incoming DGD/archer fusion panel using old gene symbols - can we create a workflow upstream of the annofuse/fusion annotator workflow for DGD fusion to do updated gene symbol matching?

cc @zhangb1 @ewafula @chinwallaa @aadamk

migbro commented 1 year ago

Hmm, well, I think ENSEMBL/GENCODE uses HGNC gene symbols, and I think HGNC is the authority behind the HUGO symbols. So I'm looking into finding a converter, or writing a tool that extracts the existing symbols, queries for the current ones, and replaces them. They have a BioMart server; also see this repo: https://github.com/HGNC/get-gene-info

What's the timeline on this?

migbro commented 1 year ago

Or, we could just write a tool that takes one of these files: https://www.genenames.org/download/archive/ (at the end there is the current symbol mapping in tabular and JSON format) and uses it as a reference. So, we'd load the archer file and the reference into something like a pandas dataframe, join on the symbols (may need to manipulate the reference a little), and replace the old symbols with the new ones.
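The join described above could look roughly like the sketch below. It assumes the reference table has a `symbol` column and a pipe-delimited `prev_symbol` column (the layout I believe the HGNC complete-set download uses, but check the actual file), and that the fusion table has one gene symbol per cell. The function names, the `Gene1`/`Gene2` column names, and the toy rows are illustrative only. Old symbols that map to more than one current symbol are left untouched rather than guessed.

```python
import io

import pandas as pd


def build_symbol_map(reference: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (old_symbol, new_symbol) pair from the reference."""
    pairs = reference[["symbol", "prev_symbol"]].dropna(subset=["prev_symbol"])
    # prev_symbol is assumed to hold pipe-delimited previous symbols
    pairs = pairs.assign(old_symbol=pairs["prev_symbol"].str.split("|"))
    pairs = pairs.explode("old_symbol").rename(columns={"symbol": "new_symbol"})
    return pairs[["old_symbol", "new_symbol"]].reset_index(drop=True)


def update_symbols(fusions: pd.DataFrame, reference: pd.DataFrame, column: str) -> pd.DataFrame:
    """Replace outdated symbols in `column` where exactly one new symbol exists."""
    symbol_map = build_symbol_map(reference)
    # Drop every old symbol that maps to more than one new symbol
    unambiguous = symbol_map.drop_duplicates(subset="old_symbol", keep=False)
    current = set(reference["symbol"])
    merged = fusions.merge(unambiguous, how="left", left_on=column, right_on="old_symbol")
    # Only touch values that have a match and are not already a current symbol
    stale = merged["new_symbol"].notna() & ~merged[column].isin(current)
    merged.loc[stale, column] = merged.loc[stale, "new_symbol"]
    return merged.drop(columns=["old_symbol", "new_symbol"])


# Toy reference and fusion rows, for illustration only
reference = pd.read_csv(
    io.StringIO("symbol\tprev_symbol\nZFTA\tC11orf95\nRELA\t\nEPRS1\tEPRS|QARS\nQARS1\tQARS\nBRAF\t\n"),
    sep="\t",
)
fusions = pd.DataFrame(
    {"Gene1": ["C11orf95", "QARS", "KIAA1549"], "Gene2": ["RELA", "BRAF", "BRAF"]}
)
updated = update_symbols(fusions, reference, "Gene1")
print(updated)
```

In this toy run `C11orf95` becomes `ZFTA`, while `QARS` is kept as-is because it has two candidate replacements.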

jharenza commented 1 year ago

The second option might be quicker - timeline asap for v12 release and compatibility with PBTA. Cc @aadamk

migbro commented 1 year ago

Ha, idk when v12 release is, end of month? Can I get an example of the input file? And the idea here is to add an optional cwl tool to the annoFuse wf that will update the gene symbols, then run annoFuse, like we do for production runs? Or would it be standalone?

migbro commented 1 year ago

Ok, assigning @dmiller15 as he's hit a new blocker in the canine dev... To recap, we will write a tool that will update the gene symbols. The best route seems to be to use either the tsv or json to look up the old gene symbol and update it with the new gene symbol. In the short term, we can probably write it generically so that the input is a tsv and the user specifies which column to update. The tool will output a version of the input in which the old gene symbols have been updated, where possible. If Matt could be added, he could give us an example of input files, otherwise, @zhangb1 or @jharenza, where can we get that?
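A minimal sketch of the generic tool recapped above, using only the standard library. It is not the code that was eventually written for this ticket. It assumes the old-to-new lookup has already been reduced to a two-column TSV (old symbol, new symbol); that file layout, the function names, and the `Gene1`/`Gene2` column names in the demo are all hypothetical.

```python
import argparse
import csv
import io


def load_mapping(handle):
    """Read a two-column TSV of old_symbol<TAB>new_symbol into a dict."""
    return {old: new for old, new in csv.reader(handle, delimiter="\t")}


def update_column(in_handle, out_handle, column, mapping):
    """Copy a TSV, replacing values in `column` that appear in `mapping`.

    Returns the number of cells that were changed.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in input header")
    writer = csv.DictWriter(
        out_handle, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    changed = 0
    for row in reader:
        new_symbol = mapping.get(row[column])
        if new_symbol is not None and new_symbol != row[column]:
            row[column] = new_symbol
            changed += 1
        writer.writerow(row)
    return changed


def main(argv=None):
    """Command-line entry point; call main() to run against real files."""
    parser = argparse.ArgumentParser(description="Update gene symbols in one TSV column")
    parser.add_argument("--input", required=True)
    parser.add_argument("--mapping", required=True)
    parser.add_argument("--column", required=True)
    parser.add_argument("--output", required=True)
    args = parser.parse_args(argv)
    with open(args.mapping, newline="") as map_handle:
        mapping = load_mapping(map_handle)
    with open(args.input, newline="") as fin, open(args.output, "w", newline="") as fout:
        changed = update_column(fin, fout, args.column, mapping)
    print(f"updated {changed} value(s) in column {args.column}")


# In-memory demo with toy rows
source = io.StringIO("Gene1\tGene2\nC11orf95\tRELA\nKIAA1549\tBRAF\n")
result = io.StringIO()
n_changed = update_column(source, result, "Gene1", {"C11orf95": "ZFTA"})
print(result.getvalue())
```

Symbols with no entry in the mapping pass through unchanged, which matches the "updated, where possible" behavior described above.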

jharenza commented 1 year ago

Here is the file: s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/fusion-dgd.tsv.gz

dmiller15 commented 1 year ago

A quick review of the document that @migbro provided reveals that some of the old symbols map to more than one new symbol. Here is a complete list:

Old Symbol: ACSM2 has multiple new symbol options: ACSM2A,ACSM2B
Old Symbol: AK3L1 has multiple new symbol options: AK3,AK4
Old Symbol: ALDH7 has multiple new symbol options: ALDH3B1,ALDH9A1
Old Symbol: ALDH4 has multiple new symbol options: ALDH4A1,ALDH9A1
Old Symbol: AMY1 has multiple new symbol options: AMY1A,AMY1B,AMY1C
Old Symbol: AMY2 has multiple new symbol options: AMY2A,AMY2B
Old Symbol: D3F15S2 has multiple new symbol options: APEH,MST1
Old Symbol: DNF15S2 has multiple new symbol options: APEH,MST1
Old Symbol: ATPM has multiple new symbol options: ATP5F1A,ATP5PF
Old Symbol: VPP3 has multiple new symbol options: ATP6V1B1,ATP6V1B2
Old Symbol: ATP6D has multiple new symbol options: ATP6V1C1,ATP6V0D1
Old Symbol: ATP6C has multiple new symbol options: ATP6V1C1,ATP6V0C
Old Symbol: B3GNT1 has multiple new symbol options: B3GNT2,B4GAT1
Old Symbol: STRA13 has multiple new symbol options: BHLHE40,CENPX
Old Symbol: FACD has multiple new symbol options: BRCA2,FANCD2
Old Symbol: FANCD has multiple new symbol options: BRCA2,FANCD2
Old Symbol: C4BP has multiple new symbol options: C4BPA,C4BPB
Old Symbol: C11orf48 has multiple new symbol options: C11orf98,LBHD1
Old Symbol: CD1 has multiple new symbol options: CD1A,CD1B,CD1C
Old Symbol: EBN has multiple new symbol options: CHRNA4,KCNQ2
Old Symbol: EBN1 has multiple new symbol options: CHRNA4,KCNQ2
Old Symbol: CKMT1 has multiple new symbol options: CKMT1A,CKMT1B
Old Symbol: CLECSF13 has multiple new symbol options: CLEC4F,CLEC10A
Old Symbol: CLPSMCR has multiple new symbol options: COTL1P1,COTL1P2
Old Symbol: CPT1 has multiple new symbol options: CPT1A,CPT2
Old Symbol: CYP11B has multiple new symbol options: CYP11B1,CYP11B2
Old Symbol: DNM1DN15@ has multiple new symbol options: DNM1P24,DNM1P25
Old Symbol: DNM1DN16@ has multiple new symbol options: DNM1P26,DNM1P27
Old Symbol: DNM1DN3@ has multiple new symbol options: DNM1P28,DNM1P29,DNM1P30
Old Symbol: DNM1DN4@ has multiple new symbol options: DNM1P30,DNM1P31,DNM1P32
Old Symbol: DNM1DN8@ has multiple new symbol options: DNM1P34,DNM1P35,DNM1P36
Old Symbol: DNM1DN10@ has multiple new symbol options: DNM1P37,DNM1P38
Old Symbol: DNM1DN11@ has multiple new symbol options: DNM1P38,DNM1P39,DNM1P40,DNM1P41,DNM1P43,DNM1P44
Old Symbol: DNM1DN14@ has multiple new symbol options: DNM1P46,DNM1P47
Old Symbol: EIF2 has multiple new symbol options: EIF2S1,EIF2S2
Old Symbol: EIF4F has multiple new symbol options: EIF4A2,EIF4E,EIF4G1
Old Symbol: QARS has multiple new symbol options: EPRS1,QARS1
Old Symbol: FCG2 has multiple new symbol options: FCGR2A,FCGR2B
Old Symbol: FCGR2 has multiple new symbol options: FCGR2A,FCGR2B
Old Symbol: FCGR3 has multiple new symbol options: FCGR3A,FCGR3B
Old Symbol: FCG3 has multiple new symbol options: FCGR3A,FCGR3B
Old Symbol: EVR1 has multiple new symbol options: FZD4,LRP5
Old Symbol: GGT has multiple new symbol options: GGT1,GGT2P
Old Symbol: GGTA1P has multiple new symbol options: GGTA1,GGTA2P
Old Symbol: DFNA3 has multiple new symbol options: GJB2,GJB6
Old Symbol: DFNA2 has multiple new symbol options: GJB3,KCNQ4
Old Symbol: GPRK7 has multiple new symbol options: GRK7,MKNK2
Old Symbol: MNS has multiple new symbol options: GYPA,GYPB
Old Symbol: PLA2L has multiple new symbol options: HHLA1,OC90,PLA2G2A
Old Symbol: HNRPA3 has multiple new symbol options: HNRNPA3,HNRNPA3P1
Old Symbol: HOX1 has multiple new symbol options: HOXA1,HOXA3,HOXA4,HOXA5,HOXA6,HOXA7,HOXA9,HOXA10,HOXA11,HOXA13
Old Symbol: HOX1D has multiple new symbol options: HOXA4,HOXD3
Old Symbol: HOX2 has multiple new symbol options: HOXB1,HOXB2,HOXB3,HOXB4,HOXB5,HOXB6,HOXB7,HOXB8,HOXB9
Old Symbol: HOX3 has multiple new symbol options: HOXC4,HOXC5,HOXC6,HOXC8,HOXC9,HOXC12,HOXC13
Old Symbol: HOX4 has multiple new symbol options: HOXD1,HOXD3,HOXD4,HOXD8,HOXD9,HOXD10,HOXD11
Old Symbol: EDHB17 has multiple new symbol options: HSD17B1,HSD17B1P1
Old Symbol: IBP1 has multiple new symbol options: IGBP1,IGFBP1
Old Symbol: IGLC has multiple new symbol options: IGLC1,IGLC2,IGLC3,IGLC4,IGLC5,IGLC6
Old Symbol: MUM1 has multiple new symbol options: IRF4,PWWP3A
Old Symbol: PRSSL1 has multiple new symbol options: KLK10,PRSS57
Old Symbol: CHS1 has multiple new symbol options: LYST,VPS13B
Old Symbol: MT1 has multiple new symbol options: MT1A,MT1B,MT1E,MT1F,MT1G,MT1H,MT1IP,MT1JP,MT1L,MT1M,MT1X
Old Symbol: OFD1P1 has multiple new symbol options: OFD1P1Y,OFD1P17
Old Symbol: OFD1P2 has multiple new symbol options: OFD1P2Y,OFD1P6Y
Old Symbol: CBBM has multiple new symbol options: OPN1LW,OPN1MW
Old Symbol: OR7E68P has multiple new symbol options: OR7E26P,OR7E110P
Old Symbol: SIL has multiple new symbol options: PMEL,STIL
Old Symbol: PPP1R6 has multiple new symbol options: PPP1R3D,PPP1R9B
Old Symbol: PRKAR2 has multiple new symbol options: PRKAR2A,PRKAR2B
Old Symbol: HRMT1L3 has multiple new symbol options: PRMT3,PRMT8
Old Symbol: RH has multiple new symbol options: RHCE,RHD
Old Symbol: RNU1-5 has multiple new symbol options: RNU1-5P,RNVU1-18
Old Symbol: RNU1-8 has multiple new symbol options: RNU1-8P,RNU1-28P
Old Symbol: RNU12P has multiple new symbol options: RNU12,RNU12-2P
Old Symbol: CFAG has multiple new symbol options: S100A8,S100A9
Old Symbol: DIFF6 has multiple new symbol options: SEPTIN1,SEPTIN2
Old Symbol: NET1 has multiple new symbol options: SLC6A2,SLC6A5
Old Symbol: MADH7 has multiple new symbol options: SMAD6,SMAD7
Old Symbol: MADH6 has multiple new symbol options: SMAD6,SMAD9
Old Symbol: SPANX has multiple new symbol options: SPANXA1,SPANXA2
Old Symbol: SYT14L has multiple new symbol options: SYT14P1,SYT16
Old Symbol: TAF2A has multiple new symbol options: TAF1,TAF10
Old Symbol: ODZ3 has multiple new symbol options: TENM1,TENM3
Old Symbol: TRNP1 has multiple new symbol options: TRL-AAG2-3,TRP-AGG2-5,TRP-AGG2-6
Old Symbol: TRM1 has multiple new symbol options: TRX-CAT1-2,TRX-CAT2-1

@jharenza Is there additional information in the DGD fusion file that can be used to determine which new symbol should be used? Should we be using the symbols at all to determine what to update it to?
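For reference, a report like the list above can be produced with a check along these lines. This is a sketch, not the script actually used: it assumes the reference has already been parsed into (current symbol, list of previous symbols) pairs, and the three toy groups below are a small subset for illustration.

```python
from collections import defaultdict


def find_ambiguous(records):
    """Map each old symbol to its new symbols, keeping only those with more than one."""
    new_by_old = defaultdict(list)
    for new_symbol, prev_symbols in records:
        for old_symbol in prev_symbols:
            if new_symbol not in new_by_old[old_symbol]:
                new_by_old[old_symbol].append(new_symbol)
    return {old: news for old, news in new_by_old.items() if len(news) > 1}


# Toy (current symbol, previous symbols) pairs
records = [
    ("BRCA2", ["FANCD", "FACD"]),
    ("FANCD2", ["FANCD", "FACD"]),
    ("EPRS1", ["QARS"]),
    ("QARS1", ["QARS"]),
    ("ZFTA", ["C11orf95"]),
]
ambiguous = find_ambiguous(records)
report = [
    f"Old Symbol: {old} has multiple new symbol options: {','.join(news)}"
    for old, news in ambiguous.items()
]
print("\n".join(report))
```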

jharenza commented 1 year ago

ok, this is a bit worrisome - we should update to whatever is in v39 GENCODE. The major change I expected was C11orf95 --> RELA

dmiller15 commented 1 year ago

So I did a quick check of the genes that have multiple new symbols with the genes in the DGD fusion files and found no overlap. So it's not an issue for this particular case but we'll want to discuss an official solution down the road.
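The overlap check described above amounts to a set intersection. A rough sketch follows; the column names `Gene1` and `Gene2` and the toy rows are placeholders, since the real DGD file's header is not shown in this thread.

```python
import csv
import io


def symbols_in_columns(handle, columns):
    """Collect the distinct non-empty values found in the given TSV columns."""
    reader = csv.DictReader(handle, delimiter="\t")
    found = set()
    for row in reader:
        for column in columns:
            value = row.get(column)
            if value:
                found.add(value)
    return found


# A few ambiguous old symbols taken from the list in this thread
ambiguous_old = {"QARS", "FANCD", "MUM1"}
# Toy stand-in for the DGD fusion file
dgd = io.StringIO("Gene1\tGene2\nKIAA1549\tBRAF\nC11orf95\tRELA\n")
overlap = ambiguous_old & symbols_in_columns(dgd, ["Gene1", "Gene2"])
print(sorted(overlap) or "no ambiguous symbols present")
```

An empty intersection means none of the symbols in the file needs a manual decision.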

jharenza commented 1 year ago

I meant - that gene id NEEDS to change to be consistent with the rest of the data....

migbro commented 1 year ago

It will - to clarify, what Dan is saying is that some old gene symbols oddly have multiple new symbols that they can be converted to, so it's ambiguous without something like an HGNC ID or ENST ID. Since GENCODE/ENSEMBL uses HGNC as their source, I doubt it'd help narrow it down. Luckily, none of the ones in DGD have a symbol that is ambiguous. Even better, since it's a panel, we probably don't have to worry about this, and symbols that can be converted (including the orf) will be!

jharenza commented 1 year ago

Oh! thanks, I misunderstood and thought that paste was from the DGD file! Sounds good, thanks!

dmiller15 commented 1 year ago

Alright I've got some code for this task. Do we have some place we would like it to be stored? If not I can just dump the updated file and relevant code here.

jharenza commented 1 year ago

Thanks @dmiller15.

@aadamk @migbro @luederm - thoughts on where this code will go? I think the steps are:

  1. Export fusions from filemaker pro (or create custom fusion file from misc other project) in the format of the file in this comment.
  2. Run the code @dmiller15 prepped to update gene symbols to GENCODE v39 and annotate with FusionAnnotator. Use this file as input for:
  3. Annofuse QC filtering + annotations - this is something @migbro added capability for after @sakshamphul's updates to annoFuse to take in custom fusion file (via Kids First RNASeq workflow??) - nothing should be removed from this DGD file for QC, but this will add the fields which annotate each gene as oncogene, tumor suppressor, kinase, and/or transcription factor, fields which are utilized within the Molecular Targets Platform tables.

cc @baileyckelly @allisonheath

migbro commented 1 year ago

Hmmm, well, the KF workflow might be a way to go in that annoFuse has a standalone wf (maybe we make it a public app?). There is still some a-synchrony as to what function is used to format fusions, so I may have to rework that tool slightly. With that, I could envision adding that as a tool to the wf, and maintaining the code within the KF RNAseq wf git repo.

migbro commented 1 year ago

On second thought, how and where are DGD fusion files handled now? Like, the KF workflow normally runs the fusion callers, then runs the annoFuse wf. So in that context, it makes no sense... so since DGD is run separately and is kind of its own thing, either a modified version of the KF workflow could be created, or we add this new tool to the "existing" KF wf (assuming one exists). If it does not, we can make it like, a D3b WF and either give it its own repo or stick it in some existing MTP or toolkit repo. Thoughts @jharenza ? Lastly, will we keep getting these fusion panels? I had heard something about DGD adding bulk RNAseq to its repertoire in the near future, so that's more like the KF pipeline.

migbro commented 1 year ago

Yeah, I think it should be a D3b repo on second thought. I will create a related ticket.

dmiller15 commented 1 year ago

Friday update:

migbro commented 1 year ago

@jharenza PR open here: https://github.com/d3b-center/D3b-DGD-Collaboration/pull/1

migbro commented 1 year ago

To kind of close this ticket, is the last step for @zhangb1 or someone on the MTP team, to run https://github.com/d3b-center/D3b-DGD-Collaboration/blob/main/workflows/dgd_fusion_annotator_wf.cwl using task https://cavatica.sbgenomics.com/u/d3b-bixu/d3b-dgd-collab-dev/tasks/13699cba-f6a2-44d7-86f4-86f17ef2a06a/ as an example?

jharenza commented 1 year ago

well, you already ran it so this file seems fine for me for v12... but i guess we want to put this into toolkit workflows?

migbro commented 1 year ago

Like copy it over to a certain project? That's fine. Which one? Then from there I guess someone would pick that up and export to a bucket?

jharenza commented 1 year ago

can you gzip, name it fusion-dgd.tsv.gz, and put in v12 OPC release folder on s3 and update md5sum in md5sum.txt?
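The local part of that request (gzip, recompute the checksum, swap the entry in md5sum.txt) could be sketched as below. It assumes md5sum.txt uses the usual `<digest>  <filename>` line format; the helper names and toy file contents are illustrative, and the upload to S3 is deliberately left out.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path


def gzip_file(src: Path, dest: Path) -> None:
    """Compress src into dest with gzip."""
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def update_md5_file(md5_path: Path, filename: str, digest: str) -> None:
    """Drop any existing entry for filename and append the new digest."""
    lines = md5_path.read_text().splitlines() if md5_path.exists() else []
    kept = [line for line in lines if line.split(maxsplit=1)[1:] != [filename]]
    kept.append(f"{digest}  {filename}")
    md5_path.write_text("\n".join(kept) + "\n")


# Demo in a temporary directory with toy contents
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    tsv = tmp_dir / "fusion-dgd.tsv"
    tsv.write_text("FusionName\nZFTA--RELA\n")
    gz = tmp_dir / "fusion-dgd.tsv.gz"
    gzip_file(tsv, gz)
    md5_file = tmp_dir / "md5sum.txt"
    md5_file.write_text(
        "0" * 32 + "  fusion-dgd.tsv.gz\n" + "1" * 32 + "  snv-dgd.maf.tsv.gz\n"
    )
    new_digest = md5_of(gz)
    update_md5_file(md5_file, gz.name, new_digest)
    md5_lines = md5_file.read_text().splitlines()
    with gzip.open(gz, "rt") as handle:
        roundtrip = handle.read()
```

Note that gzip stores a timestamp in its header, so recompressing the same TSV later will generally yield a different MD5; the checksum should be taken from the exact file that gets uploaded.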

jharenza commented 1 year ago

s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/

migbro commented 1 year ago

Thanks! I'll do this tomorrow

migbro commented 1 year ago

@jharenza File uploaded:

aws s3 ls --no-sign-request  s3://d3b-openaccess-us-east-1-prd-pbta/open-targets/v12/
                           PRE methyl-pre-merge/
                           PRE mtp-tables/
2022-07-27 21:23:53          0
2023-01-25 21:29:31    3971661 UCSC_hg19-GRCh37_Ensembl2RefSeq.tsv
2023-01-26 09:46:08       6036 fusion-dgd.tsv.gz
2022-08-22 16:28:01 1548794683 gene-expression-rsem-tpm-collapsed-subset.rds
2023-01-25 21:29:31   10056262 infinium-annotation-mapping.tsv
2023-01-25 21:29:31  172843499 infinium-methylationepic-v-1-0-b5-manifest-file-csv.zip
2023-01-26 09:47:47        496 md5sum.txt
2022-07-27 21:47:04   20471487 snv-dgd.maf.tsv.gz
2022-08-22 16:28:01 1009047667 tcga-gene-expression-rsem-tpm-collapsed-subset.rds

Old md5 value was deleted and replaced with the new one. Closing ticket - can reopen if something is off