PathwayCommons / msigdb-to-biopax

MSigDB (human C3 TFT motif gene sets only) to BioPAX Level3 data converter.

TFs are modelled as Rna #2

Closed IgorRodchenkov closed 7 years ago

IgorRodchenkov commented 7 years ago

All the TFs (to be used for TemplateReactionRegulation/controller) are created there as Rna rather than Protein. Why? Sounds like a mistake.

IgorRodchenkov commented 7 years ago

@emekdemir @ozgunbabur @armish @gbader Do you have any thoughts in this regard? Am I right that Rna controllers should be changed to Proteins?.. A.S.A.P.

IgorRodchenkov commented 7 years ago

Look here: https://github.com/PathwayCommons/msigdb-to-biopax/blob/master/src/main/java/edu/mit/broad/vdb/msigdb/converter/MsigdbToBiopaxConverter.java#L119

ozgunbabur commented 7 years ago

MSigDB says that the C3 collection contains both miRNA targets and TF targets. We should use an RNA for the former and a Protein for the latter. I did not look at the data in detail though.

IgorRodchenkov commented 7 years ago

The README here says that we use only TFT records from C3, no?

IgorRodchenkov commented 7 years ago

Also, see this line.

IgorRodchenkov commented 7 years ago

Looks like, although the README says "TF", the converter in fact processed only the "Motif" C3 sub-set; that's why Rna... OMG... What shall we do then? Fix the README and keep converting and importing "Motif" (miRNA as controllers), or "TFT" (Proteins), or both?

IgorRodchenkov commented 7 years ago

... ah, I don't see any "Motif" (case insensitive) in the MSigDB 6.0 (perhaps that was in v5_2...)

IgorRodchenkov commented 7 years ago

Well, looks like "TFT" becomes "Motif" after parsing the input data with the GSEA java lib... So, we do process only TFT data.

ozgunbabur commented 7 years ago

In that case, yes I guess it should be a Protein, not an RNA.


IgorRodchenkov commented 7 years ago

Each C3 TFT gene set describes a TF binding motif (cis-regulatory, near a promoter) and the genes where it is found. So, basically, this converter generates a TemplateReactionRegulation for the TF matching the motif (if it has a match in TRANSFAC) and multiple TemplateReactions (TR), one per member gene, that the TF "controls". It generates an Rna for each member gene and adds it as the product (property value) of the corresponding TR.
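To make that structure concrete, here is a minimal Python sketch. The classes are plain stand-ins named after the BioPAX classes mentioned above, not the Paxtools API, and `convert_gene_set` is a hypothetical helper. It assumes the controller is a Protein, as suggested earlier in this thread, rather than the Rna the converter created.

```python
from dataclasses import dataclass, field


# Plain-Python stand-ins named after BioPAX classes (NOT the Paxtools API).
@dataclass
class Protein:
    name: str


@dataclass
class Rna:
    name: str


@dataclass
class TemplateReaction:
    product: Rna


@dataclass
class TemplateReactionRegulation:
    controller: Protein
    controlled: list = field(default_factory=list)


def convert_gene_set(tf_symbol, member_genes):
    """Hypothetical helper: one regulation for the TF, one reaction per gene."""
    regulation = TemplateReactionRegulation(controller=Protein(tf_symbol))
    for gene in member_genes:
        # Each member gene becomes an Rna product of its own TemplateReaction.
        regulation.controlled.append(TemplateReaction(product=Rna(gene)))
    return regulation


regulation = convert_gene_set("HOXA5", ["GNAI1", "LMO4", "CALD1"])
print(regulation.controller.name, len(regulation.controlled))
```

The regulator and the products get different types here, which is the point under discussion: only the per-gene products are Rna.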

MSigDB 6.0 now has slightly different content compared with 5.2, which we processed before. We cannot extract the TF gene symbol, such as "HOXA5", anymore; it's gone. TFs (mistakenly modelled as Rna) are now simply impossible to get from the MSigDB XML data alone. With MSigDB v5.2, the converter extracts a TF gene symbol from the DESCRIPTION_BRIEF attribute by using a regex matching the "which matches annotation for HOXA5: ..." text to get "HOXA5" (this way it also skips unknown motifs/TFs).

MSigDB v6.0 example:

    <GENESET STANDARD_NAME="HOX13_01" SYSTEMATIC_NAME="M9036" HISTORICAL_NAMES="" ORGANISM="Homo sapiens" PMID=""
             AUTHORS="" GEOID="" EXACT_SOURCE="" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="" CHIP="HUMAN_GENE_SYMBOL"
             CATEGORY_CODE="C3" SUB_CATEGORY_CODE="TFT" CONTRIBUTOR="Xiaohui Xie" CONTRIBUTOR_ORG="Broad Institute"
             DESCRIPTION_BRIEF="Genes having at least one occurence of the transcription factor binding site V$HOX13_01 (v7.4 TRANSFAC) in the regions spanning up to 4 kb around their transcription starting sites."
             DESCRIPTION_FULL="" TAGS="" 
             MEMBERS="HIST1H2AC,TNRC6,MGC4268,GNAI1,GPR161,LMO4,CALD1,PRRX1,GSH1,NEO1,SYNPO2L,LOC51059,PCTK1,BHC80,TRIM8,NACSIN,UNC5B,SMOC1,CGI-30,FEZL,TRPV6,UNC5C,PPP2R2B,TFAP2BL1,YES1,CASKIN1,NTF3,FOXA1,EMX2,COLEC10,NDFIP1,IGF1,HUMPPA,ZA20D1,SON,CCND1,FLJ44313,PDE2A,HOXB5,SVIL,HOXD4,CROC4,KCNH3,LOC124402,ZNF485,FBXO11" 
             MEMBERS_SYMBOLIZED="HIST1H2AC,TNRC6A,AMMECR1L,GNAI1,GPR161,LMO4,CALD1,PRRX1,GSX1,NEO1,SYNPO2L,FAM135B,CDK16,PHF21A,TRIM8,EHBP1,UNC5B,SMOC1,DPH5,FEZF2,TRPV6,UNC5C,PPP2R2B,TFAP2D,YES1,CASKIN1,NTF3,FOXA1,EMX2,COLEC10,NDFIP1,IGF1,CDR2L,OTUD7B,SON,CCND1,FLJ44313,PDE2A,HOXB5,SVIL,HOXD4,C1orf61,KCNH3,FAM100A,ZNF485,FBXO11" ...
             FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS="">
    </GENESET>

MSigDB v5.2 example:

    <GENESET STANDARD_NAME="V$HOX13_01" SYSTEMATIC_NAME="M9036" HISTORICAL_NAMES="" ORGANISM="Homo sapiens" PMID=""
             AUTHORS="" GEOID="" EXACT_SOURCE="" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="" CHIP="HUMAN_GENE_SYMBOL"
             CATEGORY_CODE="C3" SUB_CATEGORY_CODE="TFT" CONTRIBUTOR="Xiaohui Xie" CONTRIBUTOR_ORG="Broad Institute"
             DESCRIPTION_BRIEF="Genes with promoter regions [-2kb,2kb] around transcription start site containing the motif TGCNHNCWYCCYCATTAKTNNDCNMNHYCN which matches annotation for HOXA5: homeobox A5"
             DESCRIPTION_FULL="XX&lt;br&gt; XX&lt;br&gt; XX&lt;br&gt; FA  HOXA5&lt;br&gt; XX&lt;br&gt; SY  Hox-1.3; Hoxa-5; Chox-1.3 (chick).&lt;br&gt; XX&lt;br&gt; OS  mouse, Mus musculus&lt;br&gt; OC  eukaryota; animalia; metazoa; chordata; vertebrata; tetrapoda; mammalia;&lt;br&gt; OC  eutheria; rodentia; ..."
             TAGS="" 
             MEMBERS="HIST1H2AC,TNRC6,MGC4268,GNAI1,GPR161,LMO4,CALD1,PRRX1,GSH1,NEO1,SYNPO2L,LOC51059,PCTK1,BHC80,TRIM8,NACSIN,UNC5B,SMOC1,CGI-30,FEZL,TRPV6,UNC5C,PPP2R2B,TFAP2BL1,YES1,CASKIN1,NTF3,FOXA1,EMX2,COLEC10,NDFIP1,IGF1,HUMPPA,ZA20D1,SON,CCND1,FLJ44313,PDE2A,HOXB5,SVIL,HOXD4,CROC4,KCNH3,LOC124402,ZNF485,FBXO11"
             ...
             FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS="">
    </GENESET>
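
The difference between the two DESCRIPTION_BRIEF values can be checked with a small Python sketch. The pattern below is an assumption written to match the "which matches annotation for HOXA5: ..." text; the converter's actual regex (in Java) may differ, and `extract_tf_symbol` is a hypothetical name.

```python
import re

# Assumed pattern; the converter's real regex may be different.
TF_SYMBOL = re.compile(r"which matches annotation for ([\w-]+):")


def extract_tf_symbol(description_brief):
    """Return the TF gene symbol, or None when the text carries no annotation."""
    match = TF_SYMBOL.search(description_brief)
    return match.group(1) if match else None


# DESCRIPTION_BRIEF values copied from the two records above.
V52 = ("Genes with promoter regions [-2kb,2kb] around transcription start site "
       "containing the motif TGCNHNCWYCCYCATTAKTNNDCNMNHYCN which matches "
       "annotation for HOXA5: homeobox A5")
V60 = ("Genes having at least one occurence of the transcription factor binding "
       "site V$HOX13_01 (v7.4 TRANSFAC) in the regions spanning up to 4 kb "
       "around their transcription starting sites.")

print(extract_tf_symbol(V52))  # HOXA5
print(extract_tf_symbol(V60))  # None
```

The v6.0 text names only the TRANSFAC matrix (V$HOX13_01), so a gene symbol would have to come from another source.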

Finally, this converter ignores SUB_CATEGORY_CODE="MIR" entries:

    <GENESET STANDARD_NAME="GCGCTTT,MIR-518B,MIR-518C,MIR-518D" SYSTEMATIC_NAME="M11751" HISTORICAL_NAMES="" ORGANISM="Homo sapiens"
             PMID="" AUTHORS="" GEOID="" EXACT_SOURCE="" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="" CHIP="HUMAN_GENE_SYMBOL"
             CATEGORY_CODE="C3" SUB_CATEGORY_CODE="MIR" CONTRIBUTOR="Xiaohui Xie" CONTRIBUTOR_ORG="Broad Institute"
             DESCRIPTION_BRIEF="Targets of MicroRNA GCGCTTT,MIR-518B,MIR-518C,MIR-518D"
             DESCRIPTION_FULL="" TAGS=""
             MEMBERS="SCRT2,TBC1D10B,AP1G1,SOX11,TFE3,HMP19,NRXN1,PTPRU,TEAD3,TSN,KCNK12,MCF2L,BRUNOL4,OTP,HOXC8,KIAA2022,HOXA3,NFE2L1,RAP1B,ZNF608"
             MEMBERS_SYMBOLIZED="SCRT2,TBC1D10B,AP1G1,SOX11,TFE3,HMP19,NRXN1,PTPRU,TEAD3,TSN,KCNK12,MCF2L,CELF4,OTP,HOXC8,KIAA2022,HOXA3,NFE2L1,RAP1B,ZNF608"
             MEMBERS_EZID="85508,26000,164,6664,7030,51617,9378,10076,7005,7247,56660,23263,56853,23440,3224,340533,3200,4779,5908,57507"
             MEMBERS_MAPPING="SCRT2,SCRT2,85508|TBC1D10B,TBC1D10B,26000|AP1G1,AP1G1,164|SOX11,SOX11,6664|TFE3,TFE3,7030|HMP19,HMP19,51617|NRXN1,NRXN1,9378|PTPRU,PTPRU,10076|TEAD3,TEAD3,7005|TSN,TSN,7247|KCNK12,KCNK12,56660|MCF2L,MCF2L,23263|BRUNOL4,CELF4,56853|OTP,OTP,23440|HOXC8,HOXC8,3224|KIAA2022,KIAA2022,340533|HOXA3,HOXA3,3200|NFE2L1,NFE2L1,4779|RAP1B,RAP1B,5908|ZNF608,ZNF608,57507"
             FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS="">
    </GENESET>
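
As a rough illustration of keeping TFT records and skipping MIR ones, the Python sketch below filters on the two attributes shown above. The converter itself reads the data through the GSEA Java library, so this is not its code; the `<MSIGDB>` root element, the trimmed attribute lists, and the `tft_gene_sets` name are assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the records above; root element name is assumed.
SAMPLE = """<MSIGDB>
  <GENESET STANDARD_NAME="V$HOX13_01" CATEGORY_CODE="C3"
           SUB_CATEGORY_CODE="TFT" MEMBERS="HIST1H2AC,TNRC6,GNAI1"/>
  <GENESET STANDARD_NAME="GCGCTTT,MIR-518B,MIR-518C,MIR-518D" CATEGORY_CODE="C3"
           SUB_CATEGORY_CODE="MIR" MEMBERS="SCRT2,TBC1D10B,AP1G1"/>
</MSIGDB>"""


def tft_gene_sets(xml_text):
    """Map STANDARD_NAME to member genes for C3/TFT gene sets only."""
    root = ET.fromstring(xml_text)
    result = {}
    for gene_set in root.iter("GENESET"):
        if (gene_set.get("CATEGORY_CODE") == "C3"
                and gene_set.get("SUB_CATEGORY_CODE") == "TFT"):
            result[gene_set.get("STANDARD_NAME")] = gene_set.get("MEMBERS", "").split(",")
    return result


print(tft_gene_sets(SAMPLE))
```
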

IgorRodchenkov commented 7 years ago

Ok, done. Won't use v6.0 this time. Perhaps later on we'll implement a new converter that processes the original TRANSFAC tables instead of, or in addition to, MSigDB...

armish commented 7 years ago

Sorry that I wasn't able to help with this as I was away for a while, but looks like you finalized the implementation and resolved the issue already?

But yeah - it was a mistake. The products are the ones that should be modeled as Rnas, not the regulator, which should be a Protein instead. The motif part is just intermediate processing and I don't think it is relevant in the final BioPAX model, so it should be OK to drop any motif-related information from the output.

IgorRodchenkov commented 7 years ago

Yeah, I got it right. Thanks a lot Arman!