Closed by IgorRodchenkov 7 years ago
@emekdemir @ozgunbabur @armish @gbader Do you have any thoughts in this regard? Am I right that Rna controllers should be changed to Proteins?.. A.S.A.P.
Look here.
MSigDB says that C3 collection contains both miRNA targets and TF targets. We should use an RNA in the former and a Protein in the latter. I did not look at the data in detail though.
The README here says that we use TFT only records from C3, no?
Also, see this line.
Looks like, although it says "TF" in the README, the converter in fact processed only "Motif" C3 sub-set - that's why Rna... OMG... What shall we do then? Fix the README and keep converting and importing "Motif" (miRna as controllers..) or "TFT" (Proteins), or both?
... ah, I don't see any "Motif" (case insensitive) in the MSigDB 6.0 (perhaps that was in v5_2...)
Well, looks like "TFT" becomes "Motif" after parsing the input data with the GSEA java lib... So, we do process only TFT data.
In that case, yes I guess it should be a Protein, not an RNA.
Each C3 TFT
MSigDB 6.0 now has slightly different content from the 5.2 release that we processed before. We cannot extract the TF gene symbol, such as "HOXA5", anymore; it's gone. TFs (mistakenly modelled as Rna) are now simply impossible to get from the MSigDB XML data alone. In MSigDB v5.2, the converter extracts a TF gene symbol from the DESCRIPTION_BRIEF attribute by using a regex that matches the "which matches annotation for HOXA5: ..." text to get "HOXA5" (this way it also skips unknown motifs/TFs).
MSigDB v6.0 example:
<GENESET STANDARD_NAME="HOX13_01" SYSTEMATIC_NAME="M9036" HISTORICAL_NAMES="" ORGANISM="Homo sapiens" PMID=""
AUTHORS="" GEOID="" EXACT_SOURCE="" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="" CHIP="HUMAN_GENE_SYMBOL"
CATEGORY_CODE="C3" SUB_CATEGORY_CODE="TFT" CONTRIBUTOR="Xiaohui Xie" CONTRIBUTOR_ORG="Broad Institute"
DESCRIPTION_BRIEF="Genes having at least one occurence of the transcription factor binding site V$HOX13_01 (v7.4 TRANSFAC) in the regions spanning up to 4 kb around their transcription starting sites."
DESCRIPTION_FULL="" TAGS=""
MEMBERS="HIST1H2AC,TNRC6,MGC4268,GNAI1,GPR161,LMO4,CALD1,PRRX1,GSH1,NEO1,SYNPO2L,LOC51059,PCTK1,BHC80,TRIM8,NACSIN,UNC5B,SMOC1,CGI-30,FEZL,TRPV6,UNC5C,PPP2R2B,TFAP2BL1,YES1,CASKIN1,NTF3,FOXA1,EMX2,COLEC10,NDFIP1,IGF1,HUMPPA,ZA20D1,SON,CCND1,FLJ44313,PDE2A,HOXB5,SVIL,HOXD4,CROC4,KCNH3,LOC124402,ZNF485,FBXO11"
MEMBERS_SYMBOLIZED="HIST1H2AC,TNRC6A,AMMECR1L,GNAI1,GPR161,LMO4,CALD1,PRRX1,GSX1,NEO1,SYNPO2L,FAM135B,CDK16,PHF21A,TRIM8,EHBP1,UNC5B,SMOC1,DPH5,FEZF2,TRPV6,UNC5C,PPP2R2B,TFAP2D,YES1,CASKIN1,NTF3,FOXA1,EMX2,COLEC10,NDFIP1,IGF1,CDR2L,OTUD7B,SON,CCND1,FLJ44313,PDE2A,HOXB5,SVIL,HOXD4,C1orf61,KCNH3,FAM100A,ZNF485,FBXO11" ...
FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS="">
</GENESET>
MSigDB v5.2 example:
<GENESET STANDARD_NAME="V$HOX13_01" SYSTEMATIC_NAME="M9036" HISTORICAL_NAMES="" ORGANISM="Homo sapiens" PMID=""
AUTHORS="" GEOID="" EXACT_SOURCE="" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="" CHIP="HUMAN_GENE_SYMBOL"
CATEGORY_CODE="C3" SUB_CATEGORY_CODE="TFT" CONTRIBUTOR="Xiaohui Xie" CONTRIBUTOR_ORG="Broad Institute"
DESCRIPTION_BRIEF="Genes with promoter regions [-2kb,2kb] around transcription start site containing the motif TGCNHNCWYCCYCATTAKTNNDCNMNHYCN which matches annotation for HOXA5: homeobox A5"
DESCRIPTION_FULL="XX<br> XX<br> XX<br> FA HOXA5<br> XX<br> SY Hox-1.3; Hoxa-5; Chox-1.3 (chick).<br> XX<br> OS mouse, Mus musculus<br> OC eukaryota; animalia; metazoa; chordata; vertebrata; tetrapoda; mammalia;<br> OC eutheria; rodentia; ..."
TAGS=""
MEMBERS="HIST1H2AC,TNRC6,MGC4268,GNAI1,GPR161,LMO4,CALD1,PRRX1,GSH1,NEO1,SYNPO2L,LOC51059,PCTK1,BHC80,TRIM8,NACSIN,UNC5B,SMOC1,CGI-30,FEZL,TRPV6,UNC5C,PPP2R2B,TFAP2BL1,YES1,CASKIN1,NTF3,FOXA1,EMX2,COLEC10,NDFIP1,IGF1,HUMPPA,ZA20D1,SON,CCND1,FLJ44313,PDE2A,HOXB5,SVIL,HOXD4,CROC4,KCNH3,LOC124402,ZNF485,FBXO11"
...
FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS="">
</GENESET>
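To make the difference between the two records concrete, here is a minimal Python sketch of the extraction step. The pattern and the function name `extract_tf_symbol` are assumptions for illustration; only the phrase "which matches annotation for ..." comes from the v5.2 record above, and the converter's actual regex may differ.

```python
import re

# Assumed pattern, built from the v5.2 DESCRIPTION_BRIEF wording quoted above.
# The real converter's regex is not shown in this thread.
TF_SYMBOL = re.compile(r"which matches annotation for ([\w.-]+):")


def extract_tf_symbol(description_brief):
    """Return the TF gene symbol, or None if the text names no annotated factor."""
    match = TF_SYMBOL.search(description_brief)
    return match.group(1) if match else None


# DESCRIPTION_BRIEF values copied from the two GENESET records above.
V52 = ("Genes with promoter regions [-2kb,2kb] around transcription start site "
       "containing the motif TGCNHNCWYCCYCATTAKTNNDCNMNHYCN which matches "
       "annotation for HOXA5: homeobox A5")
V60 = ("Genes having at least one occurence of the transcription factor binding "
       "site V$HOX13_01 (v7.4 TRANSFAC) in the regions spanning up to 4 kb around "
       "their transcription starting sites.")

print(extract_tf_symbol(V52))  # HOXA5
print(extract_tf_symbol(V60))  # None
```

The v6.0 text only carries the TRANSFAC matrix name (V$HOX13_01), so a symbol-based pattern like this one finds nothing to capture.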
Finally, this converter ignores SUB_CATEGORY_CODE="MIR" entries:
<GENESET STANDARD_NAME="GCGCTTT,MIR-518B,MIR-518C,MIR-518D" SYSTEMATIC_NAME="M11751" HISTORICAL_NAMES="" ORGANISM="Homo sapiens"
PMID="" AUTHORS="" GEOID="" EXACT_SOURCE="" GENESET_LISTING_URL="" EXTERNAL_DETAILS_URL="" CHIP="HUMAN_GENE_SYMBOL"
CATEGORY_CODE="C3" SUB_CATEGORY_CODE="MIR" CONTRIBUTOR="Xiaohui Xie" CONTRIBUTOR_ORG="Broad Institute"
DESCRIPTION_BRIEF="Targets of MicroRNA GCGCTTT,MIR-518B,MIR-518C,MIR-518D"
DESCRIPTION_FULL="" TAGS=""
MEMBERS="SCRT2,TBC1D10B,AP1G1,SOX11,TFE3,HMP19,NRXN1,PTPRU,TEAD3,TSN,KCNK12,MCF2L,BRUNOL4,OTP,HOXC8,KIAA2022,HOXA3,NFE2L1,RAP1B,ZNF608"
MEMBERS_SYMBOLIZED="SCRT2,TBC1D10B,AP1G1,SOX11,TFE3,HMP19,NRXN1,PTPRU,TEAD3,TSN,KCNK12,MCF2L,CELF4,OTP,HOXC8,KIAA2022,HOXA3,NFE2L1,RAP1B,ZNF608"
MEMBERS_EZID="85508,26000,164,6664,7030,51617,9378,10076,7005,7247,56660,23263,56853,23440,3224,340533,3200,4779,5908,57507"
MEMBERS_MAPPING="SCRT2,SCRT2,85508|TBC1D10B,TBC1D10B,26000|AP1G1,AP1G1,164|SOX11,SOX11,6664|TFE3,TFE3,7030|HMP19,HMP19,51617|NRXN1,NRXN1,9378|PTPRU,PTPRU,10076|TEAD3,TEAD3,7005|TSN,TSN,7247|KCNK12,KCNK12,56660|MCF2L,MCF2L,23263|BRUNOL4,CELF4,56853|OTP,OTP,23440|HOXC8,HOXC8,3224|KIAA2022,KIAA2022,340533|HOXA3,HOXA3,3200|NFE2L1,NFE2L1,4779|RAP1B,RAP1B,5908|ZNF608,ZNF608,57507"
FOUNDER_NAMES="" REFINEMENT_DATASETS="" VALIDATION_DATASETS="">
</GENESET>
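The sub-category filter described above could look roughly like the following Python sketch. It uses the standard library XML parser as a stand-in for the GSEA Java library that the converter actually relies on; the `MSIGDB` root element, the abbreviated attributes in `SAMPLE`, and the function name `tft_gene_sets` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for the MSigDB XML; the root element name is assumed
# and most attributes of the real records are left out.
SAMPLE = """<MSIGDB>
  <GENESET STANDARD_NAME="V$HOX13_01" CATEGORY_CODE="C3"
           SUB_CATEGORY_CODE="TFT" MEMBERS="HIST1H2AC,TNRC6,MGC4268"/>
  <GENESET STANDARD_NAME="GCGCTTT,MIR-518B,MIR-518C,MIR-518D" CATEGORY_CODE="C3"
           SUB_CATEGORY_CODE="MIR" MEMBERS="SCRT2,TBC1D10B,AP1G1"/>
</MSIGDB>"""


def tft_gene_sets(xml_text):
    """Return (name, members) pairs for C3/TFT gene sets, skipping MIR and others."""
    root = ET.fromstring(xml_text)
    selected = []
    for gene_set in root.iter("GENESET"):
        if (gene_set.get("CATEGORY_CODE") == "C3"
                and gene_set.get("SUB_CATEGORY_CODE") == "TFT"):
            members = gene_set.get("MEMBERS", "").split(",")
            selected.append((gene_set.get("STANDARD_NAME"), members))
    return selected


for name, members in tft_gene_sets(SAMPLE):
    print(name, len(members))
```

Handling miRNA targets later would mean adding a second branch for `SUB_CATEGORY_CODE="MIR"` rather than changing this filter.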
Ok, done. Won't use v6.0 this time. Perhaps later on, we're to implement a new converter that processes original TRANSFAC tables instead of or in addition to MSigDB...
Sorry that I wasn't able to help with this as I was away for a while, but looks like you finalized the implementation and resolved the issue already?
But yeah - it was a mistake. The products are the ones that should be modeled as Rnas, not the regulator, which should be a Protein instead. The motif part is just intermediate processing and I don't think it is relevant in the final BioPAX model, so should be OK to drop any motif-related information from the output.
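As a rough illustration of the corrected model, the sketch below uses plain Python dataclasses, not Paxtools or any real BioPAX API. The class and function names are hypothetical, and creating one TemplateReaction per target gene is an assumption rather than something this thread specifies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entity:
    biopax_class: str  # "Protein" or "Rna"
    name: str


@dataclass
class TemplateReaction:
    products: List[Entity]


@dataclass
class TemplateReactionRegulation:
    controller: Entity
    controlled: TemplateReaction


def model_tf_targets(tf_symbol, target_symbols):
    """Model the TF as a Protein controller and each target as an Rna product."""
    controller = Entity("Protein", tf_symbol)
    # Assumption: one reaction per target gene; motif details are dropped.
    return [
        TemplateReactionRegulation(controller, TemplateReaction([Entity("Rna", symbol)]))
        for symbol in target_symbols
    ]


regulations = model_tf_targets("HOXA5", ["GNAI1", "LMO4"])
for regulation in regulations:
    product = regulation.controlled.products[0]
    print(regulation.controller.biopax_class, "->", product.biopax_class, product.name)
```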
Yeah, I got it right. Thanks a lot Arman!
All the TFs (to be used as the TemplateReactionRegulation controller) are created there as Rna rather than Protein. Why? Sounds like a mistake.