Open mellybelly opened 7 years ago
Yep, happy to help prepare data and ingest the data. I have worked with TRANSFAC and JASPAR in terms of transcription factor binding site prediction and motif scanning - waaay back (10 years or so). Let me know if you have specific ideas/datasets/downloads to start from. cheers
@tknijnen perhaps you can meet with @mbrush @TomConlin and/or @kshefchek and they can show you the ropes? This might be a nice way to get your perspective on our ETL processes and collaborate, especially if you are knowledgeable about these datasets (even if some time ago). @stuppie maybe also relevant for wikidata simultaneously? This is an important dataset for our FA queries.
Hi @tknijnen. We could definitely use your help here, defining requirements for our use case and finding the best resources to target for ingest. Our goal is to identify which transcription factors regulate expression of Fanconi genes, to build a picture of their gene regulatory networks. From here, we can select the most promising tx factor genes to expand our gene set.
After some initial exploration, it seems that identifying upstream tx regulators for FA genes may not be so straightforward. Resources like JASPAR and TRANSFAC provide information about the sequence motif that is bound by a given transcription factor - but they don’t tell you which genes might be regulated by each transcription factor. Other resources such as ENCODE provide ChIP-seq datasets that demonstrate binding of tx factors to genomic regions, mostly based on cell line experiments.
Using data from such resources, one could perform a computational analysis that predicts what tx factors may regulate particular genes (based on identification of a tx factor's binding motifs near the gene, and some ChIP-seq evidence of the factor binding in this region). It seems that Ensembl has done just this to build a track that shows regulatory regions and predicted tx factor binding sites (see blog post here) - so this data may be a candidate resource for ingest. This would give us a set of predicted regulators of FA genes, but there may not be any experimental evidence to support such regulation.
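To make the idea concrete, here is a rough sketch of that kind of analysis: keep a motif hit as a predicted regulator only when it falls in a window upstream of the gene's TSS and also overlaps a ChIP-seq peak for the same factor. The data layout, the 5 kb default window, and all function names are my own assumptions for illustration; this is not how Ensembl builds its regulatory track.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def upstream_window(chrom, tss, strand, size=5000):
    """Window of `size` bp upstream of the TSS, respecting strand."""
    if strand == "+":
        return Interval(chrom, max(0, tss - size), tss)
    return Interval(chrom, tss + 1, tss + 1 + size)


def predict_regulators(genes, motif_hits, chip_peaks, size=5000):
    """Predict TFs per gene.

    genes:      {symbol: (chrom, tss, strand)}
    motif_hits: [(tf_name, Interval)]    e.g. from a motif scan
    chip_peaks: {tf_name: [Interval]}    e.g. from ChIP-seq peak calls
    """
    predictions = {}
    for symbol, (chrom, tss, strand) in genes.items():
        window = upstream_window(chrom, tss, strand, size)
        tfs = set()
        for tf, hit in motif_hits:
            if not overlaps(hit, window):
                continue  # motif is not upstream of this gene
            if any(overlaps(hit, peak) for peak in chip_peaks.get(tf, [])):
                tfs.add(tf)  # motif hit is backed by a binding peak
        predictions[symbol] = tfs
    return predictions
```

As the paragraph above says, the output is still only a set of predicted regulators: co-location of a motif and a peak does not show that the factor actually regulates the gene.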
There are also resources such as PAZAR and ORegAnno which curate the literature to pull experimentally validated associations of genes and tx factors that regulate them. But I don’t know how current or comprehensive they are, and would guess that the data we would pull here would be sparse and missing many associations that haven't been published on.
So, hoping your experience and expertise can guide us here @tknijnen - in identifying other resources or approaches we might explore, and helping to model the data as part of our DIPper ingest pipeline. My summary above is based on a very quick exploration of the landscape - so I'm sure I am missing something!
Some other potential resources:
- TRRUST: http://www.grnpedia.org/trrust/Network_search_form.php
- RegNetwork: http://www.regnetworkweb.org/
- OmicsTools list: https://omictools.com/gene-regulatory-network-data-category
My parts in this ticket are effectively moved to:
-Jaspar_FA
-OrangeQ1.5_Regulatory_Motif_Signature
@mellybelly The approach I am exploring following your directives
note: JASPAR has about 1.1M instances of their 141 motifs located on hg19; this excludes over 80% of them.
This allows questions such as: given this (FA) gene, which other genes have the same set of transcription factor motifs nearby, how close are those motifs, and how many are there?
This approach is purely mechanistic, and does not seek to address whether a motif is in fact a functional transcription factor binding site for a nearby gene.
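A minimal sketch of the kind of question this supports, assuming the motif instances have already been reduced to (gene, motif ID, distance to gene) rows. That row layout and the function names are hypothetical, not the actual Dipper model.

```python
from collections import defaultdict


def index_hits(hits):
    """Group rows of (gene, motif_id, distance_bp) as gene -> motif -> [distances]."""
    index = defaultdict(lambda: defaultdict(list))
    for gene, motif, distance in hits:
        index[gene][motif].append(distance)
    return index


def genes_sharing_motifs(index, query_gene):
    """For every other gene, report the motifs it shares with `query_gene`.

    Returns [(gene, {motif_id: (count, closest_distance_bp)})], with the
    genes sharing the most motifs listed first.
    """
    query_motifs = set(index.get(query_gene, {}))
    results = []
    for gene, motifs in index.items():
        if gene == query_gene:
            continue
        shared = query_motifs & set(motifs)
        if not shared:
            continue
        summary = {m: (len(motifs[m]), min(motifs[m])) for m in sorted(shared)}
        results.append((gene, summary))
    results.sort(key=lambda item: (-len(item[1]), item[0]))
    return results
```

Like the approach itself, this only counts co-occurring motifs and says nothing about whether any of them is functional.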
IRI for motifs would be like
http://fantom.gsc.riken.jp/5/sstar/JASPAR_motif:MA0007.1
Genes are currently represented as RefSeq; we may want NCBIGene: instead.
The Sequence Ontology has:
Thing
sequence_variant
structural_variant
feature_variant
intergenic_variant
upstream_gene_variant
2KB_upstream_variant
5KB_upstream_variant
which does not give me warm fuzzies
1) not wild about calling transcription factors variants
2) there is no 1KB
3) would expect 1KB part_of 2KB part_of 5KB
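To illustrate point 3, here is a small sketch of nested upstream bins, where anything within 1 kb is also within 2 kb and 5 kb. The bin names are made up for the example and are not SO terms.

```python
# Hypothetical nested bins: (label, maximum distance upstream in bp).
UPSTREAM_BINS = (
    ("1KB_upstream", 1000),
    ("2KB_upstream", 2000),
    ("5KB_upstream", 5000),
)


def upstream_bins(distance_bp):
    """Return every bin that contains a site `distance_bp` upstream of a gene.

    Because the bins are nested, a site in the 1 kb bin is also reported
    in the 2 kb and 5 kb bins.
    """
    if distance_bp < 0:
        raise ValueError("distance_bp must be non-negative")
    return [label for label, limit in UPSTREAM_BINS if distance_bp <= limit]
```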
I don't have SO open, but you'd use gene for the gene and some subclass of regulatory region for the region the TF binds to
More specifically, SO provides a TF_binding_site class (SO_0000235). TF binding sites are canonical regulatory regions, not 'variants'. The upstream and downstream variant classes are meant to capture proximity to known genes when annotating intergenic variants.
Some links to explore gene regulation data in Ensembl, including Tx Factor binding sites in regulatory regions validated by experimental data (e.g. ChIP-seq from ENCODE). They have an API through which we may be able to dynamically access the data we need. Or we may decide to model and ingest via Dipper.
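For the dynamic-access option, a hedged sketch of building a request URL for the Ensembl REST `overlap/region` endpoint, asking for `regulatory` and `motif` features in a region. The helper name and the coordinates in the usage line are placeholders I made up. Also note that rest.ensembl.org serves the current assembly, so hg19 coordinates would need the GRCh37 server instead.

```python
from urllib.parse import urlencode

ENSEMBL_REST = "https://rest.ensembl.org"


def overlap_url(chrom, start, end, features=("regulatory", "motif"),
                species="human", server=ENSEMBL_REST):
    """Build an overlap/region URL; the request itself is left to the caller.

    JSON output can be requested by sending the header
    'Content-Type: application/json' with the GET request.
    """
    query = urlencode([("feature", f) for f in features])
    return f"{server}/overlap/region/{species}/{chrom}:{start}-{end}?{query}"


# Placeholder coordinates, not a real FA gene locus.
example = overlap_url("16", 100, 200)
```

Whether the returned features carry enough evidence detail for our use case would still need checking against the actual responses.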
Dear Matt, (nice to meet you!), I checked Ensembl for TF binding sites. I think it looks really nice, well-documented and a good start for what we need for the hackathon. I think that we should use this for the time being. We could later evaluate whether we need to ingest additional data ourselves and/or implement and run computational approaches (such as TF scanning) to create the data that we want to have.
cc @wrighth
We need to pull a series of data to inform which genes to look for variants in.
There is a google doc here for reference (this relates to Set-5), but essentially we need a gene set based upon any upstream transcriptional regulators of our FA primary genes (some may have alternate primary symbols): FANCA, FANCB, FANCC, FANCE, FANCF, FANCG, FANCL, FANCM, FANCD2, FANCI, UBE2T, FANCD1 (BRCA2), FANCJ, FANCN, FANCO, FANCP, FANCQ, FANCR, FANCS, FANCV, FANCU, FAAP100, FAAP24, FAAP20, FAAP16 (MHF1), FAAP10 (MHF2)
Favored resource is JASPAR (http://jaspar.genereg.net/), as it is open source and provisioned by Wyeth's lab. TRANSFAC (http://genexplain.com/transfac/) may also be available for use just to generate the queries.
@tknijnen - in addition to helping with query, perhaps you would like to, as per this ticket, assist in getting the data ingested as a collaborative exercise? @dnahotline will advise as needed @cmungall related to architecture