Ingest transcription factor binding databases for FA demonstrator

mellybelly commented 7 years ago

We need to pull a series of data to inform which genes to look for variants in.

This relates to Set-5.

There is a google doc here for reference, but essentially we need a gene set based upon any upstream transcriptional regulators of our FA primary genes (some may have alternate primary symbols): FANCA, FANCB, FANCC, FANCE, FANCF, FANCG, FANCL, FANCM, FANCD2, FANCI, UBE2T, FANCD1 (BRCA2), FANCJ, FANCN, FANCO, FANCP, FANCQ, FANCR, FANCS, FANCV, FANCU, FAAP100, FAAP24, FAAP20, FAAP16 (MHF1), FAAP10 (MHF2)
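Since some of these names are aliases rather than primary symbols, a small normalization step may help before querying any resource. The sketch below is only illustrative: apart from BRCA2, MHF1 and MHF2, the alias mappings are not from this ticket and should be verified against HGNC, and `FA_ALIASES` and `normalize` are hypothetical names.

```python
# Alias -> primary symbol. Only FANCD1 -> BRCA2 is stated in the ticket;
# the rest are assumptions to be verified against HGNC before use.
FA_ALIASES = {
    "FANCD1": "BRCA2",
    "FANCJ": "BRIP1",
    "FANCN": "PALB2",
    "FANCO": "RAD51C",
    "FANCP": "SLX4",
    "FANCQ": "ERCC4",
    "FANCR": "RAD51",
    "FANCS": "BRCA1",
    "FANCU": "XRCC2",
    "FANCV": "MAD2L2",
    "FAAP16": "CENPS",  # listed as MHF1 in the ticket
    "FAAP10": "CENPX",  # listed as MHF2 in the ticket
}


def normalize(symbols, aliases=FA_ALIASES):
    """Map aliases to primary symbols, drop duplicates, keep input order."""
    seen = set()
    primary_symbols = []
    for symbol in symbols:
        key = symbol.strip().upper()
        primary = aliases.get(key, key)
        if primary not in seen:
            seen.add(primary)
            primary_symbols.append(primary)
    return primary_symbols
```

For example, `normalize(["FANCA", "fancd1", "BRCA2"])` collapses the two BRCA2 spellings into one entry.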

Favored resource is JASPAR (http://jaspar.genereg.net/), as it is open source and provisioned by Wyeth's lab. TRANSFAC (http://genexplain.com/transfac/) may also be available for use, just to generate the queries.

@tknijnen - in addition to helping with query, perhaps you would like to, as per this ticket, assist in getting the data ingested as a collaborative exercise? @dnahotline will advise as needed @cmungall related to architecture

tknijnen commented 7 years ago

Yep, happy to help prepare data and ingest the data. I have worked with TRANSFAC and JASPAR in terms of transcription factor binding site prediction and motif scanning - waaay back (10 years or so). Let me know if you have specific ideas/datasets/downloads to start from. cheers

mellybelly commented 7 years ago

@tknijnen perhaps you can meet with @mbrush @TomConlin and/or @kshefchek and they can show you the ropes? This might be a nice way to have your perspective on our ETL processes and collaborate, especially if you are knowledgeable about these datasets (even if some time ago). @stuppie maybe also relevant for wikidata simultaneously? This is an important dataset for our FA queries.

mbrush commented 7 years ago

Hi @tknijnen. We could definitely use your help here, defining requirements for our use case and finding the best resources to target for ingest. Our goal is to identify which transcription factors regulate expression of Fanconi genes, to build a picture of their gene regulatory networks. From here, we can select the most promising tx factor genes to expand our gene set.

After some initial exploration, it seems that identifying upstream tx regulators for FA genes may not be so straightforward. Resources like JASPAR and TRANSFAC provide information about the sequence motif that is bound by a given transcription factor - but they don’t tell you which genes might be regulated by each transcription factor. Other resources such as ENCODE provide ChIP-seq datasets that demonstrate binding of tx factors to genomic regions, mostly based on cell line experiments.

Using data from such resources, one could perform a computational analysis that predicts what tx factors may regulate particular genes (based on identification of a tx factor's binding motifs near the gene, and some ChIP-seq evidence of the factor binding in this region). It seems that Ensembl has done just this to build a track that shows regulatory regions and predicted tx factor binding sites (see blog post here) - so this data may be a candidate resource for ingest. This would give us a set of predicted regulators of FA genes, but there may not be any experimental evidence to support such regulation.

There are also resources such as PAZAR and ORegAnno which curate the literature to pull experimentally validated associations of genes and tx factors that regulate them. But I don’t know how current or comprehensive they are, and would guess that the data we would pull here would be sparse and missing many associations that haven't been published on.

So, hoping your experience and expertise can guide us here @tknijnen - in identifying other resources or approaches we might explore, and helping to model the data as part of our DIPper ingest pipeline. My summary above is based on a very quick exploration of the landscape - so I'm sure I am missing something!

mbrush commented 7 years ago

Some other potential resources:
TRRUST: http://www.grnpedia.org/trrust/Network_search_form.php
RegNetwork: http://www.regnetworkweb.org/
OmicsTools list: https://omictools.com/gene-regulatory-network-data-category

TomConlin commented 7 years ago

UPDATE 2017 May 15

My parts in this ticket are effectively moved to:
- Jaspar_FA
- OrangeQ1.5_Regulatory_Motif_Signature

@mellybelly The approach I am exploring, following your directives

"need a gene set based upon any upstream transcriptional regulators" and
"Favored resource is JASPAR"
has led to:
Determining which of JASPAR's 141 motifs occur within 1 kbp, 2 kbp and 5 kbp upstream of a gene annotated on hg19.
This gives us:
31,253 genes with 103,832 motifs within 1kbp
34,937 genes with 133,534 motifs within 2kbp
40,072 genes with 206,316 motifs within 5kbp

Note: JASPAR has about 1.1M instances of its 141 motifs located on hg19; restricting to these upstream windows excludes over 80% of them.
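A minimal sketch of the window assignment described above, assuming each gene is reduced to a single TSS coordinate with a strand and each motif instance to a single position. The tuple layouts and function names are hypothetical and this is not the code used to produce the counts.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

WINDOWS = (1000, 2000, 5000)  # upstream window sizes in bp


def upstream_window(tss, strand, size):
    """Inclusive (start, end) of the `size` bp upstream of a TSS."""
    if strand == "+":
        return max(1, tss - size), tss - 1
    return tss + 1, tss + size  # minus strand: upstream is higher coordinates


def motifs_upstream(genes, motif_sites, windows=WINDOWS):
    """genes: (gene_id, chrom, tss, strand); motif_sites: (motif_id, chrom, pos).

    Returns {window: {gene_id: [motif_id, ...]}}, ordered by position.
    """
    by_chrom = defaultdict(list)
    for motif_id, chrom, pos in motif_sites:
        by_chrom[chrom].append((pos, motif_id))

    # Sort once per chromosome so each gene needs only two binary searches.
    index = {}
    for chrom, sites in by_chrom.items():
        sites.sort()
        index[chrom] = ([p for p, _ in sites], [m for _, m in sites])

    result = {w: {} for w in windows}
    for gene_id, chrom, tss, strand in genes:
        positions, ids = index.get(chrom, ([], []))
        for w in windows:
            start, end = upstream_window(tss, strand, w)
            lo = bisect_left(positions, start)
            hi = bisect_right(positions, end)
            if hi > lo:
                result[w][gene_id] = ids[lo:hi]
    return result
```

Genes with no motif in a window are simply absent from that window's dictionary, which matches counting "genes with motifs within N kbp".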

This allows questions such as: given this (FA) gene, which other genes have the same set of transcription factor motifs upstream, how close are those motifs, and how many are there?

This approach is purely mechanistic and does not seek to address whether a motif is in fact bound by a transcription factor that regulates the nearby gene.
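One way to pose the "which other genes share these motifs" question is to rank genes by overlap of their upstream motif sets. The sketch below uses Jaccard similarity, which is my choice for illustration and not something decided in this ticket; it ignores motif distance and multiplicity, and `shared_motif_genes` is a hypothetical name.

```python
def shared_motif_genes(query_gene, gene_motifs):
    """Rank other genes by upstream motif overlap with `query_gene`.

    gene_motifs: {gene_id: set of motif ids}.
    Returns [(gene_id, n_shared, jaccard)], best match first.
    """
    query_motifs = gene_motifs[query_gene]
    hits = []
    for gene, motifs in gene_motifs.items():
        if gene == query_gene:
            continue
        shared = query_motifs & motifs
        if shared:
            jaccard = len(shared) / len(query_motifs | motifs)
            hits.append((gene, len(shared), jaccard))
    # Highest similarity first; ties broken by shared count, then gene id.
    hits.sort(key=lambda h: (-h[2], -h[1], h[0]))
    return hits
```

Genes sharing no motif with the query are left out of the result entirely.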

IRI for motifs would be like
http://fantom.gsc.riken.jp/5/sstar/JASPAR_motif:MA0007.1
Genes are currently represented as RefSeq; we may want NCBIGene: instead.

The Sequence Ontology has:
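A small sketch of how these identifiers might be built. The motif base IRI is the one shown above; the `NCBIGene` expansion is an assumption (check it against the project's CURIE map), the ID pattern is inferred from the single example `MA0007.1`, and the function names are hypothetical. Going from RefSeq to NCBIGene would still need a lookup table, which is not shown.

```python
import re

JASPAR_MOTIF_BASE = "http://fantom.gsc.riken.jp/5/sstar/JASPAR_motif:"  # from the ticket
CURIE_MAP = {
    "NCBIGene": "http://www.ncbi.nlm.nih.gov/gene/",  # assumed expansion
}
# Assumed ID shape, inferred from the example MA0007.1.
MATRIX_ID = re.compile(r"^MA\d{4}\.\d+$")


def motif_iri(matrix_id):
    """Build a motif IRI from a JASPAR matrix id such as MA0007.1."""
    if not MATRIX_ID.match(matrix_id):
        raise ValueError(f"unexpected JASPAR matrix id: {matrix_id!r}")
    return JASPAR_MOTIF_BASE + matrix_id


def expand_curie(curie, curie_map=CURIE_MAP):
    """Expand a prefix:local CURIE into a full IRI."""
    prefix, _, local_id = curie.partition(":")
    if prefix not in curie_map or not local_id:
        raise ValueError(f"cannot expand CURIE: {curie!r}")
    return curie_map[prefix] + local_id
```

Usage would look like `motif_iri("MA0007.1")` or `expand_curie("NCBIGene:2175")`, where 2175 is just an illustrative numeric id.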

Thing 
            sequence_variant
                structural_variant 
                    feature_variant
                        intergenic_variant
                            upstream_gene_variant
                                2KB_upstream_variant
                                5KB_upstream_variant

which does not give me warm fuzzies

1) not wild about calling transcription factors variants
2) there is no 1KB
3) would expect 1KB part_of 2KB part_of 5KB

cmungall commented 7 years ago

I don't have SO open, but you'd use gene for the gene and some subclass of regulatory region for the region the TF binds to

mbrush commented 7 years ago

More specifically, SO provides a TF_binding_site class (SO_0000235). TF binding sites are canonical regulatory regions, not 'variants'. The upstream and downstream variant classes are meant to capture proximity to known genes when annotating intergenic variants.
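To make the modeling suggestion concrete, here is a rough sketch that types a predicted site as a TF_binding_site and the target as a gene, emitting plain tuples. SO:0000235 is the class named above and SO:0000704 is the SO gene class; the `EX:` predicates are hypothetical placeholders, not terms from SO, RO or Dipper, and the real ingest would need properly chosen relations.

```python
RDF_TYPE = "rdf:type"
SO_TF_BINDING_SITE = "SO:0000235"  # TF_binding_site, as named in this thread
SO_GENE = "SO:0000704"             # SO class for gene

# Hypothetical placeholder predicates, for illustration only.
MATCHES_MOTIF = "EX:matches_motif"
UPSTREAM_OF = "EX:upstream_of"
DISTANCE_BP = "EX:distance_bp"


def site_triples(site_id, motif_id, gene_id, distance_bp):
    """Return (subject, predicate, object) tuples for one predicted site."""
    return [
        (site_id, RDF_TYPE, SO_TF_BINDING_SITE),
        (gene_id, RDF_TYPE, SO_GENE),
        (site_id, MATCHES_MOTIF, motif_id),
        (site_id, UPSTREAM_OF, gene_id),
        (site_id, DISTANCE_BP, str(distance_bp)),
    ]
```

Keeping distance as a property of the site avoids needing separate 1KB/2KB/5KB classes, though that is a design choice to discuss rather than a settled decision.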

mbrush commented 7 years ago

Some links to explore gene regulation data in Ensembl, including Tx Factor binding sites in regulatory regions validated by experimental data (e.g. ChIP-seq from ENCODE). They have an API through which we may be able to dynamically access the data we need. Or we may decide to model and ingest via dipper.

Blog post about this work: http://www.ensembl.info/blog/2011/05/18/transcription-factor-binding-sites-in-ensembl/
Example record: http://www.ensembl.org/Homo_sapiens/Regulation/Context?db=core;fdb=funcgen;r=13:32314000-32317601;rf=ENSR00000060894
More documentation: http://www.ensembl.org/info/genome/funcgen/index.html
Papers about Reg Build: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0621-5 and https://academic-oup-com.liboff.ohsu.edu/database/article/doi/10.1093/database/bav119/2630094/Ensembl-regulation-resources
Ensembl Regulation API and API tutorial: http://www.ensembl.org/info/docs/api/funcgen/index.html and http://www.ensembl.org/info/docs/api/funcgen/regulation_tutorial.html
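The links above are to the Perl funcgen API. As an alternative for quick exploration, Ensembl also runs a REST service; the sketch below assumes its `overlap/region` endpoint with `feature=regulatory` and `feature=motif`, which should be checked against the current REST documentation. The region is the one from the example record above, and the default server serves the current assembly rather than hg19. Function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"  # assumed; current assembly, not hg19


def overlap_url(region, features=("regulatory", "motif"), species="human"):
    """Build an overlap/region URL for a region such as 13:32314000-32317601."""
    query = ";".join(f"feature={feature}" for feature in features)
    return f"{REST_SERVER}/overlap/region/{species}/{quote(region, safe=':')}?{query}"


def fetch_overlap(region, features=("regulatory", "motif"), timeout=30):
    """Fetch overlapping features as parsed JSON (performs a network call)."""
    request = Request(
        overlap_url(region, features),
        headers={"Content-Type": "application/json", "Accept": "application/json"},
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_overlap("13:32314000-32317601")` would return whatever features the service reports for that region; the shape of the records is not assumed here.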

tknijnen commented 7 years ago

Dear Matt (nice to meet you!), I checked Ensembl for TF binding sites. I think it looks really nice, well-documented and a good start for what we need for the hackathon. I think that we should use this for the time being. We could later evaluate whether we need to ingest additional data ourselves and/or implement and run computational approaches (such as TF scanning) to create the data that we want to have.

kshefchek commented 7 years ago

cc @wrighth

NCATS-Tangerine / ncats-ingest

Ingest transcription factor binding databases for FA demonstrator #21
