broadinstitute / seqr

web-based analysis tool for rare disease genomics
GNU Affero General Public License v3.0

Adding cis regulatory element noncoding region annotation (SCREEN) #2276

Closed anneodonnell closed 2 years ago

anneodonnell commented 2 years ago

Is your feature request related to a problem? Please describe.
This is another suggestion from Jessica Chong at UW that would also be useful to our team. For genome analysis, we don't know how to identify interesting noncoding regions. A group has put together a list of high-priority noncoding regions likely to be enriched for regions important for gene regulation.

There are almost 1 million regions (albeit each one is small, on the order of a few hundred bp).

SCREEN bed file (containing all predicted human cis regulatory elements in ENCODE by the Weng Lab): https://api.wenglab.org/screen_v13/fdownloads/GRCh38-ccREs.bed

Describe the solution you'd like
Initially, I was envisioning this as a region list we would search, but I suspect it's too many regions.

Describe alternatives you've considered
It might be better to instead add an annotation to variants that overlap these regions, noting TRUE/FALSE whether a variant is in one of these regions, and allowing this to be another annotation that can be used in filtering.

As these are mostly/all noncoding, they often won't be associated with a particular gene (in that we won't a priori know which gene is relevant for each element).

I think we'd also want the SVs annotated for whether they intersect/overlap with one of these regions.
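
A minimal sketch of the TRUE/FALSE overlap annotation described above, covering both single-position variants and SVs. This is an illustration only, not how seqr or its loading pipeline does it: the function names (`build_index`, `overlaps_region`, `snv_in_region`) are hypothetical, and it assumes BED intervals are 0-based and half-open while variant positions are 1-based.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def build_index(bed_rows):
    """Index (chrom, start, end) BED rows (0-based, half-open) per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in bed_rows:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        # Running maximum of interval ends, so overlapping inputs are handled too.
        max_ends = list(accumulate((e for _, e in intervals), max))
        index[chrom] = (starts, max_ends)
    return index


def overlaps_region(index, chrom, start, end):
    """True if the half-open query [start, end) overlaps any indexed interval."""
    if chrom not in index:
        return False
    starts, max_ends = index[chrom]
    k = bisect_left(starts, end)  # number of intervals whose start is < end
    return k > 0 and max_ends[k - 1] > start


def snv_in_region(index, chrom, pos):
    """Boolean annotation for a 1-based single-nucleotide position."""
    return overlaps_region(index, chrom, pos - 1, pos)


# Hypothetical usage with three intervals taken from the example rows in this thread.
index = build_index([("1", 713942, 714292), ("1", 714466, 714735), ("1", 715107, 715440)])
print(snv_in_region(index, "1", 714000))            # True: inside the first interval
print(snv_in_region(index, "1", 714300))            # False: between intervals
print(overlaps_region(index, "1", 714300, 714500))  # True: an SV spanning into the second interval
```

The same `overlaps_region` call serves SNVs and SVs; only the query span differs.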

Additional context
UW team is starting to use this bed file in analysis and may have more perspective to share if needed.

hanars commented 2 years ago

So one low-effort idea I have: if you create a gene list in seqr that specifies intervals instead of genes, any variant that falls in any of the intervals will get a gene list tag below the variant position, similar to the gene list tags that show up on genes. So we could just make a gene list of these regions in seqr, and while I agree that searching on it would not really work, we could add the gene list to our genome projects and the variants would always be flagged. This already works for SVs.


hanars commented 2 years ago

discussed offline, will do as a boolean annotation in the pipeline

jxchong commented 2 years ago

FYI, there are other similarly large/even larger region files (generated from other data sources) that highlight putative non-coding elements that are, for example, timing or tissue-specific. In the future, I could see that someone might want to be able to filter on, e.g. regions from a muscle-specific, regions from a brain-specific list, or regions that are active during fetal development months 4-6.

Just wanted to mention it now in case it affects how you implement this.

karynne7 commented 2 years ago

The way I implemented adding SCREEN annotations to GEMINI in our previous workflow was using this text file (bed file) formatted like this:

1   713942  714292  EH38E1310158    0   .   778562  778912  255,0,0 PLS,CTCF-bound  Cell-type-Agnostic-Classification
1   714466  714735  EH38E1310159    0   .   779086  779355  255,0,0 PLS,CTCF-bound  Cell-type-Agnostic-Classification
1   715107  715440  EH38E1310160    0   .   779727  780060  255,167,0   pELS,CTCF-bound Cell-type-Agnostic-Classification

It then annotates the SCREEN field with the "pELS,CTCF-bound"-type result in the database, not just a boolean. If this is too many regions for seqr, is it also too much to break out the file by locus into a table you can quickly annotate the hail matrix table with? I understand this would be a dense file, is this the performance slow down I have heard about? I haven't tried it, and I'm not a Hail expert, but this is how it makes sense to me with what I know.
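
To make the idea of annotating with the category string (rather than a boolean) concrete, here is a hedged sketch in plain Python. It is not the GEMINI workflow or a Hail implementation. It assumes the category sits in the tenth whitespace-separated column, as in the rows above, that intervals on a chromosome do not overlap, and that variant positions are 1-based. The names `parse_screen_bed` and `screen_label` are hypothetical.

```python
from bisect import bisect_right

SAMPLE = """\
1   713942  714292  EH38E1310158    0   .   778562  778912  255,0,0 PLS,CTCF-bound  Cell-type-Agnostic-Classification
1   714466  714735  EH38E1310159    0   .   779086  779355  255,0,0 PLS,CTCF-bound  Cell-type-Agnostic-Classification
1   715107  715440  EH38E1310160    0   .   779727  780060  255,167,0   pELS,CTCF-bound Cell-type-Agnostic-Classification
"""


def parse_screen_bed(text):
    """Return {chrom: (starts, intervals)} with intervals as (start, end, label)."""
    by_chrom = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or malformed lines
        chrom, start, end, label = fields[0], int(fields[1]), int(fields[2]), fields[9]
        by_chrom.setdefault(chrom, []).append((start, end, label))
    regions = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        regions[chrom] = ([s for s, _, _ in intervals], intervals)
    return regions


def screen_label(regions, chrom, pos):
    """Label of the region containing 1-based position `pos`, or None."""
    if chrom not in regions:
        return None
    starts, intervals = regions[chrom]
    i = bisect_right(starts, pos - 1) - 1  # last interval starting at or before pos
    if i >= 0 and intervals[i][1] > pos - 1:
        return intervals[i][2]
    return None


regions = parse_screen_bed(SAMPLE)
print(screen_label(regions, "1", 715200))  # pELS,CTCF-bound
print(screen_label(regions, "1", 714300))  # None
```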

jxchong commented 2 years ago

Following up on this now that it's the new year.

As Karynne described, it's useful to be able to filter on the type of region, so you might consider having multiple booleans, i.e. "is_pELS" "is_CTCF-bound" "is_PLS" because a user would likely want to first look at variants that are likely to have the strongest impact (impact a Promoter Like Sequence) which would be analogous to first analyzing variants annotated as High Impact.
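
As a rough illustration of the multiple-boolean idea, the sketch below splits a comma-separated SCREEN label into per-category flags. The flag names follow the ones suggested in this comment; the function name `screen_flags` and the default category tuple are assumptions made for the example.

```python
def screen_flags(label, categories=("PLS", "pELS", "CTCF-bound")):
    """Turn a label such as 'pELS,CTCF-bound' into one boolean per category."""
    parts = set(label.split(",")) if label else set()
    return {f"is_{name}": name in parts for name in categories}


print(screen_flags("pELS,CTCF-bound"))
# {'is_PLS': False, 'is_pELS': True, 'is_CTCF-bound': True}
```

A variant outside every region (label `None` or empty) gets all flags set to False, so the flags can be filtered on uniformly.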

lynnpais commented 2 years ago

Related pipeline ticket #288

mike-w-wilson commented 2 years ago

This has been added to the loading pipeline in https://github.com/broadinstitute/seqr-loading-pipelines/pull/355

hanars commented 2 years ago

Can you update here once there's a real index with this data in it?

mike-w-wilson commented 2 years ago

Index is r0652_pipeline_testwgsgrch38variantsv02vcfv3120220922

hanars commented 2 years ago

@anneodonnell this is now able to be used in seqr. Presumably we want this to be available for both search and displaying it on search results. For search, is adding a checkbox to the "annotations" section okay? And for display I was thinking of showing it similar to how we show loftee flags

anneodonnell commented 2 years ago

Agree we want it in both places. For annotations section, a check box for the various categories in SCREEN is good to select the ones that we want returned in the search.

These tags are to help us try to interpret noncoding variation - there is a lot of it and most of it isn't important. These annotations are not nearly as important as LoF tags and I don't think we want them standing out like that.

One option would be to include them at the bottom of the in silico annotations list with a yellow circle marking them (and if we find that any turn out to be particularly helpful as we use these metrics, we could change those to red circles in the future). Noncoding variants don't have many in silico scores, so there is room on the variant page there. And really, that's what SCREEN is: computational predictions based on a variety of data types. They can be at the end of the list, so for coding variants you would have to click "more" to see these annotations (which are designed more to help with noncoding variants anyway). Would also like @lynnpais to weigh in.


jxchong commented 2 years ago

I think it makes sense to put it in the list with in silico predictors like CADD/REVEL/etc (or alternatively where you would put the impact if the variant was coding). I think of the different SCREEN categories as the non-coding equivalents to missense, nonsense, splice variant.

anneodonnell commented 2 years ago

ooh, I like adding it above the variant ID as the variant type - right now that's blank! Jessica - are there SCREEN categories that should be one versus the other place? Or should they all be variant types?

jxchong commented 2 years ago

I think they are all variant types or at least all SCREEN categories are the same, albeit it's not predicting the actual impact the variant has on the region in question, just flagging that the variant is in such a region. Actually maybe a better analogy is that the SCREEN annotations are the equivalent of labeling a variant "coding" "exonic" or "intronic" -- aka not very specific but that's all we have for now!

jxchong commented 2 years ago

Thanks @hanars @mike-w-wilson in advance!!

lynnpais commented 2 years ago

Building on previous suggestions,

On the Variant Search page, under Annotations, include ‘cis-regulatory region variant’ and in the hover over mention ‘SCREEN: Search Candidate cis-Regulatory Elements by ENCODE. Registry of cCREs V3’, as we have other regulatory variant annotations.

For returned/saved variants, mention the variant type, ‘cis-regulatory region variant’, above the HGVS details, and as an in-silico score in yellow, ‘SCREEN’ with promoter-like signature, enhancer-like signature, CTCF-only or Linked gene, in the hover over similar to how we have predicted consequences for SpliceAI.

Looking forward to having these annotations in seqr!


hanars commented 2 years ago

> For returned/saved variants, mention the variant type, ‘cis-regulatory region variant’, above the HGVS details

Anne seemed to dislike the idea of showing a flag there and I'm not the biggest fan of showing the same data in 2 different places, but I can be convinced if we think this will be helpful

lynnpais commented 2 years ago

No prob, that is not too important. We can go with the default, which would be non-coding variant?

hanars commented 2 years ago

I've gotten a bit lost in the thread here - what is the consensus for how we want this to be shown on the variant?

lynnpais commented 2 years ago

Final version, ready to be implemented. Looking at the bed file, there appear to be seven groups of cis-regulatory elements: PLS, pELS, dELS, DNase-H3K4me3, CTCF-only, DNase-only, and low-DNase. There is a second annotation, 'CTCF-bound', for certain regions, but this is not important.

For seqr: on the Variant Search page, under Annotations > Other, list each of the seven groups: Promoter-like signatures (PLS), proximal Enhancer-like signatures (pELS), distal Enhancer-like signatures (dELS), DNase-H3K4me3, CTCF-only, DNase-only, and low-DNase. In the hover over mention ‘SCREEN: Search Candidate cis-Regulatory Elements by ENCODE. Registry of cCREs V3’.

For returned/saved variants, mention the variant type (one of the above seven groups), above the HGVS details. Also, display ‘SCREEN’ in yellow as an in-silico score for all relevant variants.
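
The search behavior proposed above (checkboxes for the seven groups, with 'CTCF-bound' ignored) could be sketched roughly as follows. This is not seqr's implementation: the variant dictionaries, the `screen` key, and the function names are hypothetical and exist only for illustration.

```python
# Display names for the seven groups listed in this comment.
SCREEN_GROUPS = {
    "PLS": "Promoter-like signatures (PLS)",
    "pELS": "proximal Enhancer-like signatures (pELS)",
    "dELS": "distal Enhancer-like signatures (dELS)",
    "DNase-H3K4me3": "DNase-H3K4me3",
    "CTCF-only": "CTCF-only",
    "DNase-only": "DNase-only",
    "low-DNase": "low-DNase",
}


def primary_group(label):
    """First token of a label like 'pELS,CTCF-bound' that is one of the seven groups."""
    for token in (label or "").split(","):
        if token in SCREEN_GROUPS:
            return token
    return None  # no SCREEN annotation, or only the 'CTCF-bound' modifier


def filter_variants(variants, selected_groups):
    """Keep variants whose primary SCREEN group was ticked in the search form."""
    selected = set(selected_groups)
    return [v for v in variants if primary_group(v.get("screen")) in selected]


variants = [
    {"id": "v1", "screen": "PLS,CTCF-bound"},
    {"id": "v2", "screen": "dELS"},
    {"id": "v3", "screen": None},
]
print([v["id"] for v in filter_variants(variants, ["PLS", "pELS"])])  # ['v1']
```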

hanars commented 2 years ago

Per my comment above, I strongly dislike showing the same information in two places. I think it is very confusing: it does not make clear that this is the same information, and instead suggests that two different callers support this call, which is false. Can we please choose one of the following:
a) Show "SCREEN: <pELS/dELS/etc.>" in yellow as an in silico predictor, OR
b) Show "SCREEN: <pELS/dELS/etc.>" above the HGVS details

anneodonnell commented 2 years ago

I prefer it in the hgvs details - My logic for this: while it's a prediction/predictor, it's not predicting the deleterious status of the variant but trying to predict the variant type.

hanars commented 2 years ago

This is now added to seqr. As we load new projects it will be enabled in them. Right now there is no project with these annotations, as we have an unrelated blocker on loading our latest data.