TheJacksonLaboratory / Gopher

GOPHER documentation
https://thejacksonlaboratory.github.io/Gopher/

Testing of the feature 'regulatory exome' #123

Closed hansenp closed 6 years ago

hansenp commented 6 years ago

To test the regulatory exome feature, I:

  1. Checked out the latest version of VPV (05813531ca8ec6abe640d344efbb5c0379b7b5e8).

  2. Created a new project _test_regexome.

  3. Created viewpoints for the gene symbols MFHAS1, CXCL16, ASNS, SLC1A3, ZPBP2, UBE2S, SYNJ2, MARCH6, TLR4, and RORC using the extended approach with the following parameters:

image

  4. Clicked on Exome > Download Regulation Data. An open file dialog opened. Created an empty directory _test_regexome. Confirmed by clicking Open. The file

homo_sapiens.GRCh37.Regulatory_Build.regulatory_features.20161117.gff.gz

was downloaded and saved to the directory _test_regexome. The file has 280027 lines and each line looks as follows:

15 Regulatory_Build regulatory_region 102118789 102119129 . . . ID=ENSR00000368862;bound_end=102119230;bound_start=102118695;description=Transcription factor binding site;feature_type=TF binding site

  5. Clicked on Exome > Build regulatory exome. An open file dialog opened. Chose directory _test_regexome. Confirmed by clicking Open. The file

test_reg_exome-regulatoryExomePanel.bed

was created in _test_regexome. The file has 327 lines. The lines look e.g. as follows:

chr1    151778546   151780109   RORC-exon151778546-151780109
chr1    151788001   151788800   ENSR00000347081[CTCF_BINDING_SITE]
chr1    151789801   151790400   ENSR00000018174[PROMOTER_FLANKING_REGION]
chr1    151794601   151795200   ENSR00000347084[CTCF_BINDING_SITE]
chr1    151805016   151805532   ENSR00000347086[TF_BINDING_SITE]
chr1    151809200   151809401   ENSR00000018181[ENHANCER]
chr1    151813639   151814090   ENSR00000018183[OPEN_CHROMATIN]
chr5    10353750    10354029    MARCH6-exon10353750-10354029
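
The line from the downloaded Regulatory Build file shown in the download step above can be taken apart with a few lines of Python. This is only a sketch for inspecting the file by hand, not GOPHER's own parser: the function names `parse_gff_attributes` and `parse_regulatory_line` are hypothetical, and the whitespace fallback is an assumption made because the sample line in this thread shows spaces where a GFF file would normally have tabs.

```python
def parse_gff_attributes(field):
    """Split a GFF attribute column ('key=value;key=value') into a dict."""
    attributes = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        key, _, value = pair.partition("=")
        attributes[key] = value
    return attributes


def parse_regulatory_line(line):
    """Parse one regulatory-feature line into coordinates and attributes."""
    line = line.strip()
    # GFF has nine tab-separated columns; the ninth holds the attributes.
    cols = line.split("\t")
    if len(cols) != 9:
        # Assumption: tabs were flattened to spaces. Splitting at most eight
        # times keeps attribute values such as 'TF binding site' intact.
        cols = line.split(None, 8)
    return {
        "chrom": cols[0],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "attributes": parse_gff_attributes(cols[8]),
    }


# The example line quoted in this thread.
sample_line = (
    "15 Regulatory_Build regulatory_region 102118789 102119129 . . . "
    "ID=ENSR00000368862;bound_end=102119230;bound_start=102118695;"
    "description=Transcription factor binding site;feature_type=TF binding site"
)
feature = parse_regulatory_line(sample_line)
print(feature["chrom"], feature["start"], feature["end"])
print(feature["attributes"]["ID"], feature["attributes"]["feature_type"])
```
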

hansenp commented 6 years ago

The file test_reg_exome-regulatoryExomePanel.bed

contains duplicated entries. The first lines look as follows:

chr1    151778546   151780109   RORC-exon151778546-151780109
chr1    151778546   151780109   RORC-exon151778546-151780109
chr1    151783800   151783910   RORC-exon151783800-151783910
chr1    151783800   151783910   RORC-exon151783800-151783910
chr1    151785422   151785533   RORC-exon151785422-151785533
chr1    151785422   151785533   RORC-exon151785422-151785533
chr1    151785714   151785822   RORC-exon151785714-151785822
chr1    151785714   151785822   RORC-exon151785714-151785822
chr1    151785963   151786096   RORC-exon151785963-151786096
chr1    151785963   151786096   RORC-exon151785963-151786096
chr1    151787049   151787171   RORC-exon151787049-151787171
chr1    151787049   151787171   RORC-exon151787049-151787171
chr1    151787388   151787901   RORC-exon151787388-151787901
chr1    151787388   151787901   RORC-exon151787388-151787901
chr1    151788001   151788800   ENSR00000347081[CTCF_BINDING_SITE]
chr1    151789139   151789281   RORC-exon151789139-151789281
chr1    151789139   151789281   RORC-exon151789139-151789281
chr1    151789670   151789756   RORC-exon151789670-151789756
chr1    151789670   151789756   RORC-exon151789670-151789756

Is this the intended behaviour?
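
For anyone who wants to reproduce this check, the following is a minimal sketch that counts repeated entries in a BED file. It assumes four whitespace-separated columns (chromosome, start, end, name), as in the lines above. The function name `find_duplicate_bed_entries` is hypothetical, and the sample is a subset of the lines quoted in this comment.

```python
from collections import Counter


def find_duplicate_bed_entries(lines):
    """Return {(chrom, start, end, name): count} for entries seen more than once."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, start, end, name = fields[:4]
        counts[(chrom, int(start), int(end), name)] += 1
    return {entry: n for entry, n in counts.items() if n > 1}


# A few of the lines quoted above.
sample = """\
chr1    151778546   151780109   RORC-exon151778546-151780109
chr1    151778546   151780109   RORC-exon151778546-151780109
chr1    151783800   151783910   RORC-exon151783800-151783910
chr1    151783800   151783910   RORC-exon151783800-151783910
chr1    151788001   151788800   ENSR00000347081[CTCF_BINDING_SITE]
"""

duplicates = find_duplicate_bed_entries(sample.splitlines())
for (chrom, start, end, name), n in sorted(duplicates.items()):
    print(f"{chrom}:{start}-{end} {name} appears {n} times")
```

To run it on the real file, pass an open file handle instead of `sample.splitlines()`.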

hansenp commented 6 years ago

I loaded the file

test_reg_exome-regulatoryExomePanel.bed

as custom track in UCSC. Here is a screenshot for a viewpoint of the gene UBE2S and surrounding region (60 kbp):

image

It can be seen that

  1. There are regions that overlap with active fragments, i.e. target regions.
  2. The regulatory regions overlap with each other.

Is this (1. and/or 2.) the intended behaviour?

hansenp commented 6 years ago

Here is another example of a viewpoint for the gene RORC:

image

It seems that overlapping exon regions of alternative transcripts cause duplicated entries for the regulatory exome (left).

pnrobinson commented 6 years ago

We should filter out the duplicated entries, thanks for pointing out this bug.

hansenp commented 6 years ago

I created a separate issue (#124) for duplicated entries. But what about other regions that only partially overlap? For example, these:

image

If the exported regions in the BED file constitute the target regions, such regions should be combined into one region. Correct?
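
To make the proposal concrete, here is a rough sketch of the kind of merge meant here: sort the regions and fold each one into its predecessor when they overlap on the same chromosome. It is not GOPHER's implementation. The function name `merge_overlapping` and the coordinates in the example are hypothetical (the regions in the screenshot are not reproduced), and region names are dropped for simplicity.

```python
def merge_overlapping(regions):
    """Merge (chrom, start, end) regions that overlap on the same chromosome.

    With half-open BED coordinates, `start <= previous end` also merges
    regions that merely touch; use `<` instead to keep those separate.
    """
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            # Extend the previous region rather than adding a new one.
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical coordinates, chosen only to illustrate the behaviour.
regions = [
    ("chr1", 150, 300),
    ("chr1", 100, 200),
    ("chr1", 400, 500),
    ("chr5", 100, 200),
]
merged_regions = merge_overlapping(regions)
for chrom, start, end in merged_regions:
    print(chrom, start, end, sep="\t")
```

Exact duplicates collapse as a side effect, since a region always overlaps an identical copy of itself.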

pnrobinson commented 6 years ago

See issue #124; I have fixed the issue with duplicates...