Closed hansenp closed 6 years ago
The file test_reg_exome-regulatoryExomePanel.bed contains duplicated entries. The first lines look as follows:
chr1 151778546 151780109 RORC-exon151778546-151780109
chr1 151778546 151780109 RORC-exon151778546-151780109
chr1 151783800 151783910 RORC-exon151783800-151783910
chr1 151783800 151783910 RORC-exon151783800-151783910
chr1 151785422 151785533 RORC-exon151785422-151785533
chr1 151785422 151785533 RORC-exon151785422-151785533
chr1 151785714 151785822 RORC-exon151785714-151785822
chr1 151785714 151785822 RORC-exon151785714-151785822
chr1 151785963 151786096 RORC-exon151785963-151786096
chr1 151785963 151786096 RORC-exon151785963-151786096
chr1 151787049 151787171 RORC-exon151787049-151787171
chr1 151787049 151787171 RORC-exon151787049-151787171
chr1 151787388 151787901 RORC-exon151787388-151787901
chr1 151787388 151787901 RORC-exon151787388-151787901
chr1 151788001 151788800 ENSR00000347081[CTCF_BINDING_SITE]
chr1 151789139 151789281 RORC-exon151789139-151789281
chr1 151789139 151789281 RORC-exon151789139-151789281
chr1 151789670 151789756 RORC-exon151789670-151789756
chr1 151789670 151789756 RORC-exon151789670-151789756
Is this the intended behaviour?
I loaded the file test_reg_exome-regulatoryExomePanel.bed as a custom track in the UCSC Genome Browser. Here is a screenshot of a viewpoint for the gene UBE2S and the surrounding region (60 kbp):
It can be seen that
Is this (1. and/or 2.) the intended behaviour?
Here is another example of a viewpoint, this time for the gene RORC:
It seems that overlapping exon regions of alternative transcripts cause duplicated entries for the regulatory exome (left).
We should filter out the duplicated entries. Thanks for pointing out this bug.
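For illustration, filtering exact duplicates could look roughly like the sketch below. This is not the VPV implementation; the function name `drop_duplicate_bed_lines` is hypothetical, and the sketch assumes whitespace-separated BED lines without `track` or `browser` header lines. The example lines are taken from the file excerpt at the top of this issue.

```python
def drop_duplicate_bed_lines(lines):
    """Return BED lines with exact repeats removed, keeping first occurrences in order.

    Hypothetical helper for illustration; assumes every non-blank line has at
    least chrom, start and end, separated by whitespace.
    """
    seen = set()
    unique = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Two entries count as duplicates if chrom, start, end and all
        # remaining columns (e.g. the name) are identical.
        key = (fields[0], int(fields[1]), int(fields[2]), tuple(fields[3:]))
        if key in seen:
            continue
        seen.add(key)
        unique.append(line.rstrip("\n"))
    return unique


example = [
    "chr1 151778546 151780109 RORC-exon151778546-151780109",
    "chr1 151778546 151780109 RORC-exon151778546-151780109",
    "chr1 151783800 151783910 RORC-exon151783800-151783910",
    "chr1 151783800 151783910 RORC-exon151783800-151783910",
    "chr1 151788001 151788800 ENSR00000347081[CTCF_BINDING_SITE]",
]

for bed_line in drop_duplicate_bed_lines(example):
    print(bed_line)
```

Keeping the first occurrence preserves the original order of the file, so the output stays sorted if the input was sorted.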
I created a separate issue (#124) for duplicated entries. But what about other regions that only partially overlap? For example, these:
If the exported regions in the BED file constitute the target regions, such regions should be combined into one region. Correct?
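To make the suggestion concrete, here is a minimal sketch of how partially overlapping regions could be combined. It is only an illustration and not how VPV does it; the function name `merge_overlapping` and the coordinates in the example are made up. It assumes BED-style half-open intervals, so two regions overlap when the start of one is smaller than the end of the other.

```python
from collections import defaultdict


def merge_overlapping(regions):
    """Merge overlapping (or identical) intervals per chromosome.

    regions: iterable of (chrom, start, end, name) tuples.
    Returns a sorted list of (chrom, start, end, names), where names is a
    comma-separated string of the distinct names that were merged.
    Hypothetical helper for illustration only.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, name in regions:
        by_chrom[chrom].append((start, end, name))

    merged = []
    for chrom in sorted(by_chrom):
        current = None  # [chrom, start, end, [names]]
        for start, end, name in sorted(by_chrom[chrom]):
            # Strict overlap on half-open intervals; use <= here to also
            # merge regions that merely touch each other.
            if current is not None and start < current[2]:
                current[2] = max(current[2], end)
                if name not in current[3]:
                    current[3].append(name)
            else:
                if current is not None:
                    merged.append((current[0], current[1], current[2], ",".join(current[3])))
                current = [chrom, start, end, [name]]
        if current is not None:
            merged.append((current[0], current[1], current[2], ",".join(current[3])))
    return merged


# Made-up coordinates: one duplicate, one partial overlap, one adjacent region.
example_regions = [
    ("chr1", 100, 200, "a"),
    ("chr1", 150, 300, "b"),
    ("chr1", 300, 400, "c"),
    ("chr1", 100, 200, "a"),
]

for region in merge_overlapping(example_regions):
    print(*region)
```

A merge like this would also remove exact duplicates as a side effect, since identical intervals overlap each other.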
See issue #124. I have fixed the issue with duplicates...
In order to test the regulatory exome feature, I
Checked out the latest version of VPV (05813531ca8ec6abe640d344efbb5c0379b7b5e8).
Created a new project _test_regexome.
Created viewpoints for the gene symbols MFHAS1, CXCL16, ASNS, SLC1A3, ZPBP2, UBE2S, SYNJ2, MARCH6, TLR4, and RORC using the extended approach with the following parameters:
The file homo_sapiens.GRCh37.Regulatory_Build.regulatory_features.20161117.gff.gz was downloaded and saved to the directory _test_regexome. The file has 280027 lines, and each line looks as follows:
15 Regulatory_Build regulatory_region 102118789 102119129 . . . ID=ENSR00000368862;bound_end=102119230;bound_start=102118695;description=Transcription factor binding site;feature_type=TF binding site
The file test_reg_exome-regulatoryExomePanel.bed was created in _test_regexome. The file has 327 lines. The lines look, for example, as follows: