Open ktpolanski opened 1 year ago
I stumbled upon something related while comparing the single-sample and cross-sample CRE spaces, and didn't figure it merited opening a new issue.
The same sample as last time has two overlapping CREs that share strandedness: `chr15_64951513_64952453_-` is made up of `chr15_64951613_64951633_-;chr15_64952041_64952053_-` and is annotated as `gene_tss;unanno_tss`, while `chr15_64952143_64952644_-` is made up of `chr15_64952243_64952270_-` only and annotated as `unanno_tss`. The constituent clusters are not even 200bp apart, and the CREs overlap once fed through `annotate`. What is the rationale for keeping them separate?
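For anyone wanting to check the numbers above, here is a minimal sketch that parses the region names and measures the overlap and the gap. The helper names (`parse_name`, `overlap_bp`, `gap_bp`) are made up for illustration and are not part of the tool, and the coordinates are assumed to be BED-style half-open intervals.

```python
def parse_name(name):
    """Split 'chrom_start_end_strand' into typed fields.

    rsplit keeps chromosome names that contain underscores intact.
    """
    chrom, start, end, strand = name.rsplit("_", 3)
    return chrom, int(start), int(end), strand


def overlap_bp(a, b):
    """Number of shared bases between two named regions (0 if none)."""
    chrom_a, start_a, end_a, _ = parse_name(a)
    chrom_b, start_b, end_b, _ = parse_name(b)
    if chrom_a != chrom_b:
        return 0
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def gap_bp(a, b):
    """Distance between two regions on one chromosome (0 if they touch or overlap)."""
    chrom_a, start_a, end_a, _ = parse_name(a)
    chrom_b, start_b, end_b, _ = parse_name(b)
    if chrom_a != chrom_b:
        raise ValueError("regions are on different chromosomes")
    return max(0, max(start_a, start_b) - min(end_a, end_b))


# The two CREs and the two nearest constituent clusters from this comment.
cre_overlap = overlap_bp("chr15_64951513_64952453_-", "chr15_64952143_64952644_-")
cluster_gap = gap_bp("chr15_64952041_64952053_-", "chr15_64952243_64952270_-")
print(cre_overlap, cluster_gap)
```

Under those assumptions the two CREs share 310 bp, and the nearest clusters sit 190 bp apart.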
I mucked around with this a bit more, getting more practical insight into what may or may not lead clusters to be merged. I didn't drop a line here about it but just found myself wishing I had, so doing so now.
I stumbled upon two very overlapping CREs. The first was `.`-stranded and featured the following clusters:
`chr6_136288704_136288752_-;chr6_136288843_136288854_-;chr6_136288861_136288891_-;chr6_136288900_136288902_+;chr6_136289129_136289155_-;chr6_136289167_136289209_-`
The second was `+`-stranded and featured the following clusters:
`chr6_136289011_136289028_+;chr6_136289055_136289061_+;chr6_136289061_136289082_+;chr6_136289126_136289129_+;chr6_136289162_136289164_+;chr6_136289442_136289477_+;chr6_136289951_136289956_+;chr6_136289958_136290038_+;chr6_136290039_136290049_+;chr6_136290051_136290133_+`
The two cluster sets weave in and out of each other, with some being assigned to one CRE and some to the other. I compared the metadata for the CREs. They share:
- `typeStr` of `unanno_tss`
- `geneIDStr` of `ADDG06136290871.B` (an identifier I'm having trouble finding online!)
- `regionType` of `intron`
- `genePromoter` of `none`
In terms of differences:
- `proximity` of `distal` and `class` of `distal`
- `proximity` of `proximal` and `class` of `other`
As such, some metadata-based rule appears to be splitting clusters into CREs. I don't understand why the algorithm would allow them to interweave their coordinates like that. The proximal/distal annotation by itself is also somewhat puzzling. How can 136289129-136289155 and 136289167-136289209 be distal, yet 136289126-136289129 and 136289162-136289164 be proximal?
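To make the interweaving concrete, here is a rough sketch that sorts the clusters of both CREs by start coordinate and counts how often the CRE assignment flips along the chromosome. The `parse_name` helper and the `dot`/`plus` labels are mine, purely for illustration. Two cleanly separated cluster sets would flip at most once.

```python
def parse_name(name):
    """Split 'chrom_start_end_strand' into typed fields."""
    chrom, start, end, strand = name.rsplit("_", 3)
    return chrom, int(start), int(end), strand


# Cluster lists copied from the two CREs described above.
cre_dot = (
    "chr6_136288704_136288752_-;chr6_136288843_136288854_-;"
    "chr6_136288861_136288891_-;chr6_136288900_136288902_+;"
    "chr6_136289129_136289155_-;chr6_136289167_136289209_-"
).split(";")
cre_plus = (
    "chr6_136289011_136289028_+;chr6_136289055_136289061_+;"
    "chr6_136289061_136289082_+;chr6_136289126_136289129_+;"
    "chr6_136289162_136289164_+;chr6_136289442_136289477_+;"
    "chr6_136289951_136289956_+;chr6_136289958_136290038_+;"
    "chr6_136290039_136290049_+;chr6_136290051_136290133_+"
).split(";")

# Tag each cluster start with its CRE, then order by position.
labelled = [(parse_name(c)[1], "dot") for c in cre_dot]
labelled += [(parse_name(c)[1], "plus") for c in cre_plus]
labelled.sort()

order = [label for _, label in labelled]
# Count how many times neighbouring clusters belong to different CREs.
switches = sum(1 for prev, cur in zip(order, order[1:]) if prev != cur)
print(order)
print(switches)
```

For these 16 clusters the assignment flips 5 times, which is the weaving described above.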
I took a look at the directionality output, and spotted a CRE (`chr10_100185675_100186176_+`) which had ~200 negative strand reads and ~25 positive strand reads. Intrigued, I delved into the various intermediate files to investigate. Apparently two filtered clusters (`chr10_100185889_100186059_-` and `chr10_100186073_100186094_+`) fall within the annotated CRE's range, but rather than get merged they translate to separate, overlapping CREs (the other one becomes `chr10_100185928_100186429_-`).

My best guess as to why the two weren't merged is that all the `.`-oriented CREs in the output have a `typeStr` of `unanno_tss` and a `gene_promoter` of `none` or `unannotated`. Meanwhile `chr10_100185928_100186429_-` is `gene_tss` and `annotated` respectively. `chr10_100185675_100186176_+` is `unanno_tss` and `none`, but I presume the strandedness conflict is keeping it unmerged. I see instances of `gene_tss`es and `unanno_tss`es merged, but the strandedness agrees.

Nevertheless, the end result is a pair of overlapping CREs that then cause confusion in `directionality`. Would it make sense to somehow refine the `annotate` coordinate extension to prevent these sorts of situations from occurring? Or does the solution lie elsewhere?

Some other musings that may be useful to whoever's reading:
- `--ctss_scope_bed_path` made unidirectional a lot more prevalent. This obviously did nothing in the case discussed above, but still felt worth noting.
- The `count` function explicitly uses the `+`/`-`/`.` strandedness stored within the name (and BED) of the CRE. This whole thing was brought on by me wondering whether there's a way to propagate the directionality information to a per-cell level: filtering the `CB.ctss.bed` to `+`/`-` respectively, running individual `count` calls, and then finding the overlap in features captured to be the `.` CREs exclusively. As is, `count` and `directionality` tell you different things.
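The strand-filtering step in the last point could look roughly like the sketch below. It assumes `CB.ctss.bed` is tab-separated with the strand in the sixth column, as in standard BED6, which I have not confirmed against the actual file. `split_bed_by_strand` is a hypothetical helper and the example records are made up.

```python
def split_bed_by_strand(lines):
    """Group BED lines by the strand found in column 6.

    Assumes tab-separated BED6-like records; header lines, blank lines and
    records with any other strand value are skipped.
    """
    by_strand = {"+": [], "-": []}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 6 and fields[5] in by_strand:
            by_strand[fields[5]].append(line)
    return by_strand


# Made-up records, purely to exercise the function.
toy_lines = [
    "chr10\t100185900\t100185901\tCELL_A\t1\t-\n",
    "chr10\t100186080\t100186081\tCELL_A\t1\t+\n",
    "chr10\t100185950\t100185951\tCELL_B\t1\t-\n",
]
split = split_bed_by_strand(toy_lines)
print(len(split["+"]), len(split["-"]))
```

With a real file, the two lists would be written out as separate BED files and each passed to its own `count` call.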