Two very nearby clusters don't get merged, confusion ensues in `directionality`

I took a look at the directionality output, and spotted a CRE (chr10_100185675_100186176_+) which had ~200 negative strand reads and ~25 positive strand reads. Intrigued, I delved into the various intermediate files to investigate. Apparently two filtered clusters (chr10_100185889_100186059_- and chr10_100186073_100186094_+) fall within the annotated CRE's range, but rather than get merged they translate to separate, overlapping CREs (the other one becomes chr10_100185928_100186429_-).
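To make the overlap concrete, here is a minimal sketch that parses the `chrom_start_end_strand` names quoted above and counts the bases the two CREs share. The helper names (`Region`, `parse_name`, `overlap_bp`) are made up for illustration and are not part of SCAFE, and I'm assuming the coordinates in the names are BED-style half-open intervals.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_name(name: str) -> Region:
    """Split a 'chrom_start_end_strand' name (hypothetical helper).

    rsplit from the right keeps chromosome names that contain underscores intact.
    """
    chrom, start, end, strand = name.rsplit("_", 3)
    return Region(chrom, int(start), int(end), strand)


def overlap_bp(a: Region, b: Region) -> int:
    """Shared bases, ignoring strand; 0 if disjoint or on different chromosomes."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


cre_plus = parse_name("chr10_100185675_100186176_+")
cre_minus = parse_name("chr10_100185928_100186429_-")
print(overlap_bp(cre_plus, cre_minus))  # 248
```

Both CREs are 501 bp long, so under these assumptions they share roughly half their span.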

My best guess as to why the two weren't merged is that all the .-oriented CREs in the output have a typeStr of unanno_tss and a gene_promoter of none or unannotated. Meanwhile, chr10_100185928_100186429_- is gene_tss and annotated, respectively. chr10_100185675_100186176_+ is unanno_tss and none, but I presume the strandedness conflict is keeping it unmerged. I do see instances of gene_tss and unanno_tss clusters being merged, but there the strandedness agrees.

Nevertheless, the end result is a pair of overlapping CREs that then cause confusion in directionality. Would it make sense to refine the coordinate extension in annotate to keep this sort of situation from occurring? Or does the solution lie elsewhere?

Some other musings that may be useful to whoever's reading:

I had a ton of CREs with very small scores as a result of the default directionality performed in the 10X workflow. Passing the filtered cluster file as --ctss_scope_bed_path made unidirectional CREs a lot more prevalent. This obviously did nothing in the case discussed above, but still felt worth noting.
The count function explicitly uses the +/-/. strandedness stored within the name (and BED) of the CRE. This whole thing was brought on by me wondering whether there's a way to propagate the directionality information to a per-cell level: I filtered the CB.ctss.bed to + and - respectively, ran individual count calls, and found that the features captured by both runs were exclusively the . CREs. As is, count and directionality tell you different things.
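For anyone wanting to repeat the strand split, this is roughly what I mean, as a minimal sketch. It assumes the CB.ctss.bed is tab-delimited with the strand in the sixth column (standard BED6 layout); I haven't checked that against the SCAFE docs, the function name is made up, and the records below are toy data rather than real output.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_bed_by_strand(lines: Iterable[str], strand_col: int = 5) -> Dict[str, List[str]]:
    """Group BED records by the value in the strand column (0-based index).

    Assumption: tab-delimited BED with strand in column 6, as in standard BED6.
    """
    by_strand: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the header lines some BED files carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        by_strand[fields[strand_col]].append(line)
    return dict(by_strand)


# Toy records for illustration only; the name and score columns are placeholders.
toy = [
    "chr1\t100\t101\tAAAC-1\t3\t+",
    "chr1\t150\t151\tAAAC-1\t1\t-",
    "chr1\t200\t201\tAAAG-1\t2\t+",
]
split = split_bed_by_strand(toy)
print({strand: len(records) for strand, records in split.items()})  # {'+': 2, '-': 1}
```

Each group can then be written to its own file and passed to a separate count run.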

I stumbled upon something related while comparing the single-sample and cross-sample CRE spaces, and didn't figure it merited opening a new issue.

The same sample as last time has two overlapping CREs that share strandedness - chr15_64951513_64952453_- is made up of chr15_64951613_64951633_-;chr15_64952041_64952053_- and is annotated as gene_tss;unanno_tss, while chr15_64952143_64952644_- is made up of chr15_64952243_64952270_- only and annotated as unanno_tss. The constituent clusters are not even 200bp apart, and the CREs overlap once fed through annotate. What is the rationale for keeping them separate?
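The distances are easy to check from the names alone. A minimal sketch, assuming the names encode half-open `chrom_start_end_strand` intervals; the helpers are hypothetical and not SCAFE functions.

```python
def parse_name(name):
    """Split a 'chrom_start_end_strand' name into (chrom, start, end, strand)."""
    chrom, start, end, strand = name.rsplit("_", 3)
    return chrom, int(start), int(end), strand


def signed_gap(a, b):
    """Bases between two intervals on one chromosome.

    A negative value means the intervals overlap by that many bases.
    """
    chrom_a, start_a, end_a, _ = parse_name(a)
    chrom_b, start_b, end_b, _ = parse_name(b)
    if chrom_a != chrom_b:
        raise ValueError("intervals are on different chromosomes")
    return max(start_a, start_b) - min(end_a, end_b)


# The two clusters merged into the first CRE.
print(signed_gap("chr15_64951613_64951633_-", "chr15_64952041_64952053_-"))  # 408
# Nearest cluster of the first CRE vs. the lone cluster of the second CRE.
print(signed_gap("chr15_64952041_64952053_-", "chr15_64952243_64952270_-"))  # 190
# The two CREs after annotate.
print(signed_gap("chr15_64951513_64952453_-", "chr15_64952143_64952644_-"))  # -310
```

Under those assumptions, the two clusters that were merged sit 408 bp apart, while the cluster left out is only 190 bp from its nearest neighbour, so distance alone does not seem to explain the split.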

I mucked around with this a bit more, getting more practical insight into what may or may not lead clusters to be merged. I didn't drop a line here about it at the time, but just found myself wishing I had, so I'm doing so now.

I stumbled upon two heavily overlapping CREs. The first was .-stranded and featured the following clusters:

chr6_136288704_136288752_-;chr6_136288843_136288854_-;chr6_136288861_136288891_-;chr6_136288900_136288902_+;chr6_136289129_136289155_-;chr6_136289167_136289209_-

The second was +-stranded and featured the following clusters:

chr6_136289011_136289028_+;chr6_136289055_136289061_+;chr6_136289061_136289082_+;chr6_136289126_136289129_+;chr6_136289162_136289164_+;chr6_136289442_136289477_+;chr6_136289951_136289956_+;chr6_136289958_136290038_+;chr6_136290039_136290049_+;chr6_136290051_136290133_+

The two cluster sets weave in and out of each other, with some being assigned to one CRE and some to the other. I compared the metadata for the CREs. They share:

a typeStr of unanno_tss
a geneIDStr of ADDG06136290871.B (an identifier I'm having trouble finding online!)
a regionType of intron
a genePromoter of none

In terms of differences:

the first has proximity of distal and class of distal
the second has proximity of proximal and class of other

As such, there is some metadata-based rule splitting clusters into CREs. I don't understand why the algorithm would allow them to interweave their coordinates like that. The proximal/distal annotation by itself is also somewhat puzzling. How can 136289129-136289155 and 136289167-136289209 be distal, yet 136289126-136289129 and 136289162-136289164 be proximal?
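The interweaving is easiest to see by pooling the two cluster lists and sorting them by start coordinate. A minimal sketch using only the cluster names quoted above; the labels A (the .-stranded CRE) and B (the +-stranded CRE) are just for illustration.

```python
first = (
    "chr6_136288704_136288752_-;chr6_136288843_136288854_-;"
    "chr6_136288861_136288891_-;chr6_136288900_136288902_+;"
    "chr6_136289129_136289155_-;chr6_136289167_136289209_-"
).split(";")
second = (
    "chr6_136289011_136289028_+;chr6_136289055_136289061_+;"
    "chr6_136289061_136289082_+;chr6_136289126_136289129_+;"
    "chr6_136289162_136289164_+;chr6_136289442_136289477_+;"
    "chr6_136289951_136289956_+;chr6_136289958_136290038_+;"
    "chr6_136290039_136290049_+;chr6_136290051_136290133_+"
).split(";")


def start_of(name):
    """Start coordinate from a 'chrom_start_end_strand' name."""
    return int(name.rsplit("_", 3)[1])


# Tag every cluster with the CRE it was assigned to, then sort by position.
tagged = [(start_of(n), "A") for n in first] + [(start_of(n), "B") for n in second]
order = "".join(label for _, label in sorted(tagged))

# Count how often the CRE assignment flips while walking along the chromosome.
switches = sum(1 for x, y in zip(order, order[1:]) if x != y)

print(order)     # AAAABBBBABABBBBB
print(switches)  # 5
```

Two cleanly separated CREs would give a single switch; here the assignment flips five times within a few hundred bp.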

chung-lab / SCAFE #26