chung-lab / SCAFE

Single Cell Analysis of Five'Ends
MIT License

Two very nearby clusters don't get merged, confusion ensues in `directionality` #26

Open ktpolanski opened 1 year ago

ktpolanski commented 1 year ago

I took a look at the directionality output, and spotted a CRE (chr10_100185675_100186176_+) which had ~200 negative strand reads and ~25 positive strand reads. Intrigued, I delved into the various intermediate files to investigate. Apparently two filtered clusters (chr10_100185889_100186059_- and chr10_100186073_100186094_+) fall within the annotated CRE's range, but rather than get merged they translate to separate, overlapping CREs (the other one becomes chr10_100185928_100186429_-).

My best guess as to why the two weren't merged is that all the `.`-oriented CREs in the output have a typeStr of unanno_tss and a gene_promoter of none or unannotated. Meanwhile, chr10_100185928_100186429_- is gene_tss and annotated respectively. chr10_100185675_100186176_+ is unanno_tss and none, but I presume the strandedness conflict is keeping it unmerged. I do see instances of gene_tss and unanno_tss clusters being merged, but in those the strandedness agrees.

Nevertheless, the end result is a pair of overlapping CREs that then cause confusion in `directionality`. Would it make sense to somehow refine the coordinate extension in annotate to prevent these sorts of situations from occurring? Or does the solution lie elsewhere?
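For anyone wanting to check their own output for this kind of case, here is a minimal sketch that parses the `chrom_start_end_strand` IDs quoted above and measures how far the two CREs overlap. It is not part of SCAFE; the helper names `parse_id` and `overlap_bp` are made up for illustration, and it assumes the coordinates in the IDs are BED-style half-open intervals.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


def parse_id(region_id: str) -> Interval:
    """Split a 'chrom_start_end_strand' ID (hypothetical helper, not SCAFE code)."""
    # Split from the right so chromosome names containing underscores survive.
    chrom, start, end, strand = region_id.rsplit("_", 3)
    return Interval(chrom, int(start), int(end), strand)


def overlap_bp(a: Interval, b: Interval, same_strand_only: bool = False) -> int:
    """Overlap length in bp, assuming half-open coordinates (an assumption)."""
    if a.chrom != b.chrom:
        return 0
    if same_strand_only and a.strand != b.strand:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


plus_cre = parse_id("chr10_100185675_100186176_+")
minus_cre = parse_id("chr10_100185928_100186429_-")

# Ignoring strand, the two CREs share 248 bp.
print(overlap_bp(plus_cre, minus_cre))
# A strand-aware comparison reports no overlap at all.
print(overlap_bp(plus_cre, minus_cre, same_strand_only=True))
```

The two calls differ only in whether strand is taken into account, which is the distinction that seems to matter for whether the CREs end up merged.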

Some other musings that may be useful to whoever's reading:

ktpolanski commented 1 year ago

I stumbled upon something related while comparing the single sample and cross-sample CRE spaces, didn't figure it merited opening a new issue.

The same sample as last time has two overlapping CREs that share strandedness - chr15_64951513_64952453_- is made up of chr15_64951613_64951633_-;chr15_64952041_64952053_- and is annotated as gene_tss;unanno_tss, while chr15_64952143_64952644_- is made up of chr15_64952243_64952270_- only and annotated as unanno_tss. The constituent clusters are not even 200bp apart, and the CREs overlap once fed through annotate. What is the rationale for keeping them separate?
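To put numbers on "not even 200bp apart", a small sketch along these lines computes the gaps between the constituent clusters and the overlap between the two resulting CREs. The `parse_id` helper is hypothetical, and the arithmetic assumes half-open coordinates; with closed coordinates the figures shift by one.

```python
def parse_id(region_id):
    """Split a 'chrom_start_end_strand' ID (hypothetical helper, not SCAFE code)."""
    chrom, start, end, strand = region_id.rsplit("_", 3)
    return chrom, int(start), int(end), strand


cluster_ids = [
    "chr15_64951613_64951633_-",
    "chr15_64952041_64952053_-",
    "chr15_64952243_64952270_-",
]
clusters = sorted((parse_id(c) for c in cluster_ids), key=lambda c: c[1])

# Gap between the end of each cluster and the start of the next one.
gaps = [nxt[1] - cur[2] for cur, nxt in zip(clusters, clusters[1:])]
print(gaps)

# Overlap between the two CREs that these clusters end up in.
cre_a = parse_id("chr15_64951513_64952453_-")
cre_b = parse_id("chr15_64952143_64952644_-")
cre_overlap = max(0, min(cre_a[2], cre_b[2]) - max(cre_a[1], cre_b[1]))
print(cre_overlap)
```

Under those assumptions, the first two clusters (merged into one CRE) sit 408 bp apart, while the third (kept separate) is only 190 bp from the second, and the two CREs overlap by 310 bp.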

ktpolanski commented 1 year ago

I mucked around with this a bit more, getting more practical insight into what may or may not lead clusters to be merged. I didn't drop a line here about it but just found myself wishing I had, so doing so now.

I stumbled upon two heavily overlapping CREs. The first was `.`-stranded and featured the following clusters:

chr6_136288704_136288752_-;chr6_136288843_136288854_-;chr6_136288861_136288891_-;chr6_136288900_136288902_+;chr6_136289129_136289155_-;chr6_136289167_136289209_-

The second was `+`-stranded and featured the following clusters:

chr6_136289011_136289028_+;chr6_136289055_136289061_+;chr6_136289061_136289082_+;chr6_136289126_136289129_+;chr6_136289162_136289164_+;chr6_136289442_136289477_+;chr6_136289951_136289956_+;chr6_136289958_136290038_+;chr6_136290039_136290049_+;chr6_136290051_136290133_+

The two cluster sets weave in and out of each other, with some being assigned to one CRE and some to the other. I compared the metadata for the CREs. They share:

In terms of differences:

As such, there is some metadata-based rule splitting clusters into CREs. I don't understand why the algorithm would allow them to interweave their coordinates like that. The proximal/distal annotation by itself is also somewhat puzzling. How can 136289129-136289155 and 136289167-136289209 be distal, yet 136289126-136289129 and 136289162-136289164 be proximal?
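The interweaving can be made concrete with a short sketch: sort all the clusters from both CREs by coordinate, label each with the CRE it was assigned to, and count how often the label changes along the chromosome. Two CREs built from cleanly separated cluster sets would switch exactly once. The `parse_id` helper and the variable names are hypothetical, not SCAFE code.

```python
def parse_id(region_id):
    """Split a 'chrom_start_end_strand' ID (hypothetical helper, not SCAFE code)."""
    chrom, start, end, strand = region_id.rsplit("_", 3)
    return chrom, int(start), int(end), strand


dot_cre_clusters = (
    "chr6_136288704_136288752_-;chr6_136288843_136288854_-;"
    "chr6_136288861_136288891_-;chr6_136288900_136288902_+;"
    "chr6_136289129_136289155_-;chr6_136289167_136289209_-"
).split(";")

plus_cre_clusters = (
    "chr6_136289011_136289028_+;chr6_136289055_136289061_+;"
    "chr6_136289061_136289082_+;chr6_136289126_136289129_+;"
    "chr6_136289162_136289164_+;chr6_136289442_136289477_+;"
    "chr6_136289951_136289956_+;chr6_136289958_136290038_+;"
    "chr6_136290039_136290049_+;chr6_136290051_136290133_+"
).split(";")

# Label each cluster with the strand of the CRE it was assigned to.
labelled = [(parse_id(c), ".") for c in dot_cre_clusters]
labelled += [(parse_id(c), "+") for c in plus_cre_clusters]

# Walk along the chromosome in (start, end) order.
labelled.sort(key=lambda item: (item[0][1], item[0][2]))
order = "".join(label for _, label in labelled)

# Number of times consecutive clusters belong to different CREs.
switches = sum(a != b for a, b in zip(order, order[1:]))
print(order)
print(switches)
```

For the clusters listed above, the assignment changes five times along the locus rather than once.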