fmicompbio / monaLisa

binned motif enrichment analysis and visualisation
https://fmicompbio.github.io/monaLisa/
GNU General Public License v3.0

cluster (enriched) k-mers #30

Closed mbstadler closed 4 years ago

mbstadler commented 4 years ago

Using the return value from getKmerFreq, cluster k-mers into groups of overlapping or similar k-mers (possibly the same motif?), and find the closest known weight matrix.

It would be useful to have an option to either ignore the reverse complement (e.g. for RNA motifs) or include it (e.g. for DNA motifs).
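To make the reverse-complement option concrete, here is a minimal sketch in Python (not monaLisa's R code). The helper names `reverse_complement` and `canonical_kmer` and the `include_revcomp` flag are hypothetical and only illustrate one way such a switch could behave: when the flag is on, a k-mer and its reverse complement collapse to a single representative.

```python
# Hypothetical helpers for illustration only; not part of monaLisa.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of a DNA k-mer (A/C/G/T only)."""
    return kmer.translate(_COMPLEMENT)[::-1]


def canonical_kmer(kmer: str, include_revcomp: bool = True) -> str:
    """Pick one representative for a k-mer.

    include_revcomp=True  : treat a k-mer and its reverse complement as the
                            same thing (e.g. DNA motifs) by returning the
                            lexicographically smaller of the two.
    include_revcomp=False : keep strands separate (e.g. RNA motifs).
    """
    if not include_revcomp:
        return kmer
    return min(kmer, reverse_complement(kmer))
```

With the flag on, `canonical_kmer("CGTT")` and `canonical_kmer("AACG")` give the same representative, so both would be counted and clustered as one k-mer.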

Functionality from the BioC package motifRG may be useful: it implements ways to identify enriched k-mers in a foreground sequence set versus a background set. It is flagged for deprecation from BioC 3.11, but I think there may be some things in there that could be interesting for us (e.g. how to combine/extend k-mers into longer motifs; see motifRG::findMotif and motifRG::refinePWMMotif).

Idea: produce a local alignment of k-mers, get a distance matrix, and use that for graph-based clustering: nodes are k-mers, edges connect similar k-mers, and clusters are communities.
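The graph-based idea could be sketched roughly as follows, in Python rather than R. This is an illustration, not the implementation in the branch. It makes two simplifying assumptions: similarity is decided by an ungapped shifted comparison (standing in for a real local alignment), and clusters are taken as connected components (standing in for proper community detection). The function names and the `min_overlap` and `max_mismatch` parameters are hypothetical.

```python
from itertools import combinations


def ungapped_similar(a, b, min_overlap=4, max_mismatch=0):
    """True if some ungapped offset of b against a overlaps by at least
    min_overlap positions with at most max_mismatch mismatches.
    (Simplified stand-in for a local alignment score.)"""
    for shift in range(-(len(b) - min_overlap), len(a) - min_overlap + 1):
        start_a = max(shift, 0)    # where the overlap starts in a
        start_b = max(-shift, 0)   # where the overlap starts in b
        overlap = min(len(a) - start_a, len(b) - start_b)
        if overlap < min_overlap:
            continue
        seg_a = a[start_a:start_a + overlap]
        seg_b = b[start_b:start_b + overlap]
        if sum(x != y for x, y in zip(seg_a, seg_b)) <= max_mismatch:
            return True
    return False


def cluster_kmers(kmers, min_overlap=4, max_mismatch=0):
    """Build a graph (nodes: k-mers, edges: similar pairs) and return its
    connected components as clusters (assumption: components are used here
    instead of community detection)."""
    neighbours = {k: set() for k in kmers}
    for a, b in combinations(neighbours, 2):
        if ungapped_similar(a, b, min_overlap, max_mismatch):
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for k in neighbours:
        if k in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [k], []
        seen.add(k)
        while stack:
            node = stack.pop()
            component.append(node)
            for n in neighbours[node]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        clusters.append(sorted(component))
    return clusters
```

For example, `cluster_kmers(["ACGTA", "CGTAC", "GTACG", "TTTTT", "TTTTG"])` groups the three shifted k-mers together even though "ACGTA" and "GTACG" are not directly similar under these settings; they are linked through "CGTAC". That chaining is also the main weakness of connected components, and a reason community detection may generalize better on dense graphs.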

mbstadler commented 4 years ago

I implemented a first version (branch cluster_kmers). It seems to do ok on a couple of synthetic examples. For the moment, I will not yet merge it because I would like to incubate it some more and try it on some real data. Of course you can try it out already from the above branch - any feedback is appreciated.

mbstadler commented 4 years ago

I have done some further testing, though the algorithm is still only a draft, and I think it will require further tweaking to generalize well. I will still merge it (PR #33) so that it is easier for you to also test.