dieterich-lab / scimodom

GNU Affero General Public License v3.0

Implement server-side Compare view #37

eboileau closed this issue 7 months ago

eboileau commented 8 months ago

Aims/objectives.

The Compare view is supposed to allow users to "compare" datasets, e.g. select one or more reference datasets and find the intersection, difference, etc. with one or more other datasets (incl. user uploads). Doing this with MySQL operations is not impossible, but it is difficult and limited. If we eventually consider e.g. intersections between single sites and intervals, then I wouldn't know how to do this.

We could use pybedtools, e.g. (i) convert DB query records to a BedTool on the fly, (ii) perform the operations, (iii) convert back to a JSON-like format for transfer, (iv) cache results for lazy loading, (v) send "lazy results". However, we will need to be careful about how pybedtools writes tmp files.
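Steps (i) and (iii) above amount to a round trip between query records and BED text. A minimal sketch, assuming BED6-style records held as plain dicts (the field names, the sample rows, and the helpers `records_to_bed` and `bed_to_records` are hypothetical, not taken from the scimodom code base):

```python
import json

# Assumed BED6 column order; the real Data model may expose different names.
BED_FIELDS = ("chrom", "start", "end", "name", "score", "strand")


def records_to_bed(records):
    """Step (i): serialize dict records as tab-separated BED6 text."""
    lines = ["\t".join(str(rec[f]) for f in BED_FIELDS) for rec in records]
    return "\n".join(lines) + "\n" if lines else ""


def bed_to_records(bed_text):
    """Step (iii): parse BED6 text back into JSON-serializable dicts."""
    records = []
    for line in bed_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        chrom, start, end, name, score, strand = line.split("\t")
        records.append(
            {
                "chrom": chrom,
                "start": int(start),
                "end": int(end),
                "name": name,
                "score": int(score),
                "strand": strand,
            }
        )
    return records


# Hypothetical rows standing in for DB query results.
rows = [
    {"chrom": "1", "start": 100, "end": 101, "name": "m6A", "score": 5, "strand": "+"},
    {"chrom": "2", "start": 250, "end": 251, "name": "m5C", "score": 12, "strand": "-"},
]

bed_text = records_to_bed(rows)
# Step (ii) would operate on this text, for instance via
# pybedtools.BedTool(bed_text, from_string=True), before converting back.
payload = json.dumps(bed_to_records(bed_text))
```

Keeping the conversion in two small functions is one way to limit the serialization boilerplate; the bedtools operation itself is left out here because it needs the bedtools binary on the server.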

Using pybedtools would somewhat simplify #35 and allow more complex operations such as window, closest, etc., but it will nevertheless require some time to implement. This also means that we have to implement a Redis cache (but this could also serve as a template for caching Search view records).

A clear and concise description of todo items.

We also need to fix how we "serialize" the results, to avoid too much boilerplate.

eboileau commented 8 months ago

In fact, for many operations we wouldn't need bedtools, e.g. intersection

SELECT chrom, start, end, strand FROM data WHERE dataset_id = "m6scxaP6zQUS" INTERSECT SELECT chrom, start, end, strand FROM data WHERE dataset_id = "m9oqVKVmztwj";

if we were only looking for "exact matches", but when dealing with both site-specific and interval data, e.g. 65nSnjiT8Uue, there is no straightforward way of doing this. Besides, I think bedtools might be faster.
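The difference between exact matches and overlaps can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the table contents and dataset IDs (`siteA`, `siteB`, `intervalC`) are made up for illustration. The range join shown is one possible SQL formulation of an overlap; whether it would perform acceptably on the real tables is an open question that this sketch does not answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "start" and "end" are quoted because END is an SQL keyword.
conn.execute(
    'CREATE TABLE data ('
    'dataset_id TEXT, chrom TEXT, "start" INTEGER, "end" INTEGER, strand TEXT)'
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?, ?)",
    [
        ("siteA", "1", 100, 101, "+"),      # single site
        ("siteA", "1", 500, 501, "+"),      # single site
        ("siteB", "1", 100, 101, "+"),      # same site as in siteA
        ("intervalC", "1", 450, 550, "+"),  # interval covering site 500
    ],
)

EXACT = """
SELECT chrom, "start", "end", strand FROM data WHERE dataset_id = ?
INTERSECT
SELECT chrom, "start", "end", strand FROM data WHERE dataset_id = ?
"""

# Exact-match intersection works between two site-level datasets...
exact_sites = conn.execute(EXACT, ("siteA", "siteB")).fetchall()
# ...but finds nothing between sites and an interval that contains one.
exact_interval = conn.execute(EXACT, ("siteA", "intervalC")).fetchall()

# Half-open overlap test: a.start < b.end AND b.start < a.end.
OVERLAP = """
SELECT a.chrom, a."start", a."end", a.strand
FROM data AS a
JOIN data AS b
  ON a.chrom = b.chrom
 AND a.strand = b.strand
 AND a."start" < b."end"
 AND b."start" < a."end"
WHERE a.dataset_id = ? AND b.dataset_id = ?
"""
overlap = conn.execute(OVERLAP, ("siteA", "intervalC")).fetchall()
```

Here `exact_interval` is empty even though site 500 lies inside the interval, while the range join reports it; this is the kind of case that interval-aware tools such as bedtools handle directly.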

The same goes for sorting: I think bedtools is faster than e.g. .order_by(Data.chrom.asc(), Data.start.asc()).
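For reference, the ordering requested by that `order_by` call (chromosome ascending, then start ascending) can be reproduced on already-fetched records as sketched below. The records are hypothetical, `chrom` is assumed to be stored as a string, and the sketch says nothing about the relative speed of bedtools and the database.

```python
from operator import itemgetter

# Hypothetical records; chrom is a string, so "10" sorts before "2".
records = [
    {"chrom": "2", "start": 50},
    {"chrom": "10", "start": 7},
    {"chrom": "1", "start": 300},
    {"chrom": "1", "start": 20},
]

# Sort by chrom first, then by start within each chrom.
by_position = sorted(records, key=itemgetter("chrom", "start"))
order = [(r["chrom"], r["start"]) for r in by_position]
```

Note that string chromosome names sort lexicographically, so a natural order (1, 2, ..., 10) would need an explicit key.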

eboileau commented 7 months ago

Instead of lazy loading + caching, we now use virtual scroll with pre-loading. Pre-loading takes a few seconds at most for larger operations, and sorting takes a few milliseconds. I think this is acceptable as a short- to medium-term solution; implementing the back-end (BE) logic for lazy loading would take some time. Besides, pybedtools is quite fast, so performance is not an issue, at least for now, and I'm not sure we could beat it with caching + wrangling + BE sorting.