dieterich-lab / scimodom

GNU Affero General Public License v3.0

Data upload, Search, and Compare design (from data model), and ID assignment #13

Closed eboileau closed 11 months ago

eboileau commented 1 year ago

Aims/objectives.

The Search view will let users search RNA modifications, filtering by (i) RNA modification type (e.g. m6A, m5C, ...), (ii) technology (Quantification, Locus-specific, NGS 2nd generation, NGS 3rd generation), (iii) organism, (iv) region (3'UTR, CDS, ...), and eventually (v) gene id/name or genomic coordinates.

This is doable in principle, but so far I haven't worked out the details (e.g. how to implement dynamic dependent dropdowns), and the whole database/data model still needs to be implemented, i.e. the open question is how to combine the filters to query the DB.
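
A minimal sketch of one way the filters could be combined, assuming a single flat `records` table (hypothetical; the actual data model is not defined yet). Each filter that is set becomes one `AND` clause of a parametrized query, and unset filters are skipped. The table, column names, and example rows are illustrative only.

```python
import sqlite3

# Hypothetical filter columns; the real schema is still to be designed.
ALLOWED_FILTERS = ("modification", "technology", "organism", "region")


def build_search_query(filters):
    """Return (sql, params) using only the filters that are actually set."""
    clauses, params = [], []
    for column in ALLOWED_FILTERS:  # fixed whitelist, never user-supplied names
        value = filters.get(column)
        if value is not None:
            clauses.append(f"{column} = ?")
            params.append(value)
    sql = "SELECT chrom, chrom_start, chrom_end, modification FROM records"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Small in-memory database with illustrative rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (chrom TEXT, chrom_start INTEGER, chrom_end INTEGER, "
    "modification TEXT, technology TEXT, organism TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("1", 100, 101, "m6A", "m6A-SAC-seq", "Homo sapiens", "3'UTR"),
        ("2", 300, 301, "m6A", "DART-seq", "Mus musculus", "CDS"),
        ("1", 200, 201, "m5C", "bisulfite sequencing", "Homo sapiens", "CDS"),
    ],
)

sql, params = build_search_query({"modification": "m6A", "organism": "Homo sapiens"})
rows = conn.execute(sql, params).fetchall()
```

Values are always passed as bound parameters; only the whitelisted column names are interpolated into the SQL string.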

The classification of detection technologies is taken from this paper, e.g. NGS 2nd generation is subclassified into direct sequencing, chemical-assisted sequencing (m6A-SAC-seq, RiboMeth-seq, ...), antibody-based (m6A-seq/MeRIP, ...), and enzyme/protein-assisted (DART-seq, MAZTER-seq, ...).

A clear and concise description of todo items.

First questions related to the design of the Search view are

eboileau commented 1 year ago

We also need to keep in mind that users should be able to upload their own data and compare datasets (Compare view), and that we don't want, e.g., to intersect regions from datasets that are based on different assemblies.
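
A minimal sketch of such a guard for the Compare view, assuming each dataset records its organism and assembly (the `Dataset` class and its fields are hypothetical, not part of the current code): the comparison is refused unless all selected datasets share the same organism and assembly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    # Hypothetical fields; the real tables may store these differently.
    dataset_id: str
    organism: str
    assembly: str


def check_comparable(datasets):
    """Raise ValueError if the datasets do not share one organism/assembly."""
    references = {(d.organism, d.assembly) for d in datasets}
    if len(references) > 1:
        found = ", ".join(f"{org} ({asm})" for org, asm in sorted(references))
        raise ValueError(f"Cannot compare datasets from different assemblies: {found}")


same = [
    Dataset("d1", "Homo sapiens", "GRCh38"),
    Dataset("d2", "Homo sapiens", "GRCh38"),
]
check_comparable(same)  # passes silently
```

Calling `check_comparable` before any interval operation keeps the assembly check independent of how the intersection itself is computed.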

eboileau commented 1 year ago

This is the current proposal:

To enforce a standard nomenclature for the modifications, technologies, etc., some DB tables (modification, technology, source, species) must be "fixed", with regular updates. Entries in these tables define the options available for data upload and for the Search and Compare filters.
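
A minimal sketch of how entries in the fixed tables could drive a dependent dropdown, assuming a linking table (here called `selection`, a hypothetical name) that records which technology is available for which modification. The schema and rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE modification (id INTEGER PRIMARY KEY, short_name TEXT UNIQUE NOT NULL);
    CREATE TABLE technology (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
    -- Hypothetical linking table: valid modification/technology pairs.
    CREATE TABLE selection (
        modification_id INTEGER NOT NULL REFERENCES modification(id),
        technology_id INTEGER NOT NULL REFERENCES technology(id),
        UNIQUE (modification_id, technology_id)
    );
    INSERT INTO modification (id, short_name) VALUES (1, 'm6A'), (2, 'm5C');
    INSERT INTO technology (id, name)
        VALUES (1, 'm6A-SAC-seq'), (2, 'DART-seq'), (3, 'bisulfite sequencing');
    INSERT INTO selection VALUES (1, 1), (1, 2), (2, 3);
    """
)


def technology_options(conn, modification):
    """Return the technology names offered once a modification is selected."""
    rows = conn.execute(
        """
        SELECT t.name
        FROM technology AS t
        JOIN selection AS s ON s.technology_id = t.id
        JOIN modification AS m ON m.id = s.modification_id
        WHERE m.short_name = ?
        ORDER BY t.name
        """,
        (modification,),
    ).fetchall()
    return [name for (name,) in rows]


options = technology_options(conn, "m6A")
```

The frontend would call something like this each time the first dropdown changes, so the second dropdown only lists combinations that exist in the DB.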

SMID creation should be done via requests to Sci-ModoM maintainers, with additional information on RNA type/modification, technology used, organism, tissue, etc., which then allows us to update the selected tables with new modifications, technologies, organisms, etc.

Once the SMID is created, in principle, we could allow data upload. A given SMID can have one or more datasets (corresponding to EUF/bedMod files).

The EUFID is assigned at upload, but we need to make sure the id is immediately available to fill additional tables.
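
A minimal sketch of one way to make the id available immediately, assuming the EUFID is generated in the application before the insert rather than by the DB. The id format (12 hex characters), the table names `project`, `dataset`, `data`, and their columns are all assumptions for illustration. The dataset row and the dependent rows are written in one transaction.

```python
import secrets
import sqlite3


def new_id(length=12):
    # Hypothetical id format: random hex string of the given (even) length.
    return secrets.token_hex(length // 2)


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce SMID -> dataset relation
conn.executescript(
    """
    CREATE TABLE project (smid TEXT PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE dataset (
        eufid TEXT PRIMARY KEY,
        smid TEXT NOT NULL REFERENCES project(smid),
        title TEXT NOT NULL
    );
    CREATE TABLE data (
        eufid TEXT NOT NULL REFERENCES dataset(eufid),
        chrom TEXT, chrom_start INTEGER, chrom_end INTEGER
    );
    """
)


def upload_dataset(conn, smid, title, records):
    """Insert one dataset and its records; return the new EUFID."""
    eufid = new_id()  # known before any row is written
    with conn:  # commit on success, roll back on error
        conn.execute(
            "INSERT INTO dataset (eufid, smid, title) VALUES (?, ?, ?)",
            (eufid, smid, title),
        )
        conn.executemany(
            "INSERT INTO data (eufid, chrom, chrom_start, chrom_end) VALUES (?, ?, ?, ?)",
            [(eufid, *record) for record in records],
        )
    return eufid


smid = new_id(8)
with conn:
    conn.execute("INSERT INTO project (smid, title) VALUES (?, ?)", (smid, "Example project"))
eufid = upload_dataset(conn, smid, "Example dataset", [("1", 100, 101), ("1", 200, 201)])
```

With foreign keys enabled, an upload against an SMID that does not exist fails with `sqlite3.IntegrityError`, so datasets cannot be created before the SMID.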

Note: I don't know yet how to incorporate genomic information. To match input data and genomic data, we also need to have a standardized nomenclature for assembly and annotation (source, version). We probably need DB tables for this. As for actual data wrangling, I don't think we want to perform DB operations to replace Bedtools, so we need to come up with a smart way of adding this info to the DB for Search and Compare...

eboileau commented 11 months ago

See docs, currently under database. Assembly/annotation handling is WIP, I will open separate issues where required.