Data upload, Search, and Compare design (from data model), and ID assignment

eboileau commented 1 year ago

Aims/objectives.

The Search view will allow users to search RNA modifications, filtering by (i) RNA modification type (e.g. m6A, m5C, ...), (ii) technology (Quantification, Locus-specific, NGS 2nd generation, NGS 3rd generation), (iii) organism, (iv) region (3'UTR, CDS, ...), and (v) eventually gene id/name or genomic coordinates.

This is doable in principle, but so far I haven't touched the details (e.g. how to implement dynamic dependent dropdowns), and the whole database/data model still needs to be implemented. The open question is how to combine the filters to search the DB.
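One possible way to combine the filters, as a minimal sketch only: build the `WHERE` clause from whichever filters are set and join them with `AND`. The `record` table, its columns, and the functions `build_search` and `search` are all hypothetical (the actual data model is not defined yet), and the gene names are placeholders.

```python
import sqlite3

# Hypothetical flat table; the real Sci-ModoM schema is still to be defined.
SCHEMA = """
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    modification TEXT, technology TEXT, organism TEXT, region TEXT, gene TEXT
)
"""

# Only these column names may ever be interpolated into the SQL string.
ALLOWED = ("modification", "technology", "organism", "region", "gene")


def build_search(filters):
    """Return (sql, params), combining only the filters that are set with AND."""
    clauses, params = [], []
    for column in ALLOWED:
        value = filters.get(column)
        if value is not None:
            clauses.append(f"{column} = ?")  # values are always bound, never inlined
            params.append(value)
    sql = "SELECT id FROM record"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY id", params


def search(conn, **filters):
    sql, params = build_search(filters)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO record (modification, technology, organism, region, gene) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("m6A", "NGS 2nd generation", "H. sapiens", "3'UTR", "GENE_A"),
        ("m6A", "NGS 3rd generation", "H. sapiens", "CDS", "GENE_B"),
        ("m5C", "NGS 2nd generation", "M. musculus", "CDS", "GENE_C"),
    ],
)
```

With this shape, `search(conn, modification="m6A", region="CDS")` narrows the result set, and calling `search(conn)` with no filter returns everything. An ORM would likely express the same idea by chaining filter conditions.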

The classification of detection technologies is taken from this paper, e.g. NGS 2nd generation is subclassified into direct sequencing, chemical-assisted sequencing (m6A-SAC-seq, RiboMeth-seq, ...), antibody-based (m6A-seq/MeRIP, ...), and enzyme/protein-assisted (DART-seq, MAZTER-seq, ...).

A clear and concise description of todo items.

The first questions related to the design of the Search view are:

[ ] So far, we haven't considered "assembly". This was part of the prototype. This information should come with EUF, but in Sci-ModoM, I see 2 options:

(i) we only provide one assembly per organism, meaning that we might have to perform lift-overs, etc. Pros: simpler UI, less confusing for users. Cons: more work for us, and it is unclear how to handle data submitted by the user.

(ii) we "stick" with the data, e.g. by adding H. sapiens hg19, H. sapiens hg38, ... Pros: we don't change the data. Cons: I don't actually know how much easier data submission will be if a new "organism" has to be created each time, and the UI is more complicated.
[ ] Do we care about annotation source and/or version? We can report it.

eboileau commented 1 year ago

We also need to keep in mind that we want to allow users to upload their data and also compare datasets (Compare view), and we don't want e.g. to intersect regions in datasets that are from different assemblies.
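A minimal sketch of such a guard for the Compare view, assuming each dataset carries its organism and assembly as metadata. The `Dataset` class, its fields, and `check_comparable` are hypothetical names used for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    eufid: str      # hypothetical dataset identifier
    organism: str
    assembly: str   # e.g. "hg19" or "hg38"


def check_comparable(datasets):
    """Raise ValueError unless all datasets share one organism and assembly."""
    keys = {(d.organism, d.assembly) for d in datasets}
    if len(keys) > 1:
        raise ValueError(f"Cannot compare across assemblies: {sorted(keys)}")
    return True
```

The check would run before any intersection is attempted, so a mismatch is reported to the user instead of silently producing meaningless overlaps.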

eboileau commented 1 year ago

This is the current proposal:

To enforce a standard nomenclature for the modifications, technologies, etc., some DB tables (modification, technology, source, species) must be "fixed", with regular updates. Entries in these tables define the options available for data upload and for the Search and Compare filters.
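As a rough illustration of how fixed tables could drive dependent dropdowns, the sketch below reads the options for one dropdown from the DB, given the value selected in another. The table layout, the `selection` link table, and both functions are assumptions, not the actual schema. The technology names are only example entries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE modification (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE technology (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- Hypothetical link table: which technology is available for which modification.
CREATE TABLE selection (
    modification_id INTEGER NOT NULL REFERENCES modification(id),
    technology_id INTEGER NOT NULL REFERENCES technology(id),
    UNIQUE (modification_id, technology_id)
);
INSERT INTO modification (name) VALUES ('m6A'), ('m5C');
INSERT INTO technology (name)
    VALUES ('m6A-SAC-seq'), ('MAZTER-seq'), ('DART-seq'), ('BS-seq');
INSERT INTO selection VALUES (1, 1), (1, 2), (1, 3), (2, 4);
""")


def modification_options(conn):
    """Options for the first dropdown: every entry of the fixed table."""
    rows = conn.execute("SELECT name FROM modification ORDER BY name")
    return [r[0] for r in rows]


def technology_options(conn, modification):
    """Options for the dependent dropdown, restricted to the chosen modification."""
    rows = conn.execute(
        """SELECT t.name FROM technology t
           JOIN selection s ON s.technology_id = t.id
           JOIN modification m ON m.id = s.modification_id
           WHERE m.name = ? ORDER BY t.name""",
        (modification,),
    )
    return [r[0] for r in rows]
```

The client would call the second function each time the first dropdown changes. Since only maintainers update the fixed tables, the option lists stay consistent between upload, Search, and Compare.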

SMID creation should be done via requests to Sci-ModoM maintainers, with additional information on RNA type/modification, technology used, organism, tissue, etc. This information then allows us to update selected tables with new modifications, technologies, organisms, etc.

Once the SMID is created, in principle, we could allow data upload. A given SMID can have one or more datasets (corresponding to EUF/bedMod files).

EUFID association is done at upload, but we need to make sure the ID is immediately available to fill additional tables.
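One way to have the ID available immediately, sketched under assumptions: generate the EUFID in the application before any `INSERT`, then write the dataset row and its dependent rows in a single transaction. The tables, the 12-character ID format, the SMID value, and `upload_dataset` are all hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (smid TEXT PRIMARY KEY);
CREATE TABLE dataset (
    eufid TEXT PRIMARY KEY,
    smid TEXT NOT NULL REFERENCES project(smid),
    filename TEXT NOT NULL
);
CREATE TABLE record (
    id INTEGER PRIMARY KEY,
    eufid TEXT NOT NULL REFERENCES dataset(eufid),
    chrom TEXT, chrom_start INTEGER, chrom_end INTEGER
);
""")


def upload_dataset(conn, smid, filename, rows):
    """Create the dataset and its records in one transaction; return the new EUFID."""
    eufid = uuid.uuid4().hex[:12]  # assumed ID format, known before any INSERT
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO dataset (eufid, smid, filename) VALUES (?, ?, ?)",
            (eufid, smid, filename),
        )
        conn.executemany(
            "INSERT INTO record (eufid, chrom, chrom_start, chrom_end) "
            "VALUES (?, ?, ?, ?)",
            [(eufid, chrom, start, end) for chrom, start, end in rows],
        )
    return eufid


conn.execute("INSERT INTO project (smid) VALUES ('SMID0001')")
eufid = upload_dataset(conn, "SMID0001", "example.bed", [("1", 100, 101), ("1", 200, 201)])
```

An alternative would be to let the DB assign the key and read it back (e.g. `cursor.lastrowid` for integer keys), but generating it up front avoids a round trip before the additional tables can be filled.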

Note: I don't know yet how to incorporate genomic information. To match input data and genomic data, we also need a standardized nomenclature for assembly and annotation (source, version), and we probably need DB tables for this. As for the actual data wrangling, I don't think we want to perform DB operations to replace Bedtools, so we need to come up with a smart way of adding this information to the DB for Search and Compare.

eboileau commented 11 months ago

See the docs, currently under database. Assembly/annotation handling is WIP; I will open separate issues where required.

dieterich-lab / scimodom

Data upload, Search, and Compare design (from data model), and ID assignment #13
