Bioconductor / Organism.dplyr

https://bioconductor.org/packages/Organism.dplyr

Appropriate table structure #2

Open mtmorgan opened 7 years ago

mtmorgan commented 7 years ago

Organism.dplyr simplifies the bimap and table structure of org and TxDb packages to a small number of tables, but what are the optimal arrangement and membership of tables? Already Organism.dplyr is much more user-friendly than the org / TxDb / Homo.sapiens packages, so it is valuable for that reason alone. Note that the genes(), transcripts(), exons(), and cds() verbs are already contracted to return a GRanges; we have genes_tbl() etc. returning tibbles.

jorainer commented 7 years ago

And what about reducing the number of tables a little more by also adding the genomic coordinates to the gene, tx and exon database tables? Then each of these tables would contain all relevant information for that entity, and users wouldn't have to join a gene and a ranges_gene table to get all gene-associated information.
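To make the suggestion concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table layouts, column names (`start_pos`, `end_pos`, etc.), identifiers and coordinates are all made up for illustration and are not the actual Organism.dplyr schema. It shows the join users currently need, wrapped in a view that behaves like a single "wide" gene table.

```python
import sqlite3

# In-memory database with two hypothetical tables: gene annotations and
# gene coordinates. All names and values are illustrative placeholders.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT);
CREATE TABLE ranges_gene (
    gene_id TEXT, chrom TEXT, start_pos INTEGER, end_pos INTEGER, strand TEXT
);
INSERT INTO gene VALUES ('g1', 'GENE_A'), ('g2', 'GENE_B');
INSERT INTO ranges_gene VALUES
    ('g1', 'chr1', 100, 500, '+'),
    ('g2', 'chr2', 300, 900, '-');

-- The "wide" table: one row per gene with annotation and coordinates,
-- so the user never has to write the join.
CREATE VIEW gene_wide AS
    SELECT g.gene_id, g.symbol, r.chrom, r.start_pos, r.end_pos, r.strand
    FROM gene AS g
    JOIN ranges_gene AS r ON g.gene_id = r.gene_id;
""")

wide_rows = con.execute("SELECT * FROM gene_wide ORDER BY gene_id").fetchall()
for row in wide_rows:
    print(row)
```

Whether this is a view or a physically merged table is a separate design decision; the view keeps the normalized storage and only changes what the user sees.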

lawremi commented 7 years ago

I think we should (virtually at least) have a single table representing the data. That encapsulates all data in a single object, which is important for, e.g., plotting track views. Obviously, the TxDb already is that object; this would just provide a denormalized, tabular view of it (and thus would integrate well with the tidyverse). I'm not so worried about the performance of the table, because it could be efficiently filtered and reduced, with the right indexes in place.

As I said before, GFF3 is a good starting point, but instead of representing nested features using the "Parent" and "ID" attributes, we might have just "tx_id" and "gene_id" for the two-level grouping, since we are not trying to be as general. Then, getting the transcripts, or genes, or exons, etc, would just be a filter() call on the "type" column.
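A rough sketch of that layout, again with Python's `sqlite3` and entirely made-up rows: one `features` table carrying a `type` column plus `gene_id` and `tx_id` for the two-level grouping, with an index on `type`. The table name, column names and the helper `features_of()` are assumptions for illustration, not an existing API.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One denormalized table; 'type' plays the role of the GFF3 type column,
-- gene_id / tx_id replace the generic ID / Parent attributes.
CREATE TABLE features (
    type TEXT, gene_id TEXT, tx_id TEXT,
    chrom TEXT, start_pos INTEGER, end_pos INTEGER, strand TEXT
);
INSERT INTO features VALUES
    ('gene',       'g1', NULL, 'chr1', 100, 500, '+'),
    ('transcript', 'g1', 't1', 'chr1', 100, 500, '+'),
    ('exon',       'g1', 't1', 'chr1', 100, 200, '+'),
    ('exon',       'g1', 't1', 'chr1', 400, 500, '+');
-- Index so that filtering on type does not scan the whole table.
CREATE INDEX idx_features_type ON features (type);
""")


def features_of(con, feature_type):
    """Return all rows of one feature type (hypothetical helper)."""
    query = (
        "SELECT gene_id, tx_id, chrom, start_pos, end_pos, strand "
        "FROM features WHERE type = ? ORDER BY start_pos"
    )
    return con.execute(query, (feature_type,)).fetchall()


exons = features_of(con, "exon")
genes = features_of(con, "gene")
print(exons)
print(genes)
```

In dplyr terms the helper corresponds to a `filter()` on the `type` column; the SQL `WHERE` clause is what such a call would be translated to.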

If it's possible for a "tbl" to contain non-vector columns, we could have the option to return a tbl with the actual coordinates stored as a GRanges.

mtmorgan commented 7 years ago

I agree that people don't do well with multiple tables (actually, I don't know that they've had much chance to learn basic join semantics; maybe they would grok them quite easily, at least as an incremental skill over the basic table operations).

Even though not particularly 'rich', the org annotations are problematic when tables have 1 (e.g., gene id):many (e.g., GO) mappings. But maybe you're thinking of just the TxDb components. It would be very helpful to construct some (Organism.dplyr-derived) real examples of what an appropriate table should look like.
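As a small illustration of the 1:many problem (a sketch with placeholder identifiers, not real gene or GO IDs, and not the real org schema): joining a gene table to a gene-to-GO table repeats the gene columns once per GO term, so the result no longer has one row per gene.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE gene (gene_id TEXT PRIMARY KEY, symbol TEXT);
CREATE TABLE gene_go (gene_id TEXT, go_id TEXT);
INSERT INTO gene VALUES ('g1', 'GENE_A'), ('g2', 'GENE_B');
-- g1 maps to two GO terms, g2 to one (placeholder term names).
INSERT INTO gene_go VALUES ('g1', 'go_a'), ('g1', 'go_b'), ('g2', 'go_c');
""")

joined = con.execute("""
    SELECT g.gene_id, g.symbol, m.go_id
    FROM gene AS g
    JOIN gene_go AS m ON g.gene_id = m.gene_id
    ORDER BY g.gene_id, m.go_id
""").fetchall()

n_genes = con.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
print(n_genes, "genes become", len(joined), "rows after the join")
for row in joined:
    print(row)
```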

Also, I'm not sure that GFF3 is particularly tidy; it has an implicit relational table in the final column, and users are somehow supposed to know how to parse that. I guess the tidy representation makes that long, replacing the relation with replication. I would be more comfortable with that approach than GFF3.
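For reference, a minimal sketch of what "making that long" could look like. It assumes the usual GFF3 convention for the ninth column (semicolon-separated `tag=value` pairs, comma-separated multiple values, URL-style escaping); the function name `attributes_long()` and the example attribute string are made up for illustration.

```python
from urllib.parse import unquote


def attributes_long(feature_key, attributes):
    """Expand one GFF3 attribute string into (feature, tag, value) rows.

    A tag with several comma-separated values yields one row per value,
    i.e. the implicit relation is replaced by replication.
    """
    rows = []
    for pair in attributes.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        for single in value.split(","):
            rows.append((feature_key, unquote(tag), unquote(single)))
    return rows


long_rows = attributes_long("exon1", "ID=exon1;Parent=tx1,tx2")
for row in long_rows:
    print(row)
```

Here an exon shared by two transcripts becomes two `Parent` rows, which is the replication mentioned above.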

jorainer commented 7 years ago

Just adding a little to that. I like @lawremi's suggestion of one table containing all (genomic) gene, transcript and exon information. Still, I think some data should be put into separate tables. This includes, e.g., the GO mappings @mtmorgan mentioned. I would also add a separate table providing protein annotations. In ensembldb I have them now, and they are pretty rich, with protein sequences, n:m mappings of protein IDs to UniProt IDs, and mappings to protein domains with coordinates within the protein sequence.