Closed bradfordcondon closed 4 years ago
I was going to use my mini devseed, but I realize that isn't a great representation of your data, since it's mRNA-focused instead of gene-focused.
I will therefore load the bee genome from https://i5k.nal.usda.gov/content/data-downloads ... I should probably still minify it since space is a real concern.
OK, so let's compare this to a DEFAULT gene page.
Note: I don't have any sequences appearing. This is a mistake on my end. I loaded scaffold, protein, and CDS FASTA sequences and associated the sequences with the CDS. The fields on gene by default want to display the mRNA-associated sequences :(
Let me upload the mRNA sequences instead and they should appear here. Edit: even after doing so, the mRNA sequences don't appear. Hmmmm, I wonder if the field is bugged or not expecting how I loaded the data. Let me look into it.....
http://167.99.232.220/gene/185
discussion of gene page improvements: https://github.com/tripal/tripal/issues/100
The transcript field is not designed to display sequence, just what's included in the field: https://github.com/tripal/tripal/blob/7.x-3.x/tripal_chado/includes/TripalFields/so__transcript/so__transcript.inc
Instead, the data__sequence field should display sequences: https://github.com/tripal/tripal/blob/7.x-3.x/tripal_chado/includes/TripalFields/data__sequence/data__sequence.inc . However, per Tripal's general philosophy, the sequence field only looks at the targeted feature, not related features.
For example, the data__protein_sequence field displays protein sequences.... [but only for features that are DIRECTLY CONNECTED to said feature](https://github.com/tripal/tripal/blob/a2f3c1e3414011adfced99e98b15365326904792/tripal_chado/includes/TripalFields/data__protein_sequence/data__protein_sequence.inc#L86-L97).
So as a definite starting point, we want one or two fields that
In both cases we need to think about speed and display, as we could have several child mRNA each with several child CDS, proteins, etc. We might choose to limit data to, for example, JUST retrieving the mRNA, CDS, and protein details.
I'm going to start with a "general all child feature" field, specific for gene, and go from there. The term I'll use is data:0916 from EDAM: "Gene report".
https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_0916
There's also nucleic acid report (https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_2084) if we want to make it more feature agnostic.
or Sequence features: https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_1255
this last one might be too heavily tied to the "feature table format" whatever that is.
Here's my first step. Next step is to provide a clickable popup element for the sequences
I'm torn on whether the popup should just be for sequence, or if it should have ALL extra information, including the featureloc (start, stop, strand) and the annotation definitions (right now the annotation field is just a list of annotation names). I need to load some example data so that it shows up here.....
I also had a "parent" field but in this data's case all the parents were the mRNA so I didn't include it...
@bradfordcondon would there be a way to tie the sequence retrieval into the Tripal collections module somehow?
We currently only have functionality for users to copy/paste from the popup. Ideally though they'd be able to download as fasta. Or both.
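As a rough illustration of the "download as fasta" idea, here is a minimal sketch of turning a set of child-feature sequences into FASTA text. It is written in Python purely for illustration (the module itself is PHP), and the function name `to_fasta` and the 60-column wrap width are assumptions, not anything that exists in Tripal.

```python
def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text.

    `records` is any iterable of (header, sequence) tuples; `width` is the
    number of residues per line. Both names are hypothetical.
    """
    lines = []
    for header, sequence in records:
        lines.append(f">{header}")
        # Slice the sequence into fixed-width chunks, one per output line.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder sequence, not real data.
    print(to_fasta([("example_mRNA_1", "ACGT" * 20)]), end="")
```

The same text could feed both the copy/paste popup and a download link, so the two options would not need separate formatting code.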
Collections are very very buggy right now. Unfortunately I can't recommend planning around them from an end user perspective (they are still very useful for admin purposes).
That said it would be easy and desirable to have an "add feature to collection" button. In this case the question then becomes collection of WHAT, right? Which feature type sequences?
We can have a separate discussion about that. But yes, reading the second part of your message over again, I'm in agreement: the popup could offer both.
I've added a very simple popup window for each sequence.
Bug I'm aware of: that CDS has ::::: for annotations. It actually has no annotations, but it does have that many unique start/stop locations. I forgot that table could have multiple entries, so I am redoing the query.
I loaded the apis set @mpoelchau provided me with. here are the two genes:
Conclusions I draw:
We want to infer the sequence for each child feature if it doesn't have one directly associated (right?). If yes, then that's simple for some features, and we need to be cautious for others (CDS comes to mind... any others? My biology is rusty...)
The fact that the parent table is the top one, and the children table is the bottom, isn't clear at all. Let's maybe reformat the first one to just be two columns, key and value? And add the name, uniquename, etc., to make it more clear?
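To make the "simple for some, cautious for others" point concrete, here is a minimal Python sketch of inferring a child sequence from a parent's residues. It assumes Chado-style featureloc coordinates (interbase `fmin`/`fmax`, `strand` of 1 or -1) and that the parent sequence is available as a string. The function names are hypothetical and this is not how the Tripal API does it.

```python
# Complement table for DNA, including N and lowercase (soft-masked) bases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def infer_sequence(parent_seq, fmin, fmax, strand):
    """Single-location feature: slice the parent, flip if on the minus strand."""
    sub = parent_seq[fmin:fmax]
    return reverse_complement(sub) if strand < 0 else sub


def infer_spliced_sequence(parent_seq, segments, strand):
    """Multi-location feature (e.g. a CDS): join the segments, then flip.

    `segments` is a list of (fmin, fmax) pairs. They are joined in ascending
    genomic order, and the joined string is reverse-complemented for
    minus-strand features so the result reads 5' to 3'.
    """
    joined = "".join(parent_seq[fmin:fmax] for fmin, fmax in sorted(segments))
    return reverse_complement(joined) if strand < 0 else joined
```

The caution is that a single slice returns the whole genomic span, introns included, so anything spliced (CDS, and mRNA if it should exclude introns) needs the segment-joining version rather than one start/stop pair.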
Additional: We need to link out to feature pages if they exist by checking if the entity exists.
Rather than try to cram all of the child feature info into a single field, the approach taken will be to
a) create a master index field which follows tripal web services best practice and has all the info needed for child features,
b) have other fields check that that field has loaded. Once it has, pull in the info they need.
The master field has a draft done. I put the code in my catch-all module for now: https://github.com/statonlab/tripal_manage_analyses/tree/gene_field
It stores an object looking like this in its value:
[feature_id of first child] => [ info => ['array_of_info'], children => ['array_of_children keyed by feature_id']]
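For readers who want to see that shape concretely, here is a small Python sketch that builds the same kind of nested index from a flat list of parent/child relationships. It is an illustration only: the real field is PHP, and the function name, the `features` lookup, and the `(child_id, parent_id)` pair format are assumptions. It also assumes the relationships form a tree with no cycles.

```python
def build_child_index(features, relationships, root_id):
    """Build {child_id: {"info": ..., "children": {...}}} below `root_id`.

    `features` maps feature_id -> dict of info about that feature.
    `relationships` is a list of (child_id, parent_id) pairs.
    Both input shapes are hypothetical.
    """
    # Group child ids under their parent id for quick lookup.
    children_of = {}
    for child_id, parent_id in relationships:
        children_of.setdefault(parent_id, []).append(child_id)

    def subtree(parent_id):
        # Each node carries its own info plus a nested dict of its children.
        return {
            child_id: {
                "info": features[child_id],
                "children": subtree(child_id),
            }
            for child_id in children_of.get(parent_id, [])
        }

    return subtree(root_id)


if __name__ == "__main__":
    features = {
        1: {"name": "gene1", "type": "gene"},
        2: {"name": "gene1.1", "type": "mRNA"},
        3: {"name": "gene1.1.cds1", "type": "CDS"},
    }
    index = build_child_index(features, [(2, 1), (3, 2)], root_id=1)
    print(index)
```

Keying every level by feature_id means another field can walk straight to a given child without re-querying the database.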
As you can see, each node of the array has info describing that feature, and children, which is a list of feature children associated with that feature. At each node, in info, we've crammed in all the stuff we're going to want to pass along to other fields. This includes:
Now that the base index field is taken care of, I'll work on the childprop field.
Here's the child properties field in action:
We're on a gene page (FRAEX38873_v2_000001410). Each mRNA (FRAEX38873_v2_000001410.1 & 2) has a collapsible fieldset with a table inside listing any featureprops associated with the mRNA, any child of the mRNA, or any children of that child, e.g. FRAEX38873_v2_000001410.2.cds3.
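The grouping described above amounts to a depth-first walk of each mRNA's subtree in the index. Here is a minimal Python sketch of that walk, for illustration only. It assumes each node has the `info`/`children` shape of the master field, and the `props` key inside `info` plus both function names are hypothetical.

```python
def collect_props(node):
    """Yield (feature_name, prop_type, prop_value) for a node and its descendants."""
    info = node["info"]
    for prop_type, value in info.get("props", []):
        yield info["name"], prop_type, value
    # Recurse so props on CDS, exons, etc. land under the same mRNA.
    for child in node["children"].values():
        yield from collect_props(child)


def props_by_transcript(index):
    """Return {mRNA name: [rows]}, one entry per top-level child of the gene."""
    return {
        node["info"]["name"]: list(collect_props(node))
        for node in index.values()
    }
```

Each key of the result would back one collapsible fieldset, and each row one line of the table inside it.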
I'll follow this general design pattern for the other alternate fields: so the new annotations field, for example, will similarly be broken down into collapsible fieldsets by mRNA. We do this to keep the information manageable-- we expect different isoforms of the same transcript to have many of the same annotations.
I opened a PR into core https://github.com/tripal/tripal/pull/837
We have three fields:
the base field that stores all info. It also maps out the feature:
There are some outstanding issues I'm working on.
I think as far as I5K is concerned, what's really missing is a nice way of displaying all the sequence information.
a note on collapsible fieldsets- I just merged a pr that makes them compatible with i5k's theme.
OK, this isn't amazing, but it does work on all themes, which was the trick. We now have a sequence column. Clicking on the word "sequence" expands it, clicking again hides it.
We use the chado_get_feature_sequences API, which in theory (but not in practice for the i5k data I've tested) should find sequences derived from mapped sequences.
Great discussion here but in the interim we're using the vanilla tripal gene pages. We'll resume the discussion on gene pages at a later date.
https://i5k.nal.usda.gov/CLEC010822 vs http://167.99.232.220/gene/185
My plan:
I'll get the demo site up so that I can link to compare screenshots and then fill out the rest of the below...
First impressions:
General
The Overview, Sequences, and Transcripts "tabs" are set up as Tripal panes in Tripal 3. It will take some custom styling to get them looking like you want (although the horizontal layout is a default option, I think).
Overview tab
Pre-existing Chado fields:
Depends on how it's stored in the db:
Sequences
Transcripts
So CLEC010822-RA doesn't have its own page? I actually like this a lot. Something I don't like about Tripal 3 genes is they only display select info about child features, and link out to child genes, proteins, etc. I like it much better all displayed on the gene page.
Hmm, start/stop of each subfeature: I don't know that this exists. Subfeatures get listed as relationships, and sequences get listed in their relevant fields. I can imagine this becoming a custom field.