NAL-i5K / general_issues

for issues and discussions not tied to a specific repository

Moving the i5k gene page to tripal 3 #5

Closed by bradfordcondon 4 years ago

bradfordcondon commented 6 years ago

https://i5k.nal.usda.gov/CLEC010822 vs http://167.99.232.220/gene/185

My plan:

I'll get the demo site up so that I can link to screenshots for comparison, and then fill out the rest of the sections below...

First impressions:

General

The Overview, Sequences, and Transcripts "tabs" are set up as Tripal panes in Tripal 3. It will take some custom styling to get them looking like you want (although the horizontal layout is a default option, I think).

Overview tab

Screen Shot 2018-10-30 at 2.17.28 PM.png

Pre-existing Chado fields:

Depends on how it's stored in the db:

Sequences

Transcripts

So CLEC010822-RA doesn't have its own page? I actually like this a lot. Something I don't like about Tripal 3 gene pages is that they only display select info about child features, and link out to child mRNAs, proteins, etc. I like it much better with everything displayed on the gene page.

Hmm, start/stop of each subfeature: I don't know that this exists. Subfeatures get listed as relationships, and sequences get listed in their relevant fields. I can imagine this becoming a custom field.

bradfordcondon commented 6 years ago

I was going to use my mini devseed, but I realize that isn't a great representation of your data since it's mRNA-focused instead of gene-focused.

I will therefore load the bee genome from https://i5k.nal.usda.gov/content/data-downloads ... I should probably still minify it since space is a real concern.

bradfordcondon commented 6 years ago

OK, so let's compare this to a DEFAULT gene page.

Note: I don't have any sequences appearing. This is a mistake on my end. I loaded scaffold, protein, and CDS FASTA sequences and associated the sequences with the CDS. By default, the fields on the gene page want to display the mRNA-associated sequences :(

Let me upload the mRNA sequences instead and they should appear here. Edit: even after doing so, the mRNA sequences don't appear. Hmmm, I wonder if the field is bugged or isn't expecting the way I loaded the data. Let me look into it...

http://167.99.232.220/gene/185

screen shot 2018-10-31 at 9 49 49 am screen shot 2018-10-31 at 9 50 14 am

screen shot 2018-10-31 at 9 50 31 am

bradfordcondon commented 6 years ago

Discussion of gene page improvements: https://github.com/tripal/tripal/issues/100

The transcript field is not designed to display sequence, just what's included in the field: https://github.com/tripal/tripal/blob/7.x-3.x/tripal_chado/includes/TripalFields/so__transcript/so__transcript.inc

Instead, the data__sequence field should display sequences: https://github.com/tripal/tripal/blob/7.x-3.x/tripal_chado/includes/TripalFields/data__sequence/data__sequence.inc . However, per Tripal's general philosophy, the sequence field only looks at the targeted feature, not related features.
For example, the data__protein_sequence field displays protein sequences... [but only for features that are DIRECTLY CONNECTED to said feature](https://github.com/tripal/tripal/blob/a2f3c1e3414011adfced99e98b15365326904792/tripal_chado/includes/TripalFields/data__protein_sequence/data__protein_sequence.inc#L86-L97).

So as a definite starting point, we want one or two fields that

In both cases we need to think about speed and display, as we could have several child mRNAs, each with several child CDS, proteins, etc. We might choose to limit the data to, for example, JUST retrieving the mRNA, CDS, and protein details.

bradfordcondon commented 6 years ago

I'm going to start with a "general all child feature" field, specific to gene, and go from there. The term I'll use is data:0916 from EDAM: "Gene report".

https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_0916

There's also "Nucleic acid report" (https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_2084) if we want to make it more feature-agnostic.

Or "Sequence features": https://www.ebi.ac.uk/ols/ontologies/edam/terms?iri=http%3A%2F%2Fedamontology.org%2Fdata_1255

This last one might be too heavily tied to the "feature table format", whatever that is.

bradfordcondon commented 6 years ago

Here's my first step. The next step is to provide a clickable popup element for the sequences.

screen shot 2018-11-05 at 3 11 21 pm

I'm torn on whether the popup should just be for sequence, or whether it should have ALL extra information, including the featureloc (start, stop, strand) and the annotation definitions (right now the annotation field is just a list of annotation names). I need to load some example data so that it shows up here...

I also had a "parent" field, but in this data's case all the parents were the mRNA, so I didn't include it...

mpoelchau commented 6 years ago

@bradfordcondon would there be a way to tie the sequence retrieval into the Tripal collections module somehow?

We currently only have functionality for users to copy/paste from the popup. Ideally though they'd be able to download as fasta. Or both.
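For reference, the "download as fasta" half of this request comes down to wrapping a sequence in a FASTA record. Below is a minimal standalone sketch in Python. It is not Tripal or Tripal collections code, and the function name `to_fasta` and the 60-character line width are assumptions made for illustration.

```python
def to_fasta(name: str, residues: str, width: int = 60) -> str:
    """Format one sequence as a FASTA record.

    `name` becomes the header line and `residues` is wrapped at `width`
    characters per line. The width of 60 is a common convention, not a
    requirement of the format.
    """
    if width < 1:
        raise ValueError("width must be a positive integer")
    # Slice the residue string into fixed-width chunks.
    chunks = [residues[i:i + width] for i in range(0, len(residues), width)]
    body = "".join(chunk + "\n" for chunk in chunks)
    return ">" + name + "\n" + body
```

A popup could offer the same string both as copyable text and as a file download, which would cover the "or both" case.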

bradfordcondon commented 6 years ago

Collections are very very buggy right now. Unfortunately I can't recommend planning around them from an end user perspective (they are still very useful for admin purposes).

That said it would be easy and desirable to have an "add feature to collection" button. In this case the question then becomes collection of WHAT, right? Which feature type sequences?

We can have a separate discussion about that. But yes, reading the second part of your message over again, I'm in agreement. The popup could offer both.

bradfordcondon commented 6 years ago

I've added a very simple popup window for each sequence.

screen shot 2018-11-06 at 11 37 37 am

Bug I'm aware of: that CDS shows ::::: for annotations. It actually has no annotations, but it does have that many unique start/stop locations. I forgot that table could have multiple entries per feature, so I am redoing the query.

I loaded the Apis set @mpoelchau provided me with. Here are the two genes:
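A rough illustration of how this kind of bug arises, as a standalone Python sketch rather than the actual field code: when a feature is joined against a table with several location rows, each location produces its own result row, and joining the (empty) annotation from every row yields only separators. The row layout and function names here are assumptions for illustration.

```python
from collections import defaultdict

def naive_annotation_string(rows):
    """Reproduce the bug: one (empty) annotation per joined row."""
    return ":".join(annotation or "" for _, annotation, _ in rows)

def summarize_children(rows):
    """Group joined rows by feature so repeated locations do not
    multiply the annotations.

    `rows` is an iterable of (feature_id, annotation_or_None, (fmin, fmax)).
    """
    annotations = defaultdict(set)
    locations = defaultdict(set)
    for feature_id, annotation, location in rows:
        locations[feature_id].add(location)
        if annotation is not None:
            annotations[feature_id].add(annotation)
    return {
        feature_id: {
            "annotations": sorted(annotations[feature_id]),
            "locations": sorted(locations[feature_id]),
        }
        for feature_id in locations
    }

# One CDS with three locations and no annotations.
example_rows = [(7, None, (0, 10)), (7, None, (20, 30)), (7, None, (40, 50))]
```

Collecting locations and annotations into separate sets per feature keeps an unannotated CDS with many locations from rendering as a string of colons.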

screen shot 2018-11-06 at 11 33 21 am

screen shot 2018-11-06 at 11 33 05 am

Conclusions I draw:

We want to infer the sequence for each child feature if it doesn't have one directly associated (right?). If yes, then that's simple for some features, and we need to be cautious for others (CDS comes to mind... any others? My biology is rusty...).

The fact that the parent table is the top one, and the children table is the bottom one, isn't clear at all. Let's maybe reformat the first one to just be two columns, key and value? And add the name, uniquename, etc., to make it clearer?

Additional: We need to link out to feature pages if they exist, by checking whether the entity exists.
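To make the "simple for some, cautious for others" point concrete, here is a minimal standalone Python sketch of inferring a child sequence from its parent's residues. It assumes Chado-style interbase coordinates (`fmin` is 0-based, `fmax` is exclusive) and a strand of 1 or -1. The function names are hypothetical, and this is not how Tripal itself retrieves sequences.

```python
# Complement table for reverse-strand features (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def subsequence(parent_residues, fmin, fmax, strand=1):
    """Return the residues for a single location on the parent sequence."""
    seq = parent_residues[fmin:fmax]
    if strand < 0:
        # Reverse complement for features on the minus strand.
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq

def spliced_sequence(parent_residues, segments, strand=1):
    """Concatenate several (fmin, fmax) segments in transcription order.

    This is the cautious case: a CDS made of several segments is not the
    single span from its smallest fmin to its largest fmax.
    """
    ordered = sorted(segments, reverse=(strand < 0))
    return "".join(
        subsequence(parent_residues, fmin, fmax, strand)
        for fmin, fmax in ordered
    )
```

A single-location feature only needs `subsequence`, while a multi-segment CDS needs `spliced_sequence`. Taking one slice across the whole CDS span would wrongly include the introns.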

bradfordcondon commented 5 years ago

Rather than try to cram all of the child feature info into a single field, the approach taken will be to

a) create a master index field which follows Tripal web services best practice and has all the info needed for child features,

b) have other fields check that that field has loaded. Once it has, pull in the info they need.

The master field has a draft done. I put the code in my catch-all module for now: https://github.com/statonlab/tripal_manage_analyses/tree/gene_field

It stores an object looking like this in its value:

[feature_id of first child] => [ info => ['array_of_info'], children => ['array_of_children keyed by feature_id']]

As you can see, each node of the array has info describing that feature, and children, which is a list of the child features associated with that feature. At each node, in info, we've crammed in all the stuff we're going to want to pass along to other fields. This includes:

Now that the base index field is taken care of, I'll work on the childprop field.
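The nested index described above (each node holding `info` plus `children` keyed by feature_id) can be sketched outside of Tripal as follows. This is a minimal Python illustration, not the PHP in the linked branch. The flat input rows, the `parent_id` key, and the function name `build_child_index` are assumptions, and the relationships are assumed to form a tree with no cycles.

```python
def build_child_index(rows, gene_id):
    """Build a nested index of child features for one gene.

    `rows` is a list of dicts, each with 'feature_id' and 'parent_id'
    plus any descriptive keys (name, type, ...). Returns
    {feature_id: {'info': {...}, 'children': {feature_id: {...}}}}
    for the direct children of `gene_id`.
    """
    # Group rows by parent so each lookup of a node's children is cheap.
    by_parent = {}
    for row in rows:
        by_parent.setdefault(row["parent_id"], []).append(row)

    def node_for(row):
        info = {key: value for key, value in row.items() if key != "parent_id"}
        children = {
            child["feature_id"]: node_for(child)
            for child in by_parent.get(row["feature_id"], [])
        }
        return {"info": info, "children": children}

    return {row["feature_id"]: node_for(row) for row in by_parent.get(gene_id, [])}
```

Other fields can then walk this structure instead of re-querying the database for each child feature.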

bradfordcondon commented 5 years ago

Here's the child properties field in action:

screen shot 2018-12-19 at 11 38 19 am

We're on a gene page (FRAEX38873_v2_000001410). Each mRNA (FRAEX38873_v2_000001410.1 & 2) has a collapsible fieldset with a table inside listing any featureprops associated with the mRNA, with any child of the mRNA, or with any children of that child, e.g. FRAEX38873_v2_000001410.2.cds3.

I'll follow this general design pattern for the other alternate fields, so the new annotations field, for example, will similarly be broken down into collapsible fieldsets by mRNA. We do this to keep the information manageable: we expect different isoforms of the same transcript to have many of the same annotations.
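The grouping described above, one table per mRNA covering that mRNA and everything beneath it, amounts to a recursive walk over the nested index. Here is a minimal standalone Python sketch. The `props` key inside `info`, the function names, and the example names are hypothetical, and the real field is a PHP Tripal field rather than this code.

```python
def collect_props(node):
    """Yield (feature name, property name, value) for a node and all
    of its descendants, depth first."""
    info = node["info"]
    for key, value in info.get("props", {}).items():
        yield (info["name"], key, value)
    for child in node["children"].values():
        yield from collect_props(child)

def props_by_mrna(index):
    """Map each top-level child (mRNA) name to the rows of its table."""
    return {
        node["info"]["name"]: list(collect_props(node))
        for node in index.values()
    }

# Hypothetical index for a gene with two mRNAs.
example_index = {
    10: {
        "info": {"name": "gene.1", "props": {"note": "isoform 1"}},
        "children": {
            11: {"info": {"name": "gene.1.cds1", "props": {"phase": "0"}},
                 "children": {}},
        },
    },
    20: {"info": {"name": "gene.2", "props": {}}, "children": {}},
}
```

Each key of the result corresponds to one collapsible fieldset, and each tuple to one table row.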

bradfordcondon commented 5 years ago

I opened a PR into core: https://github.com/tripal/tripal/pull/837

We have three fields:

The base field that stores all info. It also maps out the feature:

image

image

There are some outstanding issues I'm working on.

I think as far as i5k is concerned, what's really missing is a nice way of displaying all the sequence information.

bradfordcondon commented 5 years ago

A note on collapsible fieldsets: I just merged a PR that makes them compatible with i5k's theme.

bradfordcondon commented 5 years ago

OK, this isn't amazing, but it does work on all themes, which was the trick. We now have a sequence column. Clicking on the word "sequence" expands it; clicking again hides it.

We use the chado_get_feature_sequences API, which in theory (but not in practice for the i5k data I've tested) should find sequences derived from mapped sequences.

Screen Shot 2019-03-25 at 1 25 11 PM

Screen Shot 2019-03-25 at 1 25 17 PM

mpoelchau commented 4 years ago

Great discussion here, but in the interim we're using the vanilla Tripal gene pages. We'll resume the discussion on gene pages at a later date.