Prepare end-to-end small workflow #13

Closed: fbastian closed this issue 4 years ago

In GitLab by @vioannid on Jul 14, 2015, 17:17

Prepare an end-to-end workflow as described below:

input file with 20 gene ids of the same species
submit to the server (use only default parameters)
get the resulting json data

Requirements: function call(s) to Bgee server to get the output data
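
A minimal sketch of the three steps above (read the gene list, submit it, parse the JSON), written in Python for brevity. The server call is injected as a function because the thread does not specify a Bgee endpoint; `mock_submit`, its reply shape, and the generated gene IDs are made-up placeholders, not Bgee's actual API or data.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def read_gene_ids(path: Path) -> List[str]:
    """Read one gene ID per line, skipping blank lines."""
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def run_workflow(path: Path, submit: Callable[[List[str]], str]) -> list:
    """Read the gene list, submit it with default parameters, parse the JSON reply."""
    gene_ids = read_gene_ids(path)
    raw = submit(gene_ids)  # stands in for the function call to the Bgee server
    return json.loads(raw)


def mock_submit(gene_ids: List[str]) -> str:
    """Hypothetical stub: returns canned JSON instead of contacting a server."""
    return json.dumps([{"submittedGenes": len(gene_ids)}])


# Demo with 20 placeholder IDs (not real genes) written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    gene_file = Path(tmp) / "genes.txt"
    gene_file.write_text("\n".join(f"ENSG{i:011d}" for i in range(1, 21)))
    result = run_workflow(gene_file, mock_submit)
print(result)
```

Injecting `submit` keeps the workflow testable before the server-side features exist; the stub can later be swapped for a real request function.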

In GitLab by @fbastian on Jul 14, 2015, 18:01

It would be good to mock the JSON data that should be returned, both to define it clearly and to avoid waiting for the features to be implemented in Bgee.

In GitLab by @fbastian on Jul 14, 2015, 18:44

Actually, I see two or three workflows:

the gene ID verification/species detection/parameters checking workflow
the launch analysis/track advancement workflow
the retrieve/render results workflow

What do you think, do you agree with those definitions? Should we tackle each one of them successively?

In GitLab by @sduvaud on Jul 15, 2015, 13:37

Gene verification/species detection: should be dynamic. When a user adds Ensembl gene identifiers, we should check:

whether the identifier is of the right format
whether it corresponds to a species in Bgee
whether there are one or many species corresponding to the Ensembl gene identifiers
choose which species should be analysed.

Obviously, we won't query the Bgee web service each time a user loads a list of genes in TopAnat. We have to perform this check on the client side.

My first idea was to use the $http service to fetch and parse, in JSON format, the list of Ensembl gene identifier prefixes with their corresponding species. The problem is where the JSON data comes from: either we get it from the server on startup, or we add it to the Bgee Angular project. The latter solution is not convenient for Frederic because it would require an extra step at each Bgee release.

We decided to try the following: Frederic will add a JSON snippet in a script tag during the HTML generation. I should check whether Angular can deal with this solution (and how).
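
A rough sketch of the checks discussed above (format, known species, one or several species), assuming the prefix-to-species map is embedded as JSON in a script tag. It is written in Python only to illustrate the logic; the real client is Angular. The script id `species-prefixes`, the three-entry map, and the regular expression (an `ENS...G` prefix followed by 11 digits) are assumptions for illustration, not the actual Bgee page or its full species list.

```python
import json
import re
from html.parser import HTMLParser


class SpeciesJsonExtractor(HTMLParser):
    """Collect the text content of the <script> tag with the given id."""

    def __init__(self, script_id):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.payload = ""

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_data(self, data):
        if self._inside:
            self.payload += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


# Hypothetical generated page; only the embedded JSON matters here.
PAGE = """<html><body>
<script type="application/json" id="species-prefixes">
{"ENSG": "Homo sapiens", "ENSMUSG": "Mus musculus", "ENSDARG": "Danio rerio"}
</script>
</body></html>"""

ID_PATTERN = re.compile(r"^(ENS[A-Z]*G)(\d{11})$")


def detect_species(gene_ids, prefixes):
    """Group IDs by species; collect IDs with a bad format or unknown prefix."""
    by_species, rejected = {}, []
    for gene_id in gene_ids:
        match = ID_PATTERN.match(gene_id.strip())
        species = prefixes.get(match.group(1)) if match else None
        if species is None:
            rejected.append(gene_id)
        else:
            by_species.setdefault(species, []).append(gene_id)
    return by_species, rejected


extractor = SpeciesJsonExtractor("species-prefixes")
extractor.feed(PAGE)
prefixes = json.loads(extractor.payload)
by_species, rejected = detect_species(
    ["ENSG00000139618", "ENSMUSG00000017167", "ENSG123", "FOO"], prefixes
)
print(by_species)
print(rejected)
```

If `by_species` ends up with more than one key, the user would have to choose which species to analyse, as described in the checklist above.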

In GitLab by @sduvaud on Jul 15, 2015, 15:50

The retrieve/render results workflow: Story board (see attached file).

The parameters are of 3 types:

data
analysis
graph

We will ignore the graph, for now.

The parameters from the "analysis" group are the following:

decorrelation type
numerical options
foreground (genome or gene list)

Although those are very important for the computation of data, they won't appear in the result table itself (but somewhere in a "Parameter summary" panel or whatever). Do they need to be part of the JSON returned data? For now, I don't think so.

The parameters from the "data" category are composed of:

the gene list (input data) + species
the expression type (presence/diff - query filter)
the quality level flag (all/high - query filter)
the analysis type (RNA-seq, Affymetrix, in situ hybridization, EST - query filter)
development stage (ontology - query filter)
anatomical structure (ontology - query filter).

Dependencies:

expression types <-> analysis types (beware: incompatibilities!)
development stage <-> input data (more precisely, species detected)
foreground <-> background

The result table, which will be built from the returned JSON data, should contain:

Gene identifier (linked to Ensembl)
Gene name/family (?)
Development stage (proper name + link to ontology)
Anatomy (proper name + link to ontology)
Analysis type
Expression (y/n)
Differential expression (+/-)

Possible JSON output: (no query filter)

{ "EnsemblId": "ENS0000000000", "GeneName": "myGene", "DevelStageId": "HsapDv:0000092", "DevelStageName": "human adult stage (human)", "AnatId": "CL:0000655", "AnatName": "secondary oocyte", "DataType": "RNA-seq", "Expression": "Absent", "DiffExpression": "over-expressed" },

The parameters panel will display:

the species of interest
the foreground
the quality level
the decorrelation type
the numerical options

Attached file: storyboard_Bgee_TopAnat_WebTeam_Jun2015.pptx

In GitLab by @fbastian on Jul 15, 2015, 17:08

Some comments:

> We will ignore the graph, for now.

Actually, no, we were just not sure whether this parameter should be put directly into the "analysis" parameters. We already use this parameter in our prototype to generate an ugly graph, so it would be used even if Cytoscape is not used to generate a graph for now.

> the analysis type (RNA-seq, Affymetrix, in situ hybridization, EST - query filter)

This is more what we call "data type".

> anatomical structure (ontology - query filter).

I don't see such a parameter in the storyboard.

> The result table, which will be built from the returned JSON data, should contain:

The result table won't contain gene IDs. See the example output files described in issue #12. The results are about retrieving organs (your point 4), not about retrieving genes.
It is true that we will add some "analysis" parameters in the returned results (developmental stage ID and name, expression type), as compared to the example output in issue #12.
About expression type, in the returned results, it won't be 'Expression (y/n)' and 'Differential expression (+/-)', it will be 'expression type (presence/differential expression)' (as the parameters in the form).
Point 5 'analysis type', I think you're speaking about 'data type', this will not be in the returned results, it will be in the "parameters panel".
In the comments of the storyboard (slide 7), the columns in the returned results were defined as:

Uberon ID | Uberon name | Stage ID | Stage name | Expression type | obs. | exp. | enrich. | p-val | FDR

In GitLab by @sduvaud on Jul 15, 2015, 17:40

Take-home message: we work at the level of a gene list!!! What we want to know is whether a list of genes is expressed in specific organs. This was unclear to me.

Current output in the prototype (as described in #12):

OrganId      OrganName         Annotated  Significant  Expected  foldEnrichment  p         fdr
XAO:0000305  cranial placode   5          3            0.01      300.00          1.04e-07  7.63e-07
XAO:0003196  olfactory system  18         5            0.04      125.00          1.10e-05  5.04e-05

As Frederic said, we will add the developmental stage IDs+name together with the expression type (presence/differential expression).

I will start with this format and use a fake JSON as a result output.
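
As an illustration of building the fake JSON from the prototype format, here is a small Python sketch that converts the two example rows quoted in this comment into JSON records and adds the stage and expression-type fields. The added field names and the stage values passed in the demo are placeholders chosen for the example, not agreed names or real ontology terms.

```python
import csv
import io
import json

# The two example rows from the prototype output, tab-separated.
TSV = (
    "OrganId\tOrganName\tAnnotated\tSignificant\tExpected\tfoldEnrichment\tp\tfdr\n"
    "XAO:0000305\tcranial placode\t5\t3\t0.01\t300.00\t1.04e-07\t7.63e-07\n"
    "XAO:0003196\tolfactory system\t18\t5\t0.04\t125.00\t1.10e-05\t5.04e-05\n"
)

INT_COLUMNS = ("Annotated", "Significant")
FLOAT_COLUMNS = ("Expected", "foldEnrichment", "p", "fdr")


def tsv_to_records(tsv_text, stage_id, stage_name, expression_type):
    """Turn TSV rows into dicts with numeric columns converted and extra fields added."""
    records = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        record = dict(row)
        for column in INT_COLUMNS:
            record[column] = int(row[column])
        for column in FLOAT_COLUMNS:
            record[column] = float(row[column])
        # Fields added on top of the prototype output (names are placeholders).
        record["DevelopmentStage"] = stage_id
        record["DevelopmentStageName"] = stage_name
        record["ExpressionType"] = expression_type
        records.append(record)
    return records


# Stage ID and name below are dummy values, not real ontology terms.
records = tsv_to_records(TSV, "PLACEHOLDER:0000000", "placeholder stage", "presence")
print(json.dumps(records, indent=2))
```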

In GitLab by @fbastian on Jul 15, 2015, 17:52

This sounds good!

(point of detail: it's not exactly whether a list of genes is expressed in specific organs, but whether their expression is enriched as compared to the background in specific organs)
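
To make the enrichment point concrete: in the two example rows quoted earlier in this thread, the reported foldEnrichment is consistent with Significant divided by Expected. The Python sketch below only checks that arithmetic; how Expected, p and fdr are derived from the background is not described in this thread and is not modelled here.

```python
import math


def fold_enrichment(significant: int, expected: float) -> float:
    """Ratio of observed significant genes to the number expected from the background."""
    return significant / expected


# (organ name, Significant, Expected, reported foldEnrichment) from the example rows.
rows = [
    ("cranial placode", 3, 0.01, 300.00),
    ("olfactory system", 5, 0.04, 125.00),
]

checks = []
for name, significant, expected, reported in rows:
    computed = fold_enrichment(significant, expected)
    matches = math.isclose(computed, reported, rel_tol=1e-9)
    checks.append(matches)
    print(name, round(computed, 2), matches)
```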

In GitLab by @sduvaud on Jul 23, 2015, 14:33

I will add the analysis type to the JSON for the "view by analysis" view (the alternative to the default "grouped result view").

I presume that the "AnalysisType" field contains "RNA-seq", "In situ...", "EST", ... but I am not 100% sure. I will start with that and will check with Frederic once we are all back.

JSON used for mocking the application:

{
"AnalysisType": "",
"OrganId": "",
"OrganName": "",
"Annotated": ,
"Significant": ,
"Expected": ,
"foldEnrichment": ,
"p": ,
"fdr": ,
"DevelopmentStage": "",
"DevelopmentStageName": "",
"ExpressionType": ""
},

In GitLab by @sduvaud on Jul 24, 2015, 14:24

The first version of the web interface was pushed to GitLab: https://gitlab.isb-sib.ch/ST/topanat-web

Features:

gene list textarea
advanced parameters
basic checks
submit button
summary of the parameters
message panel
retrieval of the mock JSON file
display of the result in a table
corresponding JUnit tests.

Missing:

e2e tests (not part of Yeoman)
integration of the resulting interface to the Bgee code.

In GitLab by @fbastian on Aug 17, 2015, 09:51

The link to the first version is broken :(

I guess that by AnalysisType, you're referring to the display of results "per analysis". An "analysis" corresponds to an expression type ('presence', 'diff expression') for a specific developmental stage. So I think you don't need a specific column for this.

Also, for consistency with the current Bgee application, you should use the term 'anatEntity' rather than 'organ' (anatEntityId, anatEntityName). Same for 'DevelopmentStage', replace it with 'devStage' (devStageId, devStageName). We'll see later which term we use in column headers.
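
The renaming suggested above could be applied to the mock records with a simple key map, sketched here in Python. Dropping `AnalysisType` follows the remark that no specific column is needed for it; the example record reuses one row from the prototype output and leaves the stage fields empty, as in the mock template.

```python
# Old mock field name -> name consistent with the Bgee application.
RENAMES = {
    "OrganId": "anatEntityId",
    "OrganName": "anatEntityName",
    "DevelopmentStage": "devStageId",
    "DevelopmentStageName": "devStageName",
}
# Fields removed entirely (assumption based on the comment above).
DROPPED = {"AnalysisType"}


def to_bgee_names(record: dict) -> dict:
    """Return a copy of the record with renamed keys and dropped fields."""
    return {
        RENAMES.get(key, key): value
        for key, value in record.items()
        if key not in DROPPED
    }


old_record = {
    "AnalysisType": "",
    "OrganId": "XAO:0000305",
    "OrganName": "cranial placode",
    "Annotated": 5,
    "Significant": 3,
    "Expected": 0.01,
    "foldEnrichment": 300.0,
    "p": 1.04e-07,
    "fdr": 7.63e-07,
    "DevelopmentStage": "",
    "DevelopmentStageName": "",
    "ExpressionType": "presence",
}
new_record = to_bgee_names(old_record)
print(sorted(new_record))
```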

In GitLab by @fbastian on Aug 31, 2015, 14:19

I generated an output TSV file with many more terms (2,300), to test the performance of generating/sorting/searching results client-side (e.g., with data-table): http://devbgee.unil.ch/bgee/TopOBOFiles/results/topOBOResult_8bc083c9ad40e2aa2e02e681801870906d1b41ec.tsv

Note that we should never have so many results (we could easily limit to, e.g., 500 terms per analysis, or less).
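
One possible way to apply such a limit, sketched in Python: group the records per analysis (an expression type for a specific developmental stage, as defined earlier in this thread), sort each group, and keep the first N terms. Sorting by ascending p-value and the field names used here are assumptions; the sample records are made up.

```python
from collections import defaultdict


def top_terms_per_analysis(records, limit=500):
    """Keep at most `limit` terms per (expression type, dev stage), lowest p first."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["ExpressionType"], record["devStageId"])].append(record)
    return {
        analysis: sorted(rows, key=lambda row: row["p"])[:limit]
        for analysis, rows in groups.items()
    }


# Made-up records: two analyses, three terms in the first one.
sample = [
    {"anatEntityId": "A", "p": 0.03, "ExpressionType": "presence", "devStageId": "S1"},
    {"anatEntityId": "B", "p": 0.001, "ExpressionType": "presence", "devStageId": "S1"},
    {"anatEntityId": "C", "p": 0.2, "ExpressionType": "presence", "devStageId": "S1"},
    {"anatEntityId": "D", "p": 0.5, "ExpressionType": "presence", "devStageId": "S2"},
]
limited = top_terms_per_analysis(sample, limit=2)
for analysis, rows in limited.items():
    print(analysis, [row["anatEntityId"] for row in rows])
```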

You can manipulate the results at the URL: http://devbgee.unil.ch/bgee/bgee?page=top_anat&data=9c067119742baa038c83c648b588c19f6450ed1e

(These data will certainly be removed at some point in the future, as they were generated with nonsense parameters, to get lots of terms)
