Page gene #79

Closed by fbastian 4 years ago

In GitLab by @marcrr on Oct 15, 2015, 12:07

One page per gene presenting synthetic information on expression etc, as well as gene-specific download files, and external links.

In GitLab by @marcrr on Oct 15, 2015, 17:10

Discussion of 15.10:

provide top tissues, ordered by mean rank
normalize mean rank per data type by Nmax/Ndata, Ndata = number of genes rankable with this data type
1st download file per gene, integrating the information from P/A and diff expression of all data types
simple search engine
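
A minimal sketch of the Nmax/Ndata normalization from the second point. It assumes Nmax is the largest number of rankable genes over all data types, which the notes above do not spell out; the function name and the gene counts are made up for illustration.

```python
def normalize_mean_rank(mean_rank, n_data, n_max):
    """Scale a mean rank from a data type ranking n_data genes onto the n_max scale."""
    if n_data <= 0:
        raise ValueError("n_data must be positive")
    return mean_rank * n_max / n_data


# Hypothetical counts of rankable genes per data type (not real Bgee numbers).
rankable_genes = {"affymetrix": 12000, "rnaSeq": 20000, "est": 5000}

# Assumption: Nmax is the maximum of Ndata over all data types.
n_max = max(rankable_genes.values())

# A gene with mean rank 600 among 12000 Affymetrix genes, rescaled to the 20000-gene scale.
affy_norm = normalize_mean_rank(600, rankable_genes["affymetrix"], n_max)
print(affy_norm)
```

With this scaling, a gene in the top 5% of a small data type ends up at the same normalized rank as a gene in the top 5% of the largest one.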

later we will also do:

2nd download file per gene, integrating all RPKMs, microarray levels, etc
tissue-specificity
box-plot
evolutionary conservation
maybe top tissues by graph

Attached: photo of white board of discussion.

In GitLab by @fbastian on Oct 16, 2015, 24:34

Need more information, @marcrr @pmoret (I've made some edits to my original comment)

Ranks and technical replicates: it doesn't seem fair to me to have 10 technical replicates pushing the mean rank of genes in a direction. Shouldn't we compute a mean rank over all technical replicates from an experiment in a condition, then use this rank as if it came from only one chip/library?
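
To illustrate the concern, here is a rough sketch of collapsing replicates to one mean rank per experiment and condition before averaging over a condition. The tuple layout, the identifiers and the rank values are hypothetical, not the actual Bgee schema.

```python
from collections import defaultdict
from statistics import mean


def collapse_replicates(sample_ranks):
    """Average a gene's ranks over samples sharing an (experiment, condition) pair.

    sample_ranks: iterable of (experiment_id, condition_id, rank) tuples.
    Returns {(experiment_id, condition_id): mean rank}.
    """
    grouped = defaultdict(list)
    for experiment_id, condition_id, rank in sample_ranks:
        grouped[(experiment_id, condition_id)].append(rank)
    return {key: mean(ranks) for key, ranks in grouped.items()}


# Hypothetical data: experiment E1 has 3 technical replicates in "brain-adult",
# experiment E2 has a single chip in the same condition.
samples = [
    ("E1", "brain-adult", 10),
    ("E1", "brain-adult", 12),
    ("E1", "brain-adult", 14),
    ("E2", "brain-adult", 100),
]

per_experiment = collapse_replicates(samples)
# Each experiment now counts once in the condition's mean rank.
condition_mean_rank = mean(per_experiment.values())
# For comparison: every sample counts once, so E1 weighs three times as much as E2.
naive_mean_rank = mean(rank for _, _, rank in samples)
```

In this toy case the naive mean is pulled toward E1 simply because E1 has more replicates, which is the effect described above.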

Shouldn't we also "normalize" ranks, inside a given data type, between conditions/chips/libraries? E.g., in a same condition, or different conditions, we could have different Affymetrix chip types used, with different number of genes represented. So I guess we should always "normalize", right?

Important for the following points: Actually, I think that we can easily pre-compute the mean ranks. We simply need to add one column per data type in the expression table, e.g., affyMeanRank, rnaSeqMeanRank, etc. But keep in mind that the expression table stores information per condition (= organ-stage), not per organ.
- There is no distinction to be made based on quality here I guess, right @marcrr? Otherwise, on what are we going to rank? Only on expressed genes / expressed high quality genes? I guess not, we are going to always rank on all genes studied in a condition.
- Implementation detail: a great advantage of pre-computing the ranks and storing them in the expression table is that it won't require new DAOs / new Services on the application side. And the rank for each data type could be stored by CallData objects, we made a great design :p
- Implementation detail: one possible implementation is to have a new table, with one column per data type, where each row would be a condition, to store total number of genes studied by each data type in this condition (for "normalizing" the ranks between data types). This would allow more flexibility than storing the "normalized" ranks.
- Or, we could store directly the "normalized" ranks in the expression table, and add a column to the tables affymetrixProbeset, rnaSeqResult, etc, to store the rank of each gene in a given chip/library. It would then be really easy to re-compute the "real" mean rank of a gene in a condition. But, if we want to display this information of "real" mean rank, then it doesn't make sense performance-wise to always re-compute this. Do we want to display this information of mean rank per data type?

We haven't discussed developmental stages. Should we make the ranking per condition (= organ-stage, maybe soon organ-stage-sex) rather than simply per organ? I would say yes. That would fit nicely with the use of the expression table as mentioned in previous point, where each row is a condition.
- Maybe we can first rank organs by taking the mean of the mean ranks of the conditions containing this organ (e.g., mean of the mean ranks of brain-adult + brain-CarnegieStage13 + brain-CarnegieStage18; again, this fits nicely with the use of the expression table). Then we could display the stages for each organ somehow.
- Or should we have more or less a table per broad stage? E.g., rank organs in "embryo", and in "pupal", "larval", "nursing", "juvenile", "prime adult stage", "late adult stage" (when applicable, of course)
- Other visualization ideas?
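
The first option (mean of the mean ranks of the conditions containing an organ) could look roughly like the sketch below. It assumes that a lower rank means higher expression, so organs are sorted in ascending order; the function name and the rank values are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def rank_organs(condition_mean_ranks):
    """Order organs by the mean of the mean ranks of their conditions.

    condition_mean_ranks: {(organ, stage): mean rank of the gene in that condition}.
    Returns a list of (organ, organ_score, stages), sorted by organ_score ascending.
    """
    by_organ = defaultdict(dict)
    for (organ, stage), rank in condition_mean_ranks.items():
        by_organ[organ][stage] = rank
    scored = [
        (organ, mean(stages.values()), sorted(stages))
        for organ, stages in by_organ.items()
    ]
    # Assumption: a lower mean rank means higher expression, so it is listed first.
    return sorted(scored, key=lambda item: item[1])


# Hypothetical mean ranks of one gene, one entry per condition (= organ-stage).
ranks = {
    ("brain", "adult"): 100.0,
    ("brain", "CarnegieStage13"): 300.0,
    ("brain", "CarnegieStage18"): 200.0,
    ("liver", "adult"): 150.0,
}
ordered = rank_organs(ranks)
```

Each entry keeps the list of stages seen for the organ, so they can still be displayed next to it.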

Should we always display conditions with over-expression first? We hope that the mean rank for such conditions will be high, but that might not always be the case, e.g., in case of contradiction "over-expression" vs. numerous "no diff. expression" calls.
- Well, unless we display first only conditions with never-contradicted over-expression calls, even by a "no diff expression" call.
- What about conditions with under-expression calls? Display last? Don't bother?
- Ambiguous over-expression calls? Ambiguous under-expression calls? Ambiguous "over vs. under we have no idea"?

What should we do with no-expression calls? Don't display on interface, only in download file? In separate table(s)?

How do we deal with ranks of genes with identical expression values? Same rank? Averaging their rank?
So, for in situ data, we are going to have all genes with a rank of 1, the only difference between conditions arising from the "normalization"?
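
For reference, a small sketch of the "averaging their rank" option (fractional ranking), written from scratch rather than taken from any existing pipeline code. It assumes that the highest expression value gets rank 1.

```python
def average_ranks(values, descending=True):
    """Return the rank of each value, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=descending)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Mean of the 1-based positions start+1 .. end+1 covered by the tied block.
        shared = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


# Two genes tie for positions 2 and 3, so both get rank 2.5.
tied = average_ranks([50, 20, 20, 5])
# All values identical, as with presence-only data: every gene shares one rank.
all_equal = average_ranks([7, 7, 7])
```

Note that with this tie rule, n genes with identical values all get (n + 1) / 2, whereas a "same rank" rule based on the minimum position would give all of them 1.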

It should be made clear that what we are going to display are the expression calls. The mean rank is simply a way of ordering by what we hope to be "biological significance". It is not the main info to be displayed. Agreed?

In GitLab by @marcrr on Oct 19, 2015, 17:36

Ranks and technical replicates:

In my opinion average all replicates, technical or biological, before computing rank

Shouldn't we also "normalize" ranks, inside a given data type, between conditions/chips/libraries?

I would say between array types certainly yes.

Afterwards it gets fuzzy: different rnaseq coverages?

There is no distinction to be made based on quality here I guess

I agree no distinction here on quality

Do we want to display this information of mean rank per data type?

"real rank" you mean before normalization? We may want to display this one day, but not yet, in my opinion.

Should we make the ranking per condition (= organ-stage, maybe soon organ-stage-sex) rather than simply per organ?

I agree in principle.

Or should we have more or less a table per broad stage?

makes sense for me

Should we always display conditions with over-expression first?

In my opinion over/under expression should not be taken into account here, and we should eventually also visualize top/all over- and under-expression on the gene page.

What should we do with no-expression calls?

in download file only

How do we deal with ranks of genes with identical expression values?

should be taken care of by R function of ranks = average rank

So, for in situ data, we are going to have all genes with a rank of 1

yes

it should be made clear that what we are going to display are the expression calls. The mean rank is simply a way of ordering by what we hope to be "biological significance". It is not the main info to be displayed. Agreed?

agreed (Edited on ipad, sorry for formatting)

In GitLab by @fbastian on Oct 21, 2015, 15:14

In my opinion average all replicates, technical or biological, before computing rank

By "replicates", you mean, all samples from a same experiment in a same condition? Or all samples in a given condition? I would agree in the former case, it makes sense and it's easy to do.

I would say between array types certainly yes. Afterwards it gets fuzzy: different rnaseq coverages?

Yes, like RNA-Seq libraries targeting sRNAs. But OK, for now, let's consider that EST and RNA-Seq libraries always have access to the complete genome.

TODO: store in database the information about libraries targeting special types of RNAs, and "normalize" ranks only for those libraries.

should be taken care of by R function of ranks = average rank

We use perl :p

In GitLab by @marcrr on Oct 26, 2015, 11:17

By "replicates", you mean, all samples from a same experiment in a same condition?

yes.

In GitLab by @marcrr on Jan 13, 2016, 15:09

following discussion of today:

re-rank in situs
filtering of non informative anatomical terms
store and display quantiles per tissue-stage

In GitLab by @fbastian on Mar 1, 2016, 02:06

I think the computation of globalMeanRank in org.bgee.model.dao.mysql.expressiondata.MySQLExpressionCallDAO#generateSelectClause should be slightly rewritten: the denominator should be based on whether data exists for the gene, rather than whether a max rank exists for a condition.

E.g., current denominator is written: .../ (if (expression.estMaxRank is null, 0,expression.estMaxRank )+.... It should be rewritten to something like: .../ (if (expression.estData = 'no data', 0,expression.estMaxRank )+....

The same should apply to the numerator for consistency, e.g.: ...if (estData = 'no data', 0,expression.estMeanRankNorm * expression.estMaxRank)+
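
Outside of SQL, the intended logic can be sketched as a weighted mean in which a data type contributes to both numerator and denominator only when the gene actually has data for it. This is an illustration of the idea, not the code in MySQLExpressionCallDAO; the tuple layout, the "high quality" state and the numbers are hypothetical, and the extra None checks are an added safeguard.

```python
NO_DATA = "no data"


def global_mean_rank(per_type):
    """Weighted mean of normalized mean ranks, weighted by each data type's max rank.

    per_type: {data_type: (data_state, mean_rank_norm, max_rank)}.
    A data type contributes only if the gene has data for it.
    """
    numerator = 0.0
    denominator = 0.0
    for data_state, mean_rank_norm, max_rank in per_type.values():
        # Skip on the gene's data state, not on whether the condition has a max rank.
        if data_state == NO_DATA or mean_rank_norm is None or max_rank is None:
            continue
        numerator += mean_rank_norm * max_rank
        denominator += max_rank
    return numerator / denominator if denominator else None


# Hypothetical gene: the condition has an EST max rank, but this gene has no EST data.
gene_in_condition = {
    "affymetrix": ("high quality", 1000.0, 20000.0),
    "rnaSeq": ("high quality", 2000.0, 30000.0),
    "est": (NO_DATA, None, 5000.0),
}
rank = global_mean_rank(gene_in_condition)
```

If the denominator were driven by the max rank alone, the EST max rank of 5000 would be added to it even though EST contributes nothing to the numerator, which lowers the result artificially.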

In GitLab by @fbastian on Mar 1, 2016, 02:24

And bonus point if you manage to get the quantile information to be displayed :p
