Traits are assumed to be log-scaled

zsloan commented 9 years ago

From @lomereiter on May 6, 2015 12:10

E.g. the page for the trait with id 10001 (where the unit is mg/dl) shows inadequate range statistics

[image: trait10001]

Copied from original issue: zsloan/genenetwork2#19

zsloan commented 9 years ago

Yeah, our Range statistic not being calculated properly is an issue we're aware of.

On Wed, May 6, 2015 at 7:10 AM, Artem Tarasov notifications@github.com wrote:

E.g. page for the trait with id 10001 (where the unit is dg/ml) shows inadequate range statistics

[image: trait10001] https://cloud.githubusercontent.com/assets/662555/7492577/ee125ab4-f401-11e4-88ff-b9d62929e62f.png


lomereiter commented 9 years ago

I see that there is some handling in GN1, added by @lyan6 in commit https://github.com/genenetwork/genenetwork/commit/df142e1cafcafe500aa535ceabb7a56b17c8fac9. That uses a new DataScale column, so a test database should be created first (https://github.com/genenetwork/genenetwork2/issues/32)

lomereiter commented 9 years ago

I'm now working on it, and the GN1 code implies that DataScale can be one of the following: z_score, linear_positive, linear, or log2. However, querying the database reveals that there's no such thing as linear_positive, but there is also a log2_ratio scale. What does the latter mean?

lomereiter commented 9 years ago

Also, why are range statistics in GN1 shown only for mRNA data and not for the phenotypes? Is that intended or a bug?

robwwilliams commented 9 years ago

Artem and team: log2 ratio is a mistake I made. Should probably just be log_ratio. I was trying to find a name for the scale used in several data sets produced at UCLA and by Merck.

http://bioinformatics.oxfordjournals.org/content/19/8/956.long

What they do is compute a log ratio of the expression in tissue X (e.g., liver) to expression in a pool of multiple tissues (which may even include liver).

Lei, Artem: I believe that any expression data set (mRNA data set) using a microarray that has values with a mean of around 0 will be logRatio data. Examples include the following Mouse groups: BDF2 UCLA; BH/HB F2 UCLA; BHF2 (ApoeNull) UCLA; BXD (Liver mRNA: UNC Agilent G64121A Liver ...); CastB6/B6Cast F2 UCLA.
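
To make the log-ratio scale described above concrete, here is a minimal sketch. The function name `log_ratio`, the default base, and the expression values are all made up for illustration; the thread does not say which logarithm base these data sets use. It shows why such data tend to centre around 0: a probe expressed at the same level in the tissue and in the pool gets a value of exactly 0.

```python
import math

def log_ratio(tissue_expr, pool_expr, base=2.0):
    """Log of expression in one tissue relative to a reference pool.

    The default base is an assumption, not something stated in this thread.
    """
    return math.log(tissue_expr / pool_expr, base)

# Hypothetical (tissue, pool) expression values for three probes:
# one up-regulated two-fold, one unchanged, one down-regulated two-fold.
pairs = [(200.0, 100.0), (100.0, 100.0), (50.0, 100.0)]
ratios = [log_ratio(t, p) for t, p in pairs]

# Up- and down-regulation are symmetric on this scale, so the mean is close to 0.
mean_ratio = sum(ratios) / len(ratios)
```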

A data set annotator or the data entry person (me or Arthur) should be responsible for making sure that these scales are defined upon data entry and in our metadata files. The programmers should just think about how to give us an interface to enter this information into the correct database table.

ARTEM: Most data in the Phenotype databases is linear_positive. For example, weight data. Some phenotypes are residuals and will simply be linear. Perhaps we need to think more deeply about the scale issue, but the key thing is to have an interface that allows the data entry person, me, or Arthur to specify the scale of a phenotype or large data set.

robwwilliams commented 9 years ago

Artem:

About the lack of statistics for phenotypes. This is an "intentional" bug due to poorly structured GN1 code that currently assumes that all data are on a log2 scale. Lei Yan intended to fix this, but it has not happened yet. The intent is to use the scale information in the database to allow the statistics code to select the appropriate procedures as a function of scale type. Here is the stupidly wrong way that GN1 handles non-logged data (this is for the BXD Fecal Metabolome data).

[image: screen shot 2015-06-24 at 4 53 33 pm]
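
The idea of selecting the statistics as a function of scale type could look roughly like the sketch below. This is not the GN1 or GN2 code: `range_stats` is a hypothetical helper, the scale names are the ones mentioned in this thread, and the choice of which statistics to report for each scale is an assumption.

```python
def range_stats(values, scale):
    """Return range statistics appropriate for the given scale (hypothetical helper)."""
    lo, hi = min(values), max(values)
    stats = {"range": hi - lo}
    if scale == "log2":
        # A difference on the log2 scale is a fold change on the raw scale.
        stats["range_fold"] = 2 ** (hi - lo)
    elif scale in ("linear", "linear_positive"):
        # A fold range only makes sense when every value is positive.
        if lo > 0:
            stats["range_fold"] = hi / lo
    elif scale not in ("z_score", "log_ratio"):
        # For z_score and log_ratio only the plain range is reported (an assumption).
        raise ValueError(f"unknown scale: {scale!r}")
    return stats
```

With this shape, a trait measured in linear units never goes through the log2 branch, and a linear trait with negative values simply has no fold entry.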

lomereiter commented 9 years ago

OK, here's my attempt at fixing this: https://gist.github.com/lomereiter/cd82e9d31cbde415d08d (until #95 is sorted out, I can't make PRs so meanwhile I'll resort to patches)

Ranges are shown only for ProbeSet data because metadata (the DataScale column) is currently available only for them.
For log2 data, I decided to just show the interquartile range according to its definition, 2^q3 - 2^q1; the definition in the code now is 2^(q3 - q1), i.e. the ratio of the quartiles.
I think it should be either q3 - q1 (interquartile range on the log2 scale) or 2^q3 - 2^q1 (on the usual scale); the ratio is counterintuitive.
Example (view on GN2):
For linear data, there are fewer nuances. Am I correct in understanding that if there are values of both signs, the Range (fold) row should be dropped?
Example (GN2):
Final question: are we going to add boxplots? I've seen traces of them in the codebase, such as this file but they are not among the plots.
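
For reference, the three candidate definitions for log2 data discussed above can be compared on a small example. This is only an illustration, not the code from the gist: the sample values are made up, and the quartile method (Python's `statistics.quantiles` with `method="inclusive"`) is an assumption, since other quartile conventions give slightly different q1 and q3.

```python
import statistics

def quartiles(values):
    """Return (q1, q3); the 'inclusive' interpolation method is an assumption."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3

# Hypothetical log2-scaled expression values.
log2_values = [6.0, 7.0, 8.0, 9.0, 10.0]
q1, q3 = quartiles(log2_values)

iqr_log2 = q3 - q1             # interquartile range on the log2 scale
iqr_raw = 2 ** q3 - 2 ** q1    # interquartile range on the usual (raw) scale
iqr_ratio = 2 ** (q3 - q1)     # ratio of the quartiles, i.e. 2^q3 / 2^q1
```

The first two are ranges in the usual sense (a difference between quartiles), while the third is a fold change between the quartiles.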

zsloan commented 8 years ago

This issue seems to be fixed now.

genenetwork / genenetwork2

Traits are assumed to be log-scaled #2