PavlidisLab / Gemma

Genomics data re-analysis

Prioritize Map-able probes in "Visualize Expression" tab #90

Closed: nzllim closed this issue 2 days ago

nzllim commented 4 years ago

Unmapped probes tend to be over-represented on microarray platforms, where many probes cannot be mapped to known genes; as a result, the heatmap shows data for unmapped probes most (if not all) of the time. I propose a minor change to the "randomization scheme" so that at least 15 of the displayed probes are actually mapped to genes, with the remainder being unmapped probes (useful as a control). If there aren't enough mappable probes, the shortfall can be filled with unmapped probes (this latter option can be omitted if the coding becomes unnecessarily tedious).

ppavlidis commented 1 year ago

A suggestion was made in #593 to only show genes that are expressed when we visualize randomly-selected genes.

I don't know if that is desirable, because it would create more overhead to choose them, and also the point is to show what the data actually looks like at a glance. So biasing it towards highly expressed genes could be misleading.

I am in agreement that it would be better to show a truly random selection, if that can be done without slowing the query down very much.

The main benefit of the current approach is that it is very fast: we pick a random set of consecutive genes as they appear in primary key order. Having to click a few times to see something "interesting" is the cost.
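
For concreteness, a hypothetical sketch of that fast path; the table name and exact query are stand-ins, and only the consecutive-by-primary-key idea is from the comment above:

```java
import java.util.Random;

class ConsecutivePkPick {
    // Sketch of the current fast path: choose a random starting point and
    // take n vectors that are consecutive in primary-key order.
    // minId/maxId would come from a cheap SELECT MIN(ID), MAX(ID) query.
    static String query(long minId, long maxId, int n, Random rng) {
        long start = minId + (long) (rng.nextDouble() * (maxId - minId));
        // Hypothetical SQL; the real table name differs.
        return "SELECT * FROM DATA_VECTOR WHERE ID >= " + start
                + " ORDER BY ID LIMIT " + n;
    }
}
```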

To make it random, we'd have to avoid retrieving/querying the entire data set just to show 20 genes' worth of data.

This just isn't a very high priority right now because there is a viable workaround: click the button a few times, or query for genes of interest.

arteymix commented 1 year ago

We can leverage the stored rank information (rank by mean and rank by max) to pick a representative set efficiently: uniformly sample percentiles and retrieve the genes closest to them.

arteymix commented 1 year ago

In this specific case, we would draw 20 values from a uniform distribution on [0, 1] and pick the 20 genes whose ranks are closest to those values. The question then is whether the max or the mean rank is more appropriate.
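
A minimal sketch of the idea, with hypothetical names and a naive linear scan just to make the procedure concrete (an efficient lookup is discussed further down):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical holder for a gene/vector ID and its stored expression rank.
record RankedGene(long id, double rank) {}

class RandomPercentilePick {
    // Draw n values from U[0,1]; for each, pick the gene whose rank is closest.
    static List<RankedGene> pick(List<RankedGene> genes, int n, Random rng) {
        List<RankedGene> chosen = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            double u = rng.nextDouble(); // uniform draw on [0, 1]
            RankedGene best = null;
            for (RankedGene g : genes) { // naive scan; fine for a sketch
                if (best == null
                        || Math.abs(g.rank() - u) < Math.abs(best.rank() - u)) {
                    best = g;
                }
            }
            chosen.add(best);
        }
        return chosen;
    }
}
```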

arteymix commented 1 year ago

I'm scheduling this for the 1.31 series. I really want to wrap up the 1.30 series for GemBrow ASAP.

ppavlidis commented 1 year ago

This is definitely not a priority, as I've mentioned; that's why it has been sitting here for so long.

I was also thinking we might be able to use the ranks, and which metric is used doesn't matter if you are randomly sampling. But... as I recall, the rank is stored as a float, so you can't index it. So we'd have to do (sub)table scans to find the right values. We might as well retrieve all the PKs of the data vectors for the experiment and pick randomly, or maybe there's some SQL trick I'm missing. My concern again is about speed, but maybe it's fine.

arteymix commented 1 year ago

You're right. With a single value you would just order by the difference and pick the minimum, but for multiple values there's no clean solution. It would be more efficient to fetch all the EV IDs with their corresponding ranks (sorted) and then do a few binary searches to get the closest entries.
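
A sketch of that refinement, assuming the (ID, rank) pairs have already been fetched, with ranks[] sorted ascending and ids[] ordered to match (all names hypothetical):

```java
import java.util.Arrays;

class NearestRankLookup {
    // One binary search per uniform draw, instead of a scan over all entries.
    static long closestId(long[] ids, double[] ranks, double target) {
        int pos = Arrays.binarySearch(ranks, target);
        if (pos >= 0) return ids[pos];          // exact match
        int ins = -pos - 1;                     // insertion point
        if (ins == 0) return ids[0];            // below the smallest rank
        if (ins == ranks.length) return ids[ids.length - 1]; // above the largest
        // otherwise pick whichever neighbour of the insertion point is closer
        return (target - ranks[ins - 1] <= ranks[ins] - target)
                ? ids[ins - 1] : ids[ins];
    }
}
```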

ppavlidis commented 1 year ago

Once you've retrieved all the vector IDs, there's no point in trying to pick particular expression levels; you'd just pick 20 random IDs from that pool.

If that's as fast as the current method (or near enough - bearing in mind one still might want to click the 'random' button more than once) - great.

If we actually want a constraint on expression levels, you'd have to add AND RANK > 0.5 (or whatever) to the query, but I recommend against that; it would slow things down even more and bias the results in a way that I don't think is desirable for this use case.
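
A minimal sketch of that simpler approach, with hypothetical names; the trailing comment marks where the rank constraint advised against above would go:

```java
import java.util.Collections;
import java.util.List;
import java.util.Random;

class RandomIdPick {
    // Uniformly pick n IDs out of the pool already retrieved for the experiment.
    static List<Long> pick(List<Long> vectorIds, int n, Random rng) {
        Collections.shuffle(vectorIds, rng); // uniform random permutation
        return vectorIds.subList(0, Math.min(n, vectorIds.size()));
    }
}
// An expression-level constraint would instead be applied when building the
// pool, e.g. "... AND RANK > 0.5" in the query that fetches the IDs, at the
// cost of extra time and the bias argued against above.
```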

sanjarogic commented 1 year ago

One more thing to add here, which I don't think was mentioned before: there aren't always 20 genes displayed in the heatmap. Sometimes it's zero, as I showed in #593, and sometimes it's some random number below 20, as shown below:

[Screenshot, 2023-03-10: heatmap displaying fewer than 20 genes]

arteymix commented 1 year ago

@sanjarogic This is just an artifact of data masking when selecting probes randomly from a platform: since we sample from the masked vectors, probes that are not expressed are never picked, so fewer than 20 genes can end up displayed.

ppavlidis commented 9 months ago

I came across the ORDER BY RAND() LIMIT n syntax in MySQL (and I believe this is supported in HQL), which, if it's still fast, might be a shade better than the current selection of consecutive probes.

Again, it's important that this be very fast so that clicking Visualize returns an immediate result.
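
For illustration, a plain-JDBC sketch of that query; the table and column names here are hypothetical stand-ins, not Gemma's actual schema:

```java
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

class OrderByRandExample {
    static List<Long> randomVectorIds(Connection conn, long experimentId, int n)
            throws SQLException {
        String sql = "SELECT ID FROM PROCESSED_EXPRESSION_DATA_VECTOR "
                + "WHERE EXPRESSION_EXPERIMENT_FK = ? "
                + "ORDER BY RAND() LIMIT ?"; // MySQL shuffles, then takes n rows
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, experimentId);
            ps.setInt(2, n);
            try (ResultSet rs = ps.executeQuery()) {
                List<Long> ids = new ArrayList<>(n);
                while (rs.next()) ids.add(rs.getLong(1));
                return ids;
            }
        }
    }
}
```

One caveat: ORDER BY RAND() still makes MySQL generate a random value for, and sort, every candidate row, so whether it stays fast depends on how many vectors an experiment has.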

arteymix commented 9 months ago

If ORDER BY RAND() turns out not to be supported in HQL, I know how to map custom functions and generate dialect-specific SQL.

ppavlidis commented 2 months ago

Because it came up again to improve this, I did some more investigation and have some fixes.

  1. One reason you sometimes don't get the expected number of "random" vectors (the default is 20) is that the data for a particular randomly-selected gene has no variance (typically this means it was all zeros, which primarily happens with RNA-seq data). The genome references contain ~25k+ genes of dubious reality (in addition to the ~20k well-documented protein-coding genes), and those tend not to show any expression. We will fix this so that constant genes are still displayed and even data that is all NAs is still shown (basecode: 49fe093ec2749acca3c6567b890e85eb9db930c6). Item 3 below should also reduce how often this happens.

  2. To make the selection of data more "random", I am experimenting with doing multiple queries. This slows things down, though perhaps not noticeably. It seems it might not be necessary anyway, due to item 4.

  3. As mentioned above, we can limit the selection to data with relatively high expression levels by using the recorded expression level rank. This has a big effect on reducing the amount of garbage (non-expressed data and non-genes) in the output, but it also introduces a bias; the scare quotes around "random" in the UI are not a joke. I'm experimenting with a threshold of 0.5, predicated on the idea that 1) for RNA-seq at least, half the data is probably mostly zeros, and 2) for microarrays, we typically consider the bottom 1/3 of the data to be unexpressed, though it is harder to draw a line there. 0.5 seems reasonable. The performance hit isn't very noticeable.

  4. I also added ORDER BY RAND() to the HQL and it seems to work. HQL doesn't support this syntax, despite what I said earlier, but it doesn't complain either, and it may be getting passed through (it's annoying to check the actual generated SQL). A sketch combining this with item 3 appears after this list.
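
A hedged sketch of what items 3 and 4 might look like together, using the generic Hibernate Session API; the entity name, property names, and the pass-through of order by rand() are assumptions, not Gemma's confirmed code:

```java
import java.util.List;
import org.hibernate.Session;

class RandomVectorQuery {
    // Hypothetical HQL combining the rank threshold (item 3) with
    // ORDER BY RAND() (item 4); setMaxResults supplies the LIMIT.
    @SuppressWarnings("unchecked")
    static List<Object> randomHighRankVectors(Session session, long eeId, int n) {
        return session.createQuery(
                "select v from ProcessedExpressionDataVector v"
                        + " where v.expressionExperiment.id = :eeId"
                        + " and v.rankByMean > 0.5"  // item 3: drop the low half
                        + " order by rand()")        // item 4: may pass through to MySQL
                .setParameter("eeId", eeId)
                .setMaxResults(n)
                .list();
    }
}
```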

Consider it a work in progress, we'll see how it goes.

About the original suggestion: limiting to protein-coding genes or something like that is harder, because at the stage where we are querying we don't know the genes, and fetching that information would slow things down even more. Instead, we get the data and fill in the gene information afterwards. But in my tests, the changes above seem like they will improve things quite a bit.

arteymix commented 1 month ago

@ppavlidis has this been resolved in 1.31.3? I heard that Sanja was complaining about the speed.

ppavlidis commented 1 month ago

It's resolved, but apparently it doesn't behave right if there is no data for the data set; I have to look at it.

I don't think it's possible to satisfy the requirement to prioritize certain types of genes without making it slower, but in my tests the difference wasn't too noticeable.