Open jaclyn-taroni opened 6 years ago
Do we want to limit to only datasets under a certain number of samples? Otherwise we could be spending a substantial amount of time doing analyses that people don't end up using. We need to benchmark this with datasets from 10 -> 1m samples or we need to put a limit on it somewhere.
That's a good point @cgreene -- I don't have that much insight into ComplexHeatmap's performance.
If someone ran a benchmark on datasets with 10, 100, and 1000 samples with the 1k highest variance genes and recorded the times over 10 repeats that would be informative. I like the idea but as I've thought about it this worries me a little bit.
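A benchmark along those lines could be structured roughly as follows. This is only a sketch in Python using the standard library, not the R code discussed in this thread: `make_fake_matrix`, `benchmark`, and `stand_in_task` are hypothetical names, and the timed task is a placeholder that would be replaced by the actual heatmap call.

```python
import random
import statistics
import time


def make_fake_matrix(n_genes, n_samples, seed=0):
    """Build a genes x samples matrix of random values (fake expression data)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_samples)] for _ in range(n_genes)]


def benchmark(task, sample_sizes, n_genes=1000, repeats=10):
    """Return {sample_size: mean seconds} for task(matrix) over `repeats` runs."""
    results = {}
    for n_samples in sample_sizes:
        matrix = make_fake_matrix(n_genes, n_samples)
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            task(matrix)
            timings.append(time.perf_counter() - start)
        results[n_samples] = statistics.mean(timings)
    return results


def stand_in_task(matrix):
    """Placeholder workload (per-gene variance); swap in the real heatmap call."""
    out = []
    for row in matrix:
        mean = sum(row) / len(row)
        out.append(sum((x - mean) ** 2 for x in row) / len(row))
    return out


if __name__ == "__main__":
    # Smaller n_genes than the proposed 1k, just to keep the demo quick.
    for size, secs in benchmark(stand_in_task, [10, 100, 1000], n_genes=100).items():
        print(f"{size:>5} samples: {secs:.4f} s")
```

Generating the matrix outside the timed region keeps data setup out of the measurement, so only the task itself is timed.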
@cansav09 is going to take a look at the benchmark as laid out above
Edit: Sorry, didn't realize the pngs weren't correctly printing out. Now that they are, the time is substantially slower than my first edit.
I ran ComplexHeatmap with various sample sizes. I used fake data with 10,000 rows, from which I randomly extracted 10, 100, or 1000 samples. For each dataset, I had it make a heatmap of the top 1k highest-variance "genes".
Here is the average number of seconds for each sample size:

| Sample size | Mean time (secs) |
| --- | --- |
| 10 | 1.047840 |
| 100 | 2.012801 |
| 1000 | 15.666819 |
Here's the script, so you can see exactly what I timed: https://gist.github.com/cansav09/39fa5025b7190b828b23ae3dd71433f7
For the YI summit (related: AlexsLemonade/CCDL#3), we should have examples for our usability evaluations. @cansav09 - can you snag GSE39842 from refine.bio and make a heatmap of the 1000 genes with the highest variance? I think we'd want no row names or row dendrogram and definitely colnames (which will be accession numbers) and a column dendrogram like the code above. Posting the PNG on this ticket should be fine.
Well, up to 1k samples doesn't seem to be a compute time problem. 😆
I'm also concerned about parameters like the presence or absence of the column names. If you have a large enough sample size, displaying the column names probably becomes unhelpful.
Let me know if you would like anything adjusted:
Looks good @cansav09 — the only tweak I would make is the annotation bar. In any download we provide it would be difficult for us to automatically determine what labels are most useful for visualization, so I don’t think we’d include an annotation bar.
Ah. Right. Noted. Here's a revised one: @jaclyn-taroni
Thanks! I’ll snag that experiment from production and add this heatmap to the download folder so we can get some feedback.
Did you get the required feedback from YI so that we can start trying to add this into the production pipeline?
cc @dvenprasad
The idea of a heatmap generally went over well with people but most of them mentioned wanting to see it in terms of genes instead.
I will say that the number of people I spoke to about this was rather low. So there are two things we can do: 1) Add the heatmap as is and test it in the next round, or 2) Hold off on implementing it until the next round of testing, where I can show this as a sample download folder.
I'm leaning towards option 2.
Also, I'm curious as to what the heatmap will look like when the data is aggregated by species.
Context
Getting a high-level view of the data values and structure as part of the download came up in the tech team meeting, specifically using a heatmap to illustrate this.
Problem or idea
We can include a heatmap for each gene expression matrix. I think we'd first filter to genes with high variance so we're not showing all genes in a gene expression matrix.
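The filtering step could look something like the sketch below. It is written in Python with the standard library only, as an illustration rather than the pipeline's actual code: `top_variance_genes` and its parameters are hypothetical names, and the default cutoff of 1000 is an assumption borrowed from the numbers discussed in the comments.

```python
import statistics


def top_variance_genes(matrix, gene_ids, k=1000):
    """Keep the k rows (genes) with the highest sample variance.

    `matrix` is a list of rows, one per gene, each with at least two
    sample values (statistics.variance needs two or more data points).
    Returns (kept_gene_ids, kept_rows) in their original row order.
    """
    variances = [statistics.variance(row) for row in matrix]
    # Rank row indices from highest to lowest variance.
    ranked = sorted(range(len(matrix)), key=lambda i: variances[i], reverse=True)
    # Restore the original order among the rows that are kept.
    keep = sorted(ranked[:k])
    return [gene_ids[i] for i in keep], [matrix[i] for i in keep]
```

If the matrix has fewer than `k` genes, every row is kept, so small experiments pass through unchanged.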
Solution or next step
Add ComplexHeatmap to the smasher Docker image

From @cansav09's example: