AlexsLemonade / refinebio

Refine.bio harmonizes petabytes of publicly available biological data into ready-to-use datasets for cancer researchers and AI/ML scientists.
https://www.refine.bio/

Include PNG of heatmap with each gene expression matrix #716

Open jaclyn-taroni opened 6 years ago

jaclyn-taroni commented 6 years ago

Context

Getting a high-level view of the data values and structure as part of the download came up in the tech team meeting; specifically, using a heatmap to illustrate this.

Problem or idea

We can include a heatmap for each gene expression matrix. I think we'd first filter to genes with high variance so we're not showing all genes in a gene expression matrix.

Solution or next step

From @cansav09's example:

# Attach the library
library(ComplexHeatmap)

# Calculate the variance for each gene
variances <- apply(df, 1, var)

# Determine summary statistics for gene variances
sum.stats.var <- summary(variances)

# Subset the data, keeping only genes whose variances are in the upper quartile
# (the 5th element of summary() is the 3rd quartile)
df.by.var <- df[which(variances > sum.stats.var[5]), ]

# Create the heatmap object (named gene.heatmap to avoid masking stats::heatmap)
gene.heatmap <- Heatmap(as.matrix(df.by.var),
        name = "Gene_Expression",
        show_row_names = FALSE,
        show_row_dend = FALSE,   # Set to TRUE to show the gene/row clustering
        column_dend_height = unit(4, "cm"))

# Open a png file
png("HeatmapGSE12955.png")

# Draw the heatmap explicitly; auto-printing the object by name does not
# happen inside functions, loops, or source()
draw(gene.heatmap)

# Close the png file
dev.off()


Miserlou commented 6 years ago

😎

cgreene commented 6 years ago

Do we want to limit to only datasets under a certain number of samples? Otherwise we could be spending a substantial amount of time doing analyses that people don't end up using. We need to benchmark this with datasets from 10 -> 1m samples or we need to put a limit on it somewhere.

jaclyn-taroni commented 6 years ago

That's a good point @cgreene -- I don't have that much insight into ComplexHeatmap's performance.

cgreene commented 6 years ago

If someone ran a benchmark on datasets with 10, 100, and 1000 samples with the 1k highest variance genes and recorded the times over 10 repeats that would be informative. I like the idea but as I've thought about it this worries me a little bit.
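A minimal sketch of how the benchmark proposed above could be set up, assuming simulated data is acceptable. The function name `time_heatmaps` and all of its arguments are hypothetical, and base R's `heatmap()` is only a stand-in so the sketch runs without ComplexHeatmap installed; a ComplexHeatmap call could be passed as `plot_fun` instead.

```r
# Hypothetical benchmark harness: times heatmap generation (including writing
# the image file) for several sample sizes and returns the mean elapsed seconds.
time_heatmaps <- function(sample_sizes = c(10, 100, 1000),
                          n_genes = 10000,
                          n_top = 1000,
                          repeats = 10,
                          plot_fun = function(m) heatmap(m, labRow = NA),
                          open_device = png) {
  means <- sapply(sample_sizes, function(n_samples) {
    # Simulated expression matrix: rows are "genes", columns are samples
    mat <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)

    # Keep the n_top rows with the highest variance
    variances <- apply(mat, 1, var)
    keep <- order(variances, decreasing = TRUE)[seq_len(min(n_top, n_genes))]
    top <- mat[keep, , drop = FALSE]

    # Time each repeat from opening the device to closing it
    elapsed <- replicate(repeats, {
      out_file <- tempfile(fileext = ".png")
      timing <- system.time({
        open_device(out_file)
        plot_fun(top)
        dev.off()
      })
      unlink(out_file)
      timing[["elapsed"]]
    })
    mean(elapsed)
  })
  setNames(means, sample_sizes)
}
```

Calling `time_heatmaps()` with the defaults would cover the 10, 100, and 1000 sample cases with 10 repeats each. Timing the whole open/draw/close sequence matters because the image is only rendered when it is actually drawn to the device.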

jaclyn-taroni commented 6 years ago

@cansav09 is going to take a look at the benchmark as laid out above

cansavvy commented 6 years ago

Edit: Sorry, didn't realize the pngs weren't correctly printing out. Now that they are, the times are substantially slower than in my first edit.

I ran ComplexHeatmap with various sample sizes. I used fake data of 10,000 rows, from which I randomly extracted 10, 100, or 1000 samples. For each dataset, I had it make a heatmap of the 1k highest-variance "genes".

[Figure: runtime by sample size]

Here is the average number of seconds for each sample size:

sample size | average seconds
10          | 1.047840
100         | 2.012801
1000        | 15.666819

Here's the script https://gist.github.com/cansav09/39fa5025b7190b828b23ae3dd71433f7 so you can see exactly what it was I timed.

jaclyn-taroni commented 6 years ago

For the YI summit (related: AlexsLemonade/CCDL#3), we should have examples for our usability evaluations. @cansav09 - can you snag GSE39842 from refine.bio and make a heatmap of the 1000 genes with the highest variance? I think we'd want no row names or row dendrogram and definitely colnames (which will be accession numbers) and a column dendrogram like the code above. Posting the PNG on this ticket should be fine.
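Selecting the 1000 highest-variance genes, as requested above, could look roughly like the following. This is a sketch, not the code used for the posted heatmap: `top_variance_genes` is a hypothetical helper, and it assumes the expression data is numeric with genes as rows and sample accessions as column names.

```r
# Hypothetical helper: keep the n_top rows (genes) with the highest variance.
top_variance_genes <- function(expr_mat, n_top = 1000) {
  expr_mat <- as.matrix(expr_mat)

  # Per-gene variance across samples, ignoring missing values
  variances <- apply(expr_mat, 1, var, na.rm = TRUE)

  # order() places NA variances last, so they are kept only if
  # fewer than n_top genes have a defined variance
  n_keep <- min(n_top, nrow(expr_mat))
  keep <- order(variances, decreasing = TRUE)[seq_len(n_keep)]

  # drop = FALSE keeps a matrix even when a single gene is selected
  expr_mat[keep, , drop = FALSE]
}

# Toy usage with made-up values
toy <- matrix(rnorm(50), nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("GSM", 1:5)))
top3 <- top_variance_genes(toy, n_top = 3)
```

Unlike the upper-quartile filter in the original example, this keeps a fixed number of genes regardless of how many are in the matrix, which keeps the heatmap height comparable across datasets.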

cgreene commented 6 years ago

Well, up to 1k samples doesn't seem to be a compute time problem. 😆

jaclyn-taroni commented 6 years ago

I'm also concerned about parameters like the presence or absence of the column names. If you have a large enough sample size, displaying the column names probably becomes unhelpful.
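One way to handle both concerns raised in this thread (a cap on dataset size and hiding column names for large datasets) is to derive the display options from the number of samples. The sketch below is only illustrative: `heatmap_display_options` is a hypothetical helper and both thresholds are placeholder values, not limits anyone has validated. The value 1000 is simply the largest size benchmarked above.

```r
# Hypothetical helper: decide whether to make a heatmap at all and whether
# to label columns, based on the number of samples in the dataset.
heatmap_display_options <- function(n_samples,
                                    max_labeled_samples = 50,    # placeholder
                                    max_plotted_samples = 1000)  # placeholder
{
  list(
    # Skip the heatmap entirely for very large datasets
    make_heatmap = n_samples <= max_plotted_samples,
    # Only show sample accessions when there are few enough to be legible
    show_column_names = n_samples <= max_labeled_samples
  )
}

opts <- heatmap_display_options(n_samples = 200)
```

The `show_column_names` element could then be passed to the argument of the same name in ComplexHeatmap's `Heatmap()`.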

cansavvy commented 6 years ago

Let me know if you would like anything adjusted: [Image: heatmap of GSE39842]

jaclyn-taroni commented 6 years ago

Looks good @cansav09 — the only tweak I would make is the annotation bar. In any download we provide it would be difficult for us to automatically determine what labels are most useful for visualization, so I don’t think we’d include an annotation bar.

cansavvy commented 6 years ago

Ah. Right. Noted. Here's a revised one: @jaclyn-taroni [Image: revised heatmap of GSE39842]

jaclyn-taroni commented 6 years ago

Thanks! I’ll snag that experiment from production and add this heatmap to the download folder so we can get some feedback.

Miserlou commented 6 years ago

Did you get the required feedback from YI so that we can start trying to add this into the production pipeline?

jaclyn-taroni commented 6 years ago

cc @dvenprasad

dvenprasad commented 6 years ago

The idea of a heatmap generally went over well with people but most of them mentioned wanting to see it in terms of genes instead.

I will say that the number of people I spoke to about this was rather low. So there are two things we can do: 1) add the heatmap as is and test it in the next round, or 2) hold off on implementing it until the next round of testing, where I can show this as a sample download folder.

I'm leaning towards option 2.

Also, I'm curious as to what the heatmap will look like when the data is aggregated by species.