Implement initial heatmap representation of global expression

CSE512-15S / fp-tdurham-ajh24-chiasson-ningli30

EPICViz - http://cse512-15s.github.io/fp-tdurham-ajh24-chiasson-ningli30/


Implement initial heatmap representation of global expression #19

Open andrewhill157 opened 9 years ago

andrewhill157 commented 9 years ago

As we discussed in our meeting, we would like to find some way to show expression on a global scale. This is challenging because the number of cells and genes can be quite high. For example, here is what a 500 x 250 matrix of random data looks like:

[image: screen shot 2015-05-18 at 6 26 19 pm]

Even at this larger size, it is really hard to see. Distortion may help, but we may also need to consider whether there are ways to limit scope or reduce the number of individual rows and columns that need to be plotted.

andrewhill157 commented 9 years ago

Note that the clustering we talked about may help to some extent in having obvious groups of genes that you might be interested in, but either way this is definitely not trivial.

tdurham86 commented 9 years ago

Hm, yeah, that's pretty dense. I think that having the heatmap filtered based on highlighted lineages will help, but that will probably not be enough on its own. We can see if the Cartesian distortion zooms enough, or we could perhaps actually implement panning and zooming on the heatmap (not sure how easy this is in D3).

tdurham86 commented 9 years ago

I will take a crack at getting the expression data loaded and presented as a heatmap. From there we can decide whether to implement some kind of panning/zooming or distortion to allow users to see the individual gene/lineage tree labels.

tdurham86 commented 9 years ago

Implemented a very rough first pass at this. (Commit coming shortly). I had to do some more work on the data file (I will email out the current version). As it turns out, JavaScript's bitwise operators only work on 32-bit integers (and Numbers are only exact as integers up to 2^53 - 1), so it is impossible to represent our 227 genes as bits of a single integer. I changed the code to simply store the expression pattern as a string of 1's and 0's. This makes the data file about 4x as large, but it still loads ok.
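For illustration, here is a minimal sketch of the 32-bit limit and of the string-based workaround. The helper name `expressedIndices` and the sample pattern are assumptions made up for this example, not code from the commit.

```javascript
// Bitwise operators in JavaScript coerce their operands to 32-bit integers,
// so a shift by 32 wraps around instead of setting a 33rd bit.
const wrapped = 1 << 32; // the shift count is taken modulo 32

// Hypothetical helper: convert an expression string of '1'/'0' characters
// into the list of indices of the expressed genes.
function expressedIndices(bits) {
  const indices = [];
  for (let i = 0; i < bits.length; i++) {
    if (bits[i] === '1') indices.push(i);
  }
  return indices;
}

// A made-up 8-gene pattern; the real strings would have 227 characters.
const exampleIndices = expressedIndices('01001000');
```

Storing one character per gene is wasteful compared to packed bits, but it avoids any dependence on integer width and keeps the parsing code trivial.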

The code needs to be optimized to really be useful -- right now the early time points progress normally, but everything gradually grinds to a halt as more cells are added. One obvious candidate for optimization: It will be worth trying to store the expression data objects persistently instead of generating them anew at each update.
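One way to store the expression data objects persistently could look like the sketch below. It assumes each cell has a unique name and an expression string; `makeExpressionCache`, `parseBits`, and the cell name `'ABa'` are hypothetical and only serve the example.

```javascript
// Hypothetical cache: parse each cell's expression string once and reuse
// the resulting object on every later update.
function makeExpressionCache(parse) {
  const cache = new Map(); // cell name -> parsed expression object
  return function getExpression(cellName, bits) {
    if (!cache.has(cellName)) {
      cache.set(cellName, parse(bits));
    }
    return cache.get(cellName);
  };
}

// Example parser (an assumption): collect the indices of the '1' characters.
let parseCalls = 0;
function parseBits(bits) {
  parseCalls += 1;
  return { expr: Array.from(bits).flatMap((c, i) => (c === '1' ? [i] : [])) };
}

const getExpression = makeExpressionCache(parseBits);
const firstLookup = getExpression('ABa', '0110');
const secondLookup = getExpression('ABa', '0110'); // served from the cache
```

Because the same object is returned on every lookup, downstream code can also compare objects by identity to detect that nothing changed for a cell.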

This implementation also highlights the density of the heatmap -- we are definitely going to need some way to zoom and/or filter the heatmap to make it useful.

tdurham86 commented 9 years ago

The most recent commit (46def2eaa06a424efe893a2520a2740a6f944d08) contains some bug fixes and improvements to the gene expression plot. It is still too slow, especially once the time slider gets just under halfway across. Here are my current thoughts:

One reason the current implementation is slow is that it has to jump through a bunch of hoops to calculate what parts of the expression map need updating. This is because the gene expression data is hierarchical -- each cell contains all the gene expression values, unlike the data for the 3D plot, for which all the data points are stored on the same "level" as elements of arrays in csvdata. An efficient implementation will operate on cells, not cell-gene pairs as is currently the case, because whether or not a data point for a gene expression value is present is totally determined by whether or not the corresponding cell is present in the time point (i.e. the number of columns in the expression plot never changes). There might be a way to leverage D3 selection groups so that we can do a two-tiered update, one based on whether cells (rows) are present at the current time point, and the next based on whether the displayed gene expression patterns (values within the rows) should be updated.
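The two-tiered idea can be sketched without D3 as two small diffs: one over cells, one over the genes of a cell that persists. This is only an illustration of the logic; the function names and the sample cell names are assumptions, and in the real code D3's enter/update/exit selections would do the first tier.

```javascript
// Tier 1: compare cell names between the previous and current time point.
function diffCells(prevNames, currNames) {
  const prev = new Set(prevNames);
  const curr = new Set(currNames);
  return {
    enter: currNames.filter((n) => !prev.has(n)), // cells that appeared
    exit: prevNames.filter((n) => !curr.has(n)), // cells that disappeared
    update: currNames.filter((n) => prev.has(n)), // cells present in both
  };
}

// Tier 2: for a cell that persists, find the genes whose state changed.
function diffGenes(prevExpr, currExpr) {
  const prev = new Set(prevExpr);
  const curr = new Set(currExpr);
  return {
    turnedOn: currExpr.filter((g) => !prev.has(g)),
    turnedOff: prevExpr.filter((g) => !curr.has(g)),
  };
}

// Made-up example: one cell divides, and one gene switches on in a survivor.
const cellDiff = diffCells(['AB', 'P1'], ['ABa', 'ABp', 'P1']);
const geneDiff = diffGenes([3, 7], [3, 7, 12]);
```

Only cells in `update` need the second tier, so the work per time step scales with the number of cells rather than the number of cell-gene pairs.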

tdurham86 commented 9 years ago

I solved a bunch of the issues in my last post. Here's generally how the expression data are plotted now:

A root SVG element is set up to hold the gene expression plot. It is populated immediately with a rect object that takes up the full space and is filled with white (or whatever we decide the 'non-expressed' color should be).
The data in the plot are set up in two phases:
1. First, all cells at the current time point are retrieved from csvdata and bound to svg elements that are appended to the root svg. These svg elements form the columns in the expression heatmap plot. They are positioned according to the d3 ordinal scale called expr_cell_scale, for which the domain is updated at each time point with the cell names for that time point in the namemap. They are also sized so that they take up the entire height of the root svg, and so that their widths are even across as much of the width as possible (the expr_cell_scale uses rangeRoundBands() to set the x-dimension coordinates and widths).
2. Next, the code iterates over all of the column svg elements and checks the expr property of that svg's data. The expr property contains indices into the gene_names list corresponding to genes that are expressed in that cell. These indices are bound as data to rect objects that are appended to the column svg element and positioned according to the d3 ordinal scale expr_gene_scale, which in turn is set according to the gene_names list. This scale controls the height of the rects, while the widths are set to be 100% of the column svg width. Last, the color of the rect is set to be the value set on the color property of the column svg's data.

Setting up the heatmap in this way allows us to leverage the powerful data binding methods of D3 to add/remove columns when cells divide and to add rect elements when new genes are expressed in a particular cell. Also, the plotting code only has to care about which genes are expressed; genes that are not expressed simply get the background color by omission.
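The geometry produced by the two phases can be sketched in plain JavaScript, independent of the DOM. Note that `bandScale` below is a simplified stand-in for an ordinal band scale and is not D3's exact `rangeRoundBands()` algorithm; `layoutHeatmap`, the gene names, and the colors are likewise assumptions for the example. The `expr` and `color` property names follow the description above.

```javascript
// Simplified stand-in for an ordinal band scale: equal integer-width bands,
// with leftover pixels split as an outer margin. Not D3's exact algorithm.
function bandScale(domain, extent) {
  const band = Math.floor(extent / domain.length);
  const offset = Math.floor((extent - band * domain.length) / 2);
  const index = new Map(domain.map((d, i) => [d, i]));
  return { band, position: (d) => offset + index.get(d) * band };
}

// Phase 1 builds one column per cell; phase 2 builds one rect per expressed
// gene inside that column. Unexpressed genes produce no rect at all.
function layoutHeatmap(cells, geneNames, width, height) {
  const cellScale = bandScale(cells.map((c) => c.name), width);
  const geneScale = bandScale(geneNames, height);
  return cells.map((cell) => ({
    name: cell.name,
    x: cellScale.position(cell.name),
    width: cellScale.band,
    rects: cell.expr.map((geneIndex) => ({
      gene: geneNames[geneIndex],
      y: geneScale.position(geneNames[geneIndex]),
      height: geneScale.band,
      fill: cell.color, // color comes from the column's data
    })),
  }));
}

// Made-up data: two cells, three genes, a 101 x 90 pixel plot area.
const columns = layoutHeatmap(
  [
    { name: 'ABa', expr: [0, 2], color: 'steelblue' },
    { name: 'P1', expr: [1], color: 'tomato' },
  ],
  ['geneA', 'geneB', 'geneC'],
  101,
  90
);
```

In the actual plot, each entry of `columns` would correspond to a column svg element and each entry of `rects` to a rect appended to it.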

andrewhill157 commented 9 years ago

If I load the vis and play immediately, the plot updates, but the heatmap does not seem to update properly when jumping around with the slider. I'm not sure if this is unique to my setup or if others are seeing this problem as well. I noticed this as of d72192c001f1905174b19424dd27cbe6cde33631, but I have not checked previous commits to see whether the same thing happens there.