Open andrewhill157 opened 9 years ago
Note that the clustering we talked about may help to some extent in having obvious groups of genes that you might be interested in, but either way this is definitely not trivial.
Hm, yeah, that's pretty dense. I think that having the heatmap filtered based on highlighted lineages will help, but that will probably not be enough on its own. We can see if the Cartesian distortion zooms enough, or we could perhaps actually implement panning and zooming on the heatmap (not sure how easy this is in D3).
I will take a crack at getting the expression data loaded and presented as a heatmap. From there we can decide whether to implement some kind of panning/zooming or distortion to allow users to see the individual gene/lineage tree labels.
Implemented a very rough first pass at this (commit coming shortly). I had to do some more work on the data file (I will email out the current version). As it turns out, JavaScript's bitwise operators only work on 32-bit integers (and plain numbers are only exact up to 53 bits), so we cannot represent our 227 genes as bits of a single integer. I changed the code to simply store the expression pattern as a string of 1's and 0's. This makes the data file about 4x as large, but it still loads OK.
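For reference, here is a minimal sketch of the limitation and of the string workaround. The helper names (`encodeExpression`, `isExpressed`) are made up for illustration and are not taken from the repo; the actual data file layout may differ.

```javascript
// Bitwise operators in JavaScript coerce their operands to 32-bit integers,
// so shift counts wrap around modulo 32.
console.log(1 << 31); // -2147483648 (the sign bit is set)
console.log(1 << 32); // 1 (the shift count wraps to 0)

// Hypothetical helper: encode a list of expressed gene indices as a string
// of '1' and '0' characters, one character per gene.
function encodeExpression(expressedIndices, geneCount) {
  const bits = new Array(geneCount).fill("0");
  for (const i of expressedIndices) {
    bits[i] = "1";
  }
  return bits.join("");
}

// Hypothetical helper: test whether the gene at `geneIndex` is expressed.
function isExpressed(pattern, geneIndex) {
  return pattern.charAt(geneIndex) === "1";
}

const pattern = encodeExpression([0, 40, 226], 227);
console.log(pattern.length);           // 227
console.log(isExpressed(pattern, 40)); // true
console.log(isExpressed(pattern, 41)); // false
```

One character per gene is wasteful compared to one bit per gene, but lookups stay trivial and there is no limit on the number of genes.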
The code needs to be optimized to really be useful -- right now the early time points progress normally, but everything gradually grinds to a halt as more cells are added. One obvious candidate for optimization: It will be worth trying to store the expression data objects persistently instead of generating them anew at each update.
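A rough sketch of what the persistent storage could look like: cache the parsed object per cell in a `Map` so that each cell is parsed only once, no matter how many updates touch it. The record shape (`name`, `pattern`, `expressed`) and the function name `getExpressionRecord` are assumptions for illustration, not the current code.

```javascript
// Cache of parsed expression records, keyed by cell name (assumed unique).
const expressionCache = new Map();
let parseCount = 0; // only here to show how often parsing actually happens

// Hypothetical input shape: { name: "AB", pattern: "0110" }.
function getExpressionRecord(cell) {
  let record = expressionCache.get(cell.name);
  if (record === undefined) {
    parseCount += 1;
    // Collect the indices of the expressed genes ('1' characters).
    const expressed = [];
    for (let i = 0; i < cell.pattern.length; i++) {
      if (cell.pattern[i] === "1") expressed.push(i);
    }
    record = { name: cell.name, expressed };
    expressionCache.set(cell.name, record);
  }
  return record;
}

// Simulate two updates over the same cells: each cell is parsed once.
const cells = [
  { name: "AB", pattern: "0110" },
  { name: "P1", pattern: "1000" },
];
cells.map(getExpressionRecord);
cells.map(getExpressionRecord);
console.log(parseCount); // 2
```

Because the same object comes back on every update, the per-update cost becomes a lookup rather than a rebuild.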
This implementation also highlights the density of the heatmap -- we are definitely going to need some way to zoom and/or filter the heatmap to make it useful.
The most recent commit (46def2eaa06a424efe893a2520a2740a6f944d08) contains some bug fixes and improvements to the gene expression plot. It is still too slow, especially once the time slider gets just under halfway across. Here are my current thoughts:
One reason the current implementation is slow is that it has to jump through a bunch of hoops to calculate what parts of the expression map need updating. This is because the gene expression data is hierarchical -- each cell contains all the gene expression values, unlike the data for the 3D plot, for which all the data points are stored on the same "level" as elements of arrays in csvdata. An efficient implementation will operate on cells, not cell-gene pairs as is currently the case, because whether or not a data point for a gene expression value is present is totally determined by whether or not the corresponding cell is present in the time point (i.e. the number of columns in the expression plot never changes). There might be a way to leverage D3 selection groups so that we can do a two-tiered update, one based on whether cells (rows) are present at the current time point, and the next based on whether the displayed gene expression patterns (values within the rows) should be updated.
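To make the two-tiered idea concrete, here is a sketch of just the bookkeeping in plain JavaScript, independent of D3. It assumes each time point can be represented as a `Map` from cell name to pattern string; the function name `diffTimePoints` and the sample cell names are made up. In D3 the first tier would presumably correspond to a keyed data join on cells (`selection.data(cells, key)`), with a nested join on the values inside each row.

```javascript
// Tier 1: which cells (rows) enter or exit between two time points.
// Tier 2: among the cells present in both, which have a changed pattern.
// `prev` and `next` are assumed to be Maps from cell name to pattern string.
function diffTimePoints(prev, next) {
  const enter = [];
  const exit = [];
  const changed = [];
  for (const [name, pattern] of next) {
    if (!prev.has(name)) {
      enter.push(name); // new row
    } else if (prev.get(name) !== pattern) {
      changed.push(name); // existing row whose values need redrawing
    }
  }
  for (const name of prev.keys()) {
    if (!next.has(name)) exit.push(name); // row to remove
  }
  return { enter, exit, changed };
}

// Made-up sample: one cell divides, one cell changes its expression pattern.
const t1 = new Map([["AB", "0100"], ["P1", "0001"]]);
const t2 = new Map([["ABa", "0100"], ["ABp", "0110"], ["P1", "0011"]]);
const diff = diffTimePoints(t1, t2);
console.log(diff);
// enter: ABa, ABp; exit: AB; changed: P1
```

The work per update scales with the number of cells rather than with the number of cell-gene pairs, and rows whose pattern is unchanged are never touched.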
I solved a bunch of the issues in my last post. Here's generally how the expression data are plotted now:
Setting up the heatmap in this way allows us to leverage the powerful data binding methods of D3 to add/remove columns when cells divide and to add rect elements when new genes are expressed in a particular cell. Also, the plotting code only has to care about which genes are expressed; genes that are not expressed simply get the background color by omission.
If I load the vis and play immediately, the plot updates, but the heatmap does not seem to update properly when jumping around with the slider. I'm not sure if this is unique to my setup or if others are seeing this problem as well. I noticed this as of d72192c001f1905174b19424dd27cbe6cde33631, but I have not checked previous commits to see whether they behave the same way.
As we discussed in our meeting, we would like to find some way to show expression on a global scale. This is challenging because the number of cells and genes can be quite high. For example, here is what a 500 x 250 matrix of random data looks like:
Even at this larger size, it is really hard to see. Distortion may help, but we may also need to consider whether there are ways to limit scope or reduce the number of individual rows and columns that need to be plotted.