NeuroDataDesign / pan-synapse-f16s17

Detecting synapses from two-fluorescent microscope images.
Apache License 2.0

Deliverable 2/16 & 2/20 #47

bstadt closed this issue 7 years ago

bstadt commented 7 years ago
| Status | Deliverable | Notes |
| --- | --- | --- |
| Complete | Comparison of slice norm techniques (mean, max, window) | Brandon, for Thurs |
| Complete | Hough transform algorithm md | Brandon, for Monday |
| Complete | Matching correspondence for evaluating the pipeline on simulated data | Richard, for Thurs |
| Complete | Investigate light microscopy synapse detection | Richard |
| Complete | Sparse matrix conversion | Will, for Monday |
| Complete | Scipy volume and centroid implementation | Will, for Thurs |

Norm Comparisons: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/normComparison.ipynb
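For reference, a minimal sketch of what the three slice-norm schemes could look like; the authoritative definitions are in the notebook above, and the z-window interpretation of "window" is my guess:

```python
import numpy as np

def normalize_slices(volume, method="mean", window=5):
    """Normalize each z-slice of a 3-D volume.

    A sketch of "mean", "max", and "window" slice normalization; the
    z-window interpretation of "window" is an assumption.
    """
    out = np.empty_like(volume, dtype=np.float64)
    for z, plane in enumerate(volume):
        if method == "mean":
            out[z] = plane / plane.mean()          # divide by slice mean
        elif method == "max":
            out[z] = plane / plane.max()           # divide by slice max
        elif method == "window":
            lo, hi = max(0, z - window), min(len(volume), z + window + 1)
            out[z] = plane / volume[lo:hi].mean()  # stats from a z-window
        else:
            raise ValueError(f"unknown method: {method}")
    return out
```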

Will's notebooks:

- Scipy volume/centroid implementation investigation: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Cluster_Components_Class_Algorithms.md.ipynb
- Trying to create my own sparse arrays: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Sparse_Arrays_Algorithms.md.ipynb
- Sparse connected components: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Sparse_ConnectedComponents_Algorithms.md.ipynb

Evaluating the pipeline:

- Eval notebook: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Eval.md.ipynb
- Thursday's work: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/precisionrecall.ipynb

Hough Transform: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/houghTransform.md.ipynb

gkiar commented 7 years ago
rguo123 commented 7 years ago

Hey Greg, concerning your comments: there should be a lot of Plotly graphs in both of my notebooks showing the results. I just took a look at them, and for some reason they have all disappeared. I can quickly rerun the code and reupload.

The figures in the eval markdown showed the overlap between the true clusters and our results from the PLOS pipeline. From the graphs, you can see that areas with no clusters suddenly spawn a ton of small ones, and some regions that contain true clusters get degraded away completely after PLOS. The conclusion from eval.md is that PLOS simply does not work for our type of data, so an alternative algorithm is necessary.
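For reference, here is a minimal sketch of the kind of matching-correspondence metric behind these figures: a detection counts as a hit when its centroid lies within some tolerance of a not-yet-matched true cluster. The tolerance and the greedy matching rule are placeholders, not necessarily what the notebooks use.

```python
import numpy as np
from scipy.spatial.distance import cdist

def precision_recall(true_centroids, found_centroids, tol=5.0):
    """Greedy centroid matching: a detection is a true positive when it
    lies within `tol` voxels of an unmatched true cluster."""
    true = np.asarray(true_centroids, dtype=float)
    found = np.asarray(found_centroids, dtype=float)
    if len(found) == 0 or len(true) == 0:
        return 0.0, 0.0
    dists = cdist(found, true)                 # detection-to-truth distances
    matched, hits = set(), 0
    for i in np.argsort(dists.min(axis=1)):    # closest detections first
        j = int(dists[i].argmin())
        if dists[i, j] <= tol and j not in matched:
            matched.add(j)
            hits += 1
    return hits / len(found), hits / len(true)  # precision, recall
```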

Update: here are the notebooks with the graphs showing:

- Monday: http://nbviewer.jupyter.org/github/NeuroDataDesign/pan-synapse/blob/master/background/Eval.md.ipynb
- Thursday: http://nbviewer.jupyter.org/github/NeuroDataDesign/pan-synapse/blob/master/background/precisionrecall.ipynb

gkiar commented 7 years ago

Thanks, Richard. Much better, I appreciate the update! What's your plan to move forward and make a thing that works? :)

rguo123 commented 7 years ago

I think the best use of my time would be to look into alternatives to PLOS that can successfully filter background noise out of our data (based on size and brightness). Do you have any algorithms you'd suggest I look into? I believe Brandon is also doing this next week, so if you think it would be better for me to focus my time on something else, please let me know.
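For concreteness, the kind of size-and-brightness filter I have in mind looks roughly like this; the percentile and size cutoffs are placeholders, not tuned values:

```python
import numpy as np
from scipy import ndimage

def filter_noise(volume, intensity_pct=90, min_voxels=50):
    """Drop background by brightness, then drop clusters that are too small."""
    bright = volume > np.percentile(volume, intensity_pct)  # brightness cut
    labels, n = ndimage.label(bright)                       # 3-D components
    sizes = ndimage.sum(bright, labels, index=np.arange(1, n + 1))
    keep = np.where(sizes >= min_voxels)[0] + 1             # labels to keep
    return np.isin(labels, keep)                            # cleaned mask
```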

levinwil commented 7 years ago

Hey Greg,

What I figured out last week was that my updates to connected components sped up the second half of our pipeline, but it was still taking far too long to finish. The entire job of the second half of our pipeline is to identify clusters, threshold out the ones that are too big, and return information about them (for now, centroids and volumes) given a binary image. At the very tail end of the week, I found that scipy already has methods that return centroids and volumes given a binary image, and I quickly ran them on the entire volume. Thus, I wanted to see if they could be used in our pipeline.
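For concreteness, the scipy functions I mean live in `scipy.ndimage`; here is a minimal sketch on a toy binary volume (my notebook's exact calls may differ slightly):

```python
import numpy as np
from scipy import ndimage

binary = np.random.rand(64, 64, 64) > 0.98               # toy binary volume
labels, n = ndimage.label(binary)                        # label clusters
idx = np.arange(1, n + 1)
volumes = ndimage.sum(binary, labels, index=idx)         # voxels per cluster
centroids = ndimage.center_of_mass(binary, labels, idx)  # centroid per cluster
```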

This week, I wanted to accomplish two goals:

  1. For Thursday, I wanted to fully investigate those scipy functions and see how well they would work in our pipeline (this is what the following notebook is devoted to: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Cluster_Components_Class_Algorithms.md.ipynb). Specifically, I wanted to test the built-in functions for image labeling, volume thresholding, centroid finding, and volume finding. Because they perform the same function as our connected components code, I ran them on the same simulated data so I could check (a) that they give the same results, i.e. that scipy's functions behave as expected, and (b) whether scipy's functions run more quickly than ours. After running scipy's functions on those simulated data sets and on our real data set, I found that they could threshold the clusters by volume and find both the centroid and volume of each cluster in 62.4 seconds (this is the number at the bottom of that notebook). My conclusion is that these functions work very quickly and are well suited to what they are supposed to do. In particular, the volume thresholding function runs in under a second on our entire volume and should be used in our pipeline regardless of how we decide to move forward with centroid and volume finding (i.e. regardless of how we handle the rest of the second half of our pipeline). As for scipy's centroid-finding and volume-finding functions, my only hesitation is that if we ever want to perform any thresholding other than volume thresholding, we won't be able to. That is why, for Monday, I wanted to address the speed issue in our previous version of the second half of the pipeline using sparse arrays, which give us more flexibility in our thresholding than scipy's library.

  2. For Monday, I wanted to address the speed issue we were having with the member-find portion of the second half of our pipeline (i.e. after we run connected components and each cluster has a different label, the step where we find which indices have a value equal to each label). At your suggestion we looked to sparse arrays, which are very well suited to this problem considering that only 2% of our data is non-zero. The main issue I ran into is that no 3-dimensional sparse array implementation currently exists, so I initially tried to code one myself (this can be seen here: https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Sparse_Arrays_Algorithms.md.ipynb). As you can see, my implementation took 16 hours on a mere 5-slice subvolume, so I immediately knew that coding it myself wouldn't work. I then thought about the problem differently and stored each 2-D slice of our volume as a sparse array; a sketch of this slice-wise approach follows below. To ensure I was getting the same results as the original second half of the pipeline, I again ran this algorithm on the same simulated data sets. After running the sparse version on simulated data and then on our real data, I found that it runs significantly faster: about 14.2 seconds (shown at the bottom of https://github.com/NeuroDataDesign/pan-synapse/blob/master/background/Sparse_ConnectedComponents_Algorithms.md.ipynb) versus what used to be an hour.
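Roughly, the slice-wise sparse member-find looks like this (a sketch, not the notebook code verbatim; in the real pipeline these groupings feed our Cluster class):

```python
import numpy as np
from collections import defaultdict
from scipy import sparse

def member_find(labeled_volume):
    """Group voxel indices by cluster label, one sparse 2-D slice at a time.

    With only ~2% of voxels non-zero, converting each z-slice to COO format
    lets us walk the non-zero entries directly instead of scanning the whole
    dense array once per label.
    """
    members = defaultdict(list)
    for z in range(labeled_volume.shape[0]):
        coo = sparse.coo_matrix(labeled_volume[z])   # sparse 2-D slice
        for r, c, label in zip(coo.row, coo.col, coo.data):
            members[int(label)].append((z, int(r), int(c)))
    return members
```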

So, TL;DR: the sparse version of the second half of our pipeline runs extremely quickly, does what it is supposed to do, and gives us more flexibility for the future than scipy's functions. I also think we should use scipy's volume thresholding function, since it is simple and runs extremely quickly.

Thus, I believe the second half of our pipeline should look something like this (rough code sketch after the list):

  1. Given the binarized image from the first half, generate unique labels for connected components using scipy's connected components function.
  2. Run scipy's volume thresholding function, getting rid of any cluster smaller than 135 voxels (~1 micron, per Richard's suggestion).
  3. Run the sparse version of the second half of our pipeline (essentially, group the same-label indices into our Cluster class, which lets us calculate things like centroid and volume and gives us flexibility for operations like morphological thresholding in the future).
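Stitched together, the rough sketch promised above (the Cluster class is stubbed out as plain index arrays here):

```python
import numpy as np
from scipy import ndimage

def second_half(binary_volume, min_voxels=135):
    # 1. label connected components on the binarized image
    labels, n = ndimage.label(binary_volume)
    # 2. volume threshold: zero out clusters smaller than min_voxels
    volumes = ndimage.sum(binary_volume, labels, index=np.arange(1, n + 1))
    too_small = np.where(volumes < min_voxels)[0] + 1
    labels[np.isin(labels, too_small)] = 0
    # 3. member-find on the survivors (sparse slice-wise version in practice)
    return {int(lab): np.argwhere(labels == lab)
            for lab in np.unique(labels) if lab != 0}
```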
gkiar commented 7 years ago

ok, cool. Thanks guys. Looking forward to chatting on Thursday. :)

bstadt commented 7 years ago

@gkiar I plan to use window shift norm.

@gkiar The Hough line transform is poor even on the easiest simulated data. I think it is better suited to extracting the equation of a line in an image where the line is obvious than to finding lines in a noisy image. I do not plan to implement it in the pipeline.
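For anyone curious, a toy harness one could use to see the effect with skimage's Hough line transform: salt noise raises the whole accumulator, washing out the contrast of the true line's peak (assumes scikit-image; this is illustrative, not my actual simulation code):

```python
import numpy as np
from skimage.transform import hough_line

img = np.zeros((100, 100), dtype=bool)
img[50, :] = True                                  # one clean horizontal line
noisy = img | (np.random.rand(100, 100) > 0.90)    # add ~10% salt noise

for name, im in [("clean", img), ("noisy", noisy)]:
    h, angles, dists = hough_line(im)
    # contrast of the strongest peak against the accumulator background
    print(f"{name}: peak/mean ratio = {h.max() / h.mean():.1f}")
```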