Add optional manual curation step

klaragerlei commented 3 years ago

Is your feature request related to a problem? Please describe. My data has a lot of split clusters, especially when neurons are bursty (which is a feature of deep MEC cells that I'm targeting). Another issue is artifacts getting clustered to neurons during opto-tagging.

I don't think this is a problem for all types of experiments, and I think it's very useful to just use MS to get an initial idea for any data set, so I wouldn't want to change how the analysis pipeline works by default, but add this as an optional feature. In cases like my project, where I'm looking for a relatively small population, it's very important that there aren't any duplicates because of split clusters.

For example:

Describe the solution you'd like I would like to make it possible to run the pre-processing and sorting parts only, save the output in a format that can be opened by phy (I'm still looking into options for what to use), save manual curation results and make the post-sorting use these. I would also build something in to check whether there are manual results to prevent any manual curation getting overwritten by accident.

So analyzing a data set would include these separate steps:

run pre-processing and sorting an entire data set (automatically)
manually curate recordings (not on Eleanor)
run post-processing on Eleanor for curated data

A few ideas for how this could work:

Use an env variable to set to 'manual_preprocess' mode that control_sorting_analysis.py checks and only runs the pre-processing and then set to 'manual_postprocess' or similar to use the manual curation results.
Write a separate script that's used for this instead of control_sorting_analysis
Add this info in the parameters file (I don't like this option)

@4iar @teristam @HDClark94 please let me know what implementation you'd prefer!

teristam commented 3 years ago

Hi,

I think in general it is very good idea to save the output of the spike sorting results to a file. And then all subsequent analysis can be based on that file. This can save us a lot of time if all we want is to change some analysis in the post-sorting steps. So I think the workflow can be something like this:

Spike sorting saves its results to a file
phy-related scripts make a backup of the original file, convert it to a format that can be opened by phy
Do the manual curation in phy, and output a file with the same format as in Step 1
Postprocessing continue processing the sorting results

The advantage of this approach is that Step 2 and 3 can be skipped for most recordings and it won't add extra burden to the rest of the codebase. Spike sorting and postprocessing don't need to know how the curation is done.

I like the env variable approach more, but instead of using manual_preprocess which is very specific to a particular experimental design, we may use skip_sorting and skip_postprocessing or something like that for a more general use case. There probably should be aother env flag that say whether the pipeline should convert the sorting to a phy-readable format.

klaragerlei commented 3 years ago

One complication is that when I sort multiple recordings together, I would want to do the manual curation on the combined sorted output which then needs to be split back, so the post-processing in this case would need to include the splitting. I'm not sure if there's a better way around this. (This sounds potentially confusing to me, since only one of the recordings will have the sorting output saved.)

HDClark94 commented 3 years ago

Have you tried loading phy for a recording? I tried it on a recording with about 15 cells and my computer was really slow loading everything and trying to inspect the data just became a slow nightmare. Kevin Allan mentioned he only uses homemade curation tools. Would it not be enough to make post sorting scripts that give you an overview of everything you want then decide on cluster mergers? We have waveform, pca plots but not cross cell correlograms. If you could decide based on those alone you wouldn’t need to do that much work. And then you can recompute the combined plots based on the selected merges. Everything’s already saved in DataFrames so no need to save any more outputs.

On Thu, 10 Jun 2021 at 13:38, Klara Gerlei @.***> wrote:

One complication is that when I sort multiple recordings together, I would want to do the manual curation on the combined sorted output which then needs to be split back, so the post-processing in this case would need to include the splitting. I'm not sure if there's a better way around this. (This sounds potentially confusing to me, since only one of the recordings will have the sorting output saved.)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/MattNolanLab/in_vivo_ephys_openephys/issues/291#issuecomment-858584981, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGXS63N2GZAACBW2ID2K3HLTSCW3LANCNFSM46OHYIGA .

klaragerlei commented 3 years ago

@HDClark94 you're right about merging, I think this could be doable. I guess we could make a spreadsheet or something manually that the script just loads and then concatenate the firing times.

For removing artifacts I'd want to see the spikes in PC space and be able to interact with it. I don't think it'd be easy to remove these without actually drawing around the cluster in multidimensional space (or adding a lot more code that does this based on amplitude perhaps).

I used manual curation tools before and they opened data very easily (in less than a minute), I think I'll just have to test a few options. There are usually settings that allow you to downsample for visualizing so it doesn't have to load it all.

I haven't started doing this yet, because I wanted opinions on the code architecture first. :)

teristam commented 3 years ago

One complication is that when I sort multiple recordings together, I would want to do the manual curation on the combined sorted output which then needs to be split back, so the post-processing in this case would need to include the splitting. I'm not sure if there's a better way around this. (This sounds potentially confusing to me, since only one of the recordings will have the sorting output saved.)

Is there any reason why we can't save the sorting output for both recordings? I think the guiding principle should be that the new analysis should handle all the complexity within itself as much as possible, while keeping the original pipeline straight forward.

klaragerlei commented 3 years ago

Is there any reason why we can't save the sorting output for both recordings? I think the guiding principle should be that the new analysis should handle all the complexity within itself as much as possible, while keeping the original pipeline straight forward.

Both would be saved, but unless we add more code and manual steps outside the pipeline on top of this curation step, the manual curation results would go in a single file that would need to be split up later on. I would prefer to keep manual steps (including having to run code separately for things) to a minimum so I think it would be better to handle this inside the pipeline code. I'm not sure if there are other good options - how would you implement it?

teristam commented 3 years ago

Is there any reason why we can't save the sorting output for both recordings? I think the guiding principle should be that the new analysis should handle all the complexity within itself as much as possible, while keeping the original pipeline straight forward.

Both would be saved, but unless we add more code and manual steps outside the pipeline on top of this curation step, the manual curation results would go in a single file that would need to be split up later on. I would prefer to keep manual steps (including having to run code separately for things) to a minimum so I think it would be better to handle this inside the pipeline code. I'm not sure if there are other good options - how would you implement it?

mm... I think maybe we can just use a simple script to split the data after manual curation? I can understand that adding it to the pipeline is nice, but running another short independent script after manual curation doesn't seem to add a lot of work to me. The one thing I am trying to avoid is adding too many env variable dependencies into the code...soon it will become very messy.

klaragerlei commented 3 years ago

Another issue I realized we'll face is that the opto (and snippet?) analysis will need the filtered & whitened data in some format... so either the pre-processing or a version of it needs to run again including the splitting step. (I don't think that saving these huge files is an option.)

I still haven't started any of this by the way, but I'm thinking about how to do it. I think I'll just have to try out a few things to see what's best so we have more options to choose from.

klaragerlei commented 3 years ago

Thank you for all the suggestions! I tried to consider all your advice and this is what I'm working towards (on the add_manual_curation branch):

Write a script that takes the path to a recording on datastore as an input and copies the pipeline outputs and raw data to Eleanor (including paired recordings if applicable), combines the spatial_firing and continuous data files and makes a single phy file for them.
Manually curate data in phy and save output (phy saves things in a python friendly format) - Ian's suggestion for setting up phy on Eleanor is working well enough for me. I will share the documentation on how to set things up when I'm done :)
Write another script that processes the saved phy data, splits it if necessary and saves it as a spatial_firing_curated.pkl data frame on the server for each recording in a pipeline compatible format. 4. Modify the pipeline to check if this curated data frame exists and never re-sort if it exists and use it for post-processing.

I would like to do #4, because we need to implement something to prevent overwriting manually curated files. It can take up to an hour to manually curate a big recording so I think we should get the user to delete the files if they want to overwrite them. Otherwise weeks of manual work could be lost by accident. I don't think it's enough to ask the user to set an environmental variable for this, I think it simply shouldn't sort.

I hope this sounds okay. I'm happy to change things if there are issues with this approach.

klaragerlei commented 3 years ago

This is more or less done. I'm not submitting a PR yet, because there is a pipeline bug (mismatch between cluster ids) that needs to be fixed for this to work well.

I wrote some temporary documentation for it here: https://klarazgerlei.notion.site/Manual-curation-using-phy-8bce7611f4d548c7ba0358872975c33a I will add it to /docs once I reach a PR ready version and I'm a bit more confident that you're all relatively happy with this approach. I think it's possible that concatenating multiple sessions leads to more clustering errors relative to sorting single sessions (which is what Tizzy and I validated by comparing the sorting results to manual curation when setting up the pipeline), I see quite a lot of issues in phy that needs fixing...

Any feedback or criticism is welcome. :) I'll continue working on this during neuromatch.

klaragerlei commented 3 years ago

https://github.com/MattNolanLab/in_vivo_ephys_openephys/pull/301

MattNolanLab / in_vivo_ephys_openephys

Add optional manual curation step #291