Open yarikoptic opened 4 years ago
The short answer is that SpikeForest does not compute such a metric. But the more helpful answer is that this information can be obtained with a relatively short Python script. For each ground-truth unit we have the accuracy computed across all the sorters, so I think what you are suggesting is to take the max over all sorters.
On Mon, Nov 4, 2019 at 2:57 PM Yaroslav Halchenko notifications@github.com wrote:
I don't remember if I saw something like that on spikeinterface's SfN poster, or not... and I think there is nothing like that reported on spikeforest.
I am interested in getting some metric which would reflect the number of "recovered" units across all spike sorters, which would hint that some joint/meta spike sorting using multiple algorithms could be of benefit. E.g. it could be "joint accuracy" -- if we take 2 (3, 4, ...) algorithms, "cherry pick" the correctly made decisions across them, and estimate accuracy. And then report the max accuracy for 2, 3, 4, ... I hope that at some point that "curve" reaches 100%. But it would be interesting to see if it does for any specific dataset, and which critical value (# of spike sorters, and which bundle) is necessary to potentially identify all units. If it doesn't reach 100% -- report the top % (with all of the spike sorters).
Such a metric could be useful, e.g., to hint at idiosyncrasies of the spike sorters vs. of the data itself (noise etc.). E.g. if there is a dataset where 100% accuracy is achievable with 2 sorters while any individual one does 70%, we would know that the data is good and that the algorithms are really picking up on different features of the units. With a dataset whose top achievable accuracy is not far from the best individual sorter, we would know that the limit is likely the data (noise, or some aspects not covered by any sorter).
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/flatironinstitute/spikeforest/issues/78?email_source=notifications&email_token=AA4CIQH4QE44H6J3NJJ3ZFDQSB5BZA5CNFSM4JIYMZ7KYY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HWWM6KQ, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA4CIQEMOCFUTDC2IRW5SWDQSB5BZANCNFSM4JIYMZ7A .
How is accuracy defined for each unit? I thought it would be 0 or 1 (miss or hit)
If you set a threshold (say 80% accuracy) then yes you can consider it as a miss or hit. But the comparison outputs accuracy rates for each ground truth unit as spelled out here: https://spikeforest.flatironinstitute.org/metrics
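For reference, my reading of that metrics page is that the per-unit accuracy folds misses and false positives into a single ratio. A minimal sketch of that reading (the actual counting of matched/missed/false-positive spikes is done by the comparison code; the function name here is made up):

```python
def unit_accuracy(n_match, n_missed, n_false_positive):
    """Per ground-truth-unit accuracy, as I read the SpikeForest metrics
    page: matched spikes divided by the total of matched, missed, and
    false-positive spikes attributed to that unit."""
    return n_match / (n_match + n_missed + n_false_positive)

# e.g. 80 matched spikes, 10 missed, 10 false positives
print(unit_accuracy(80, 10, 10))  # 0.8
```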
Thanks - I now feel less ignorant! ;-)
minor quick note (I guess I could just look in the code), but in the equation for SNR_k: shouldn't max_m be outside of the ratio (thus choosing the best sensor), since the division is per channel (sigma_m)?
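To make the question concrete, here is the variant I would expect, sketched under my assumptions about the notation (the unit's mean waveform W per channel, and a per-channel noise estimate sigma_m): the ratio is formed per channel first, and max_m is then taken over the ratios.

```python
import numpy as np

def unit_snr(template, noise_sigma):
    """template: (n_channels, n_samples) mean waveform of one unit.
    noise_sigma: (n_channels,) per-channel noise estimate.
    Divide each channel's peak amplitude by that channel's sigma,
    then take the max over channels (the best channel's ratio)."""
    peaks = np.max(np.abs(template), axis=1)   # peak amplitude per channel
    return float(np.max(peaks / noise_sigma))  # max of per-channel ratios
```

The alternative reading (max_m over the numerator only) would leave it ambiguous which sigma_m to divide by, which is what prompted the question.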
what about: the mean of accuracies across all "best gross matching units", where the "best gross matching unit" is the best match across all ("considered") sorting algorithms. So if you "consider" a single algorithm, the measure would differ from the current accuracy, since it would be estimated toward each sorted unit, not toward each GT unit (maybe that is a bad idea). Then it would be possible to consider each pair of algorithms (and report the max accuracy and the "winning" pair), each triplet, and so on (the combinatorics aren't really on our side, but it shouldn't be prohibitive), up to including all sorting algorithms. The value for "all sorting algorithms" would then be the diagnostic indicating whether it is the algorithms (high value) or the data (low value) that prevents an individual sorting algorithm from obtaining ultimately high accuracy.
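A brute-force sketch of that enumeration, assuming the per-GT-unit accuracies for each sorter are already available (the function name and data layout are made up for illustration):

```python
from itertools import combinations

def best_joint_accuracy(accuracies, subset_size):
    """accuracies: dict sorter_name -> list of per-GT-unit accuracies
    (same unit order for every sorter). For every subset of sorters of
    the given size, take the best accuracy per GT unit across the subset,
    average over units, and return the winning (subset, mean accuracy)."""
    n_units = len(next(iter(accuracies.values())))
    best = None
    for subset in combinations(sorted(accuracies), subset_size):
        per_unit_best = [max(accuracies[s][u] for s in subset)
                         for u in range(n_units)]
        mean_acc = sum(per_unit_best) / n_units
        if best is None or mean_acc > best[1]:
            best = (subset, mean_acc)
    return best

# two complementary sorters: together they "cover" both units
acc = {'a': [0.9, 0.1], 'b': [0.1, 0.9], 'c': [0.5, 0.5]}
print(best_joint_accuracy(acc, 2))  # (('a', 'b'), 0.9)
```

Sweeping subset_size from 1 to the number of sorters gives exactly the "curve" described above, along with the winning bundle at each size.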
If you set a threshold (say 80% accuracy) then yes you can consider it as a miss or hit
it could indeed be done that way, but I guess it would be better to avoid additional arbitrary decision parameters (such as the 80% threshold)
what about: mean of accuracies across all "best gross matching units", where for "best gross matching unit" is the best matching across all ("considered") sorting algorithms.
That sounds reasonable. I think both mean and max would be helpful. Max is nice because it represents the best one might expect to be able to do. The mean has the disadvantage of being lowered when extra less-optimal algorithms are added into the mix.
Hm... I don't think that adding a bad sorter should negatively impact the "all sorters" result, since per unit we would select the best match (highest accuracy), so a bad sorter would just never provide it. My fear is actually that this metric would saturate too close to 100% on all datasets ;-)
Okay, I see. I wasn't reading carefully enough. I think you are asking... can we use the information from more than one sorter to improve the quality of spike sorting?
Not to improve per se, but to get a metric indicative of the limiting factor (sorting vs. data) and some kind of an upper bound (for a dataset) on what could be obtained (e.g. via a better sorter or by combining multiple).
And then possibly approach the "improve" part, but that would be a separate issue ;-)
@yarikoptic you should have a look at the class MultiSortingComparison in the spikecomparison module.
This could be a direction.
Also have a look at this: https://spikeinterface.readthedocs.io/en/latest/sorters_comparison.html
Part of the spikecomparison module is used internally by SpikeForest; we also try various ideas there that could be incorporated into SF one day.
Thanks. Will revisit and plan to integrate this. J
it (spikecomparison) would probably indeed be the best place to add such a metric!
ATM the agreement sorter (and the metrics computed on it) goes the opposite way -- if I get it right, it finds the "intersection" (common units). The metric I had in mind would take the "union" and compare it to the ground truth (so it is applicable only to datasets with ground truth) -- as such it will not really compare spike sorters, but rather state whether the whole collection of spike sorters could in principle identify all units.
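The "union" view can be sketched with made-up inputs (what counts as a unit being "recovered" by a sorter is left to whatever criterion one prefers):

```python
def union_coverage(recovered, n_gt_units):
    """recovered: dict sorter_name -> set of ground-truth unit ids that
    the sorter recovered. Returns the fraction of GT units found by at
    least one sorter -- the 'union' counterpart to the agreement view,
    which instead intersects the sorters' outputs."""
    found = set().union(*recovered.values())
    return len(found) / n_gt_units

# sorters overlap on unit 1, but together cover 3 of the 4 GT units
print(union_coverage({'a': {0, 1}, 'b': {1, 2}}, 4))  # 0.75
```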