SpikeInterface / spikeinterface

A Python-based module for creating flexible and robust spike sorting pipelines.
https://spikeinterface.readthedocs.io
MIT License
531 stars 188 forks source link

Seed behavior for generate #1963

Open h-mayorquin opened 1 year ago

h-mayorquin commented 1 year ago

I did not have time to participate in the review of https://github.com/SpikeInterface/spikeinterface/pull/1948.

We added the following auxiliary function to generate:

https://github.com/SpikeInterface/spikeinterface/blob/3359647eb7a75a4f258590a46b3f665b7aac9d1d/src/spikeinterface/core/generate.py#L16-L24

Which generates a different seed at runtime when no default seed is provided.

I think I understand the rationality, it tries to achieve two things. 1) The results of the generator should be the same after saving it and loading (@samuelgarcia, thanks for adding tests for this!). 2) Every call to a generator should generate new traces if the seed is not specified. Let's call this variability by default.

I don't agree completly with 2. When less experienced users utilize let's stay generate_ground_truth_recording with default parameters I want the function to have deterministic behavior. This is better for debugging and error finding as when I run the function on my system it wiill be as similar as possible as their run. With the current behavior given by ensure_seed I will not be able to reproduce their exact results unless I have their specific seed that was generated at runtime. This can be extracted but it will be difficult for new users and it will create one more round of questions and requests when discussing an issue. A more severe version of this problem is presented if users find an error when they run generate_ground_truth_recording (because the specific seed was problematic), but then they run the function again and the error dissappear. This undermines their trust in their algorithm (they just know something is wrong but can't reproduce) and makes it hard for them to report the problem to us.

I think that @samuelgarcia made this decision from a power user point of views as this is more convenient and it makes sense semantically: NoiseGeneratorRecording should generate different noise everytime you call it (just like np.random.rand). But I think error reporting and debugging is an imporant use case of these functions which I think is better served by having a specific seed by default. This is better for less experienced users. Power users that want to actually generate variability are in a better position to know what a seed is, change it themselves and roll a function like ensure_seed by themselves if needed.

Two other minor advantages of having a seed by default instead of generating at runtime: 1) It is better for provenance, kwargs saves exactly what the user inputed instead of a value generated on the fly. 2) Less code in our library and easier to document the behavior.

h-mayorquin commented 4 months ago

I haved changed my view somehow and I think that most of the functions that interact with the user should actually have the behavior that @samuelgarcia implemented.