bmcfee / muda

A library for augmenting annotated audio data
ISC License

RNG seed [formerly Reproducibility enhancements] #31

Closed: ejhumphrey closed this issue 5 years ago

ejhumphrey commented 8 years ago

At least two ideas jump out at me re: reproducibility:

  1. RandomDoAThing deformers could optionally take seed params, but always use one internally (and serialize accordingly).
  2. It'd be great if we could reconstruct a deformation pipeline exactly from the "history" object ... which really means either (a) the serialization object should encompass state, which isn't the case for RandomDoAThing deformers, or (b) there's a higher-level object that combines state and pipeline as different objects. The difference here is small (and maybe semantic), but it's a difference between a class and an instance (the pipeline is the class, the state is the instance). This might have interesting repercussions for the design of the Pipeline, which is perhaps more aptly called a PipelineFactory.
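A minimal sketch of idea 1, using only the standard library. `RandomGain`, its parameters, and the JSON layout are hypothetical and not muda's actual API; the point is only that the deformer always holds a concrete seed and writes it out when serialized.

```python
import json
import random


class RandomGain:
    """Hypothetical randomized deformer (illustration only, not muda's API)."""

    def __init__(self, n_samples=3, low=-6.0, high=6.0, seed=None):
        # The seed is optional for the caller, but the object always
        # holds a concrete seed internally.
        if seed is None:
            seed = random.SystemRandom().randrange(2**32)
        self.n_samples = n_samples
        self.low = low
        self.high = high
        self.seed = seed

    def states(self):
        """Yield one parameter state per sample, fully determined by the seed."""
        rng = random.Random(self.seed)
        for _ in range(self.n_samples):
            yield {"gain_db": rng.uniform(self.low, self.high)}

    def serialize(self):
        """Encode the constructor parameters, seed included, as JSON."""
        params = {"n_samples": self.n_samples, "low": self.low,
                  "high": self.high, "seed": self.seed}
        return json.dumps({"class": "RandomGain", "params": params})

    @classmethod
    def deserialize(cls, text):
        """Rebuild an equivalent deformer from its JSON encoding."""
        return cls(**json.loads(text)["params"])


original = RandomGain()  # no seed given, so one is chosen internally
clone = RandomGain.deserialize(original.serialize())
same = list(original.states()) == list(clone.states())
```

Because the seed travels with the serialized object, `clone` yields the same sequence of states as `original`, which is the "serialize accordingly" part of the idea.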

Please yell if any of this is unclear; I'm working through the idea in a stream-of-consciousness way.

bmcfee commented 8 years ago

Yeah, agree 💯. I recently implemented this kind of thing over in entrofy, so it wouldn't be hard to do.

ejhumphrey commented 8 years ago

So, thoughts on the PipelineFactory and Pipeline objects? PipelineFactory is the iterator that yields a Pipeline, which can then be passed a data object to deform. Or do you see a simpler approach?

bmcfee commented 8 years ago

Well, if you're reconstructing a deformation pipeline from a muda output, it only has to generate a single example. Parameterizing each element of the pipeline according to its seed (and state number) ought to suffice, so we shouldn't need to generate multiple pipeline objects.
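As a rough illustration of the "seed plus state number" parameterization, here is a sketch in plain Python. The function names and the single `gain_db` parameter are made up for the example; muda's deformers may draw their parameters differently.

```python
import random


def all_states(seed, n_samples, low=-6.0, high=6.0):
    """Generate every state of a hypothetical randomized deformer."""
    rng = random.Random(seed)
    return [{"gain_db": rng.uniform(low, high)} for _ in range(n_samples)]


def state_at(seed, index, low=-6.0, high=6.0):
    """Recover only the state with the given number from the same seed."""
    rng = random.Random(seed)
    for _ in range(index):
        rng.uniform(low, high)  # discard draws belonging to earlier states
    return {"gain_db": rng.uniform(low, high)}


# Rebuilding a single example needs only (seed, index), not a new pipeline.
single = state_at(seed=1234, index=2)
full_run = all_states(seed=1234, n_samples=5)
```

Under these assumptions, `single` matches `full_run[2]`, so one output can be regenerated without iterating over multiple pipeline objects.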

ejhumphrey commented 8 years ago

I'm thinking of the scenario where I generate one pipeline and want to apply it to different audio-jams objects ... to do this currently, I have to keep making new Pipelines with n_samples=1. Intentionally having singleton iterators seems like a design smell, no?

bmcfee commented 8 years ago

> I'm thinking of the scenario where I generate one pipeline and want to apply it to different audio-jams objects

Different meaning totally different content? If that's the case, why would you care about porting over random parameters?

If you want to reinstantiate a pipeline, random seeds and all, that can be done with the current serialization code (properly extended to include seeds).
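To make "serialization extended to include seeds" concrete, here is a small standard-library sketch. The classes, the registry, and the JSON layout are all hypothetical stand-ins; this is not muda's serialization code.

```python
import json
import random


class SeededDeformer:
    """Hypothetical base class: a parameter range plus an explicit seed."""

    def __init__(self, low, high, seed):
        self.low = low
        self.high = high
        self.seed = seed

    def draw(self, n):
        """Draw n parameters; the result depends only on the stored seed."""
        rng = random.Random(self.seed)
        return [rng.uniform(self.low, self.high) for _ in range(n)]


class RandomGain(SeededDeformer):
    """Hypothetical deformer drawing gains."""


class RandomShift(SeededDeformer):
    """Hypothetical deformer drawing pitch shifts."""


# Map class names back to classes when loading.
REGISTRY = {cls.__name__: cls for cls in (RandomGain, RandomShift)}


def dump_pipeline(steps):
    """Serialize each step's class name and parameters, seeds included."""
    return json.dumps(
        [{"class": type(s).__name__, "params": vars(s)} for s in steps]
    )


def load_pipeline(text):
    """Reinstantiate the steps from the serialized form."""
    return [REGISTRY[e["class"]](**e["params"]) for e in json.loads(text)]


pipeline = [RandomGain(-6.0, 6.0, seed=1), RandomShift(-2.0, 2.0, seed=2)]
restored = load_pipeline(dump_pipeline(pipeline))
```

Each restored step draws the same parameters as its original, since the seed is just another serialized constructor argument.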

bmcfee commented 7 years ago

Coming off of the discussion in #62, it seems like the more useful version of this idea is to reconstruct a specific deformation sequence from a previous run of muda. This is useful when you have the original audio, deformed jams, and want to rebuild the corresponding deformed audio.

I'm having a hard time thinking of any other reproducibility use cases that can/should be powered by the deformation history of individual outputs.

I specifically don't see the utility in reconstructing a muda pipeline from an output's deformation history. Given the interactions between union, bypass, and pipeline, I'm not sure this is even possible: you'll only get the deformers that actually executed to form this output, not the actual deformation stack. I think encouraging folks to try to abstract up from an instance to the pipeline is an anti-pattern; instead, we should encourage folks to save their pipeline objects alongside the outputs if they want to run further deformations on new data.

So I suggest this issue be consolidated into two enhancements:

  1. Implement the audio re-deformer, as described in #62. This is a minor-revision change.
  2. Add RNG seeds to all randomized deformer objects so that serialized pipelines can be reproduced exactly. This is a major-revision change.

These two enhancements are independent. Because the deformation history never records randomized objects (only their deterministic parent class), and all state is preserved in the history, you can get reproducibility of randomized deformations for free even without storing the seed. (This, of course, is just for audio re-deformation, not for re-running a deformation sweep on a dataset.)
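A sketch of why re-deformation needs no seed, assuming a history that stores a deterministic transformer name and its state for each step. The entry layout, the `Gain` and `Delay` operations, and the list-of-floats "audio" are simplifications invented for this example.

```python
def apply_gain(samples, gain_db):
    """Scale the samples by a gain given in decibels."""
    scale = 10.0 ** (gain_db / 20.0)
    return [x * scale for x in samples]


def apply_delay(samples, n_zeros):
    """Prepend silence to the samples."""
    return [0.0] * n_zeros + list(samples)


# Only deterministic operations are needed: the random draw already
# happened in the original run and its result is stored as state.
DETERMINISTIC = {"Gain": apply_gain, "Delay": apply_delay}


def redeform(samples, history):
    """Replay the recorded states, in order, on the original audio."""
    out = list(samples)
    for entry in history:
        out = DETERMINISTIC[entry["transformer"]](out, **entry["state"])
    return out


# Hypothetical history as it might be read back from a deformed JAMS file.
history = [
    {"transformer": "Gain", "state": {"gain_db": -6.0}},
    {"transformer": "Delay", "state": {"n_zeros": 2}},
]
rebuilt = redeform([0.5, -0.25, 1.0], history)
```

Nothing random runs during the replay, which is why the seed is irrelevant for this use case but still matters for re-running a full sweep.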

@ejhumphrey @justinsalamon what do y'all think?

justinsalamon commented 6 years ago

+💯 for the re-deformer. Indeed, it appears #62 surfaced precisely because I shared MUDA jams files (https://github.com/justinsalamon/UrbanSound8K-JAMS) for reproducibility, to avoid having to distribute the augmented version of US8K we used in our paper.

Happy to put together a PR, but no cycles on the near horizon :'(