TorchDSP / torchsig

TorchSig is an open-source signal processing machine learning toolkit based on the PyTorch data handling pipeline.
MIT License

Make generated `WidebandModulationsDataset` samples independent of call order #250

Open dustinlagoy opened 1 month ago

dustinlagoy commented 1 month ago

First, thanks for all the work on this project. It is a great resource!!!

Is your feature request related to a problem? Please describe.

If I use WidebandModulationsDataset to generate samples without a DataLoader (or DatasetLoader), the samples depend on the order in which they are generated. For example:

data = WidebandModulationsDataset(...)
assert data[0] == data[0]

will fail because the generated sample at index 0 (or any index) changes each time you call data[0]. Both the number and characteristics of the generated signals and the added noise change on each subsequent call. This makes using on-the-fly generation of samples difficult unless one can ensure they are always generated in the same order.
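To make the behavior concrete, here is a minimal sketch of how this kind of order dependence can arise. It is not TorchSig code: `SharedStateDataset` and its parameters are hypothetical, and it assumes only that every index draws from one shared random generator, which may or may not match what WidebandModulationsDataset does internally.

```python
import numpy as np


class SharedStateDataset:
    """Hypothetical toy dataset (not TorchSig): all indices share one RNG."""

    def __init__(self, seed=0, num_samples=8, length=16):
        self.rng = np.random.default_rng(seed)  # single generator for every call
        self.num_samples = num_samples
        self.length = length

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        # The index is never used to pick the random stream. Each call
        # advances the shared generator, so the result depends on how
        # many draws happened before this one.
        return self.rng.standard_normal(self.length)


data = SharedStateDataset()
first = data[0]
second = data[0]
same = np.array_equal(first, second)  # compare the two draws for index 0
```

In this sketch `same` comes out false, because the second call continues the random stream where the first call left off.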

Describe the solution you'd like

I think the dataset should generate the exact same sample for a given index regardless of any previous sample generation.
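One common way to get this property is to derive a fresh random generator from the pair (base seed, index) inside `__getitem__`. The sketch below illustrates that idea only. It is not the change made in the pull request and not TorchSig's API: `IndexSeededDataset`, its parameters, and the (signal count, noise) sample layout are assumptions made for illustration.

```python
import numpy as np


class IndexSeededDataset:
    """Hypothetical toy dataset (not TorchSig): one RNG per (seed, index)."""

    def __init__(self, seed=0, num_samples=8, length=16):
        self.seed = seed
        self.num_samples = num_samples
        self.length = length

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        # Build a generator from (seed, index). No state is carried over
        # between calls, so earlier calls cannot affect this sample.
        rng = np.random.default_rng([self.seed, index])
        num_signals = int(rng.integers(1, 4))    # stand-in for "number of signals"
        noise = rng.standard_normal(self.length)  # stand-in for the added noise
        return num_signals, noise


data = IndexSeededDataset()
a = data[0]
_ = data[5]  # an unrelated call in between
b = data[0]
repeatable = a[0] == b[0] and np.array_equal(a[1], b[1])
```

Here `repeatable` is true whatever calls happen in between, and different indices still get different random streams. The cost is constructing a generator on every call, which is usually small next to generating the signal itself.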

Describe alternatives you've considered

When generating samples to be written to disk, or when training with on-the-fly generation of samples, the data loader may ensure (it does, at least when writing samples to disk) that calls to WidebandModulationsDataset.__getitem__ happen in a consistent order, which works around this issue. This may be sufficient for all practical use cases.

Additional context

I opened a pull request (#249) with sufficient changes to fix this issue. I understand if this feature is not desirable. In that case it may be nice to make this behavior clear in the documentation somewhere.

MattCarrickPL commented 1 month ago

Thanks for submitting this. We are discussing internally.

ereoh commented 2 weeks ago

Hello! Just providing some updates.

We are currently doing a major overhaul and rewrite of our code for v1.0.0, which will hopefully be released by early next year. The main goal of this rewrite is to allow infinite datasets, or on-the-fly generation as you've described (with determinism).

Until then, your PR only seems to work for clean versions of wideband. If you are fine with that, I can merge it. Otherwise, you can wait for the rewrite.