Closed — JamesOwers closed this 5 years ago
lgtm. Made a few changes.
Regarding your TODOs:
- Implement rounding to nearest time_increment in `df_to_command_str`: DONE
- Handle cases where there is a longer pause than `max_time_shift` in `df_to_command_str` (just output multiple time shift commands in a row): DONE
- Overlapping notes mess up the conversion back from commands to df (see issue #20): complex, also see issues #46 and #8.
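The second TODO above can be sketched roughly like this — splitting a pause longer than `max_time_shift` into repeated time-shift commands. The function name and the `t<n>` command format here are illustrative assumptions, not mdtk's actual API:

```python
# Hedged sketch: cover a long pause with repeated time-shift commands.
# The "t<ticks>" command string is an assumed format for illustration.
def pause_to_time_shifts(pause, max_time_shift):
    """Return a list of time-shift command strings covering `pause` ticks."""
    shifts = []
    while pause > 0:
        step = min(pause, max_time_shift)
        shifts.append(f"t{step}")
        pause -= step
    return shifts

# e.g. a 250-tick pause with max_time_shift=100 -> ["t100", "t100", "t50"]
```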
I'll let you check over this and complete the pull whenever you're ready. In the meantime I'll branch off of this point and continue working.
Phew, that was a big one. I'm loving this process man, thanks.
I think the most important thing to sort out is the overlapping pitches problem. The pipeline for dataset creation and an example model should determine what we should do with the `--command` flag. I think I agree with you now, really — I've no idea how people will want to use this data!
At the moment, on the models branch, I've added a `--formats` flag, which takes any of [none, pianoroll, command] (like `--datasets`) and creates the pytorch Dataset csvs for whichever formats you select (all, by default).
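A flag like that might be wired up roughly as below — a sketch with argparse, not the actual `make_dataset.py` implementation; only the flag name and its values are taken from the description above:

```python
import argparse

# Illustrative sketch of a --formats flag with the values described above;
# the real make_dataset.py may define it differently.
FORMATS = ["none", "pianoroll", "command"]

parser = argparse.ArgumentParser()
parser.add_argument("--formats", nargs="*", choices=FORMATS, default=FORMATS,
                    help="which pytorch Dataset csv formats to create")

args = parser.parse_args(["--formats", "command"])  # args.formats == ['command']
```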
First draft here. Rerun `make_dataset.py` and you'll get new files `{train,valid,test}_cmd_corpus.csv`. These files are read by the dataset in `mdtk.pytorch_datasets` to produce instances at a given index. For example: each data item is a dictionary containing the degraded commands as integers (in a list), likewise for the clean commands, and the degradation label (I'm assuming that deg_label 0 will always mean 'not degraded', so we don't need an is_degraded flag).
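To make that concrete, here's a minimal sketch of such a dataset; the class and field names (`deg_cmd`, `clean_cmd`, `deg_label`) are assumptions for illustration, not the exact `mdtk.pytorch_datasets` schema. A torch Dataset only needs `__len__` and `__getitem__`, so plain Python suffices to show the shape of an item:

```python
# Hedged sketch of a command-corpus dataset returning dict items.
# Field names are illustrative, not mdtk's actual schema.
class CommandCorpus:
    def __init__(self, rows):
        # rows: (degraded_ids, clean_ids, deg_label) tuples, commands
        # already mapped to integer ids
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        deg, clean, label = self.rows[idx]
        # deg_label == 0 is reserved for 'not degraded'
        return {"deg_cmd": list(deg), "clean_cmd": list(clean),
                "deg_label": label}

corpus = CommandCorpus([([5, 9, 1], [5, 7, 1], 3)])
item = corpus[0]
# item == {'deg_cmd': [5, 9, 1], 'clean_cmd': [5, 7, 1], 'deg_label': 3}
```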
The vocabulary can be used to get the tokens back:
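Something along these lines — the `Vocab` class and token strings here are illustrative assumptions, not mdtk's actual vocabulary API; any out-of-vocabulary id falls back to `<unk>`:

```python
# Hedged sketch of mapping integer ids back to tokens, with <unk> fallback.
# Class and token names are assumptions for illustration.
class Vocab:
    def __init__(self, tokens):
        self.itos = ["<unk>"] + tokens               # id -> token string
        self.stoi = {t: i for i, t in enumerate(self.itos)}  # token -> id

    def decode(self, ids):
        return [self.itos[i] if 0 <= i < len(self.itos) else "<unk>"
                for i in ids]

vocab = Vocab(["o60", "t100", "f60"])
tokens = vocab.decode([1, 2, 3, 99])  # ['o60', 't100', 'f60', '<unk>']
```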
As you can see... that's rather more `<unk>` than I would like... We can do most of the tasks with this data, and you can use the transform to get torch versions of the data if required.
With this transformed data, the dataset can happily be used with the vanilla pytorch dataloader for batching:
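Since the command sequences have unequal lengths, batching needs padding — the sort of `collate_fn` one might pass to `torch.utils.data.DataLoader(..., collate_fn=pad_collate)`. The sketch below shows the padding logic in plain Python; the PAD id of 0 and the field names are assumptions:

```python
# Hedged sketch of a padding collate function for variable-length
# command sequences. PAD id and field names are assumptions.
PAD = 0

def pad_collate(batch):
    """batch: list of dicts with 'deg_cmd' lists of unequal length."""
    max_len = max(len(item["deg_cmd"]) for item in batch)
    padded = [item["deg_cmd"] + [PAD] * (max_len - len(item["deg_cmd"]))
              for item in batch]
    labels = [item["deg_label"] for item in batch]
    return {"deg_cmd": padded, "deg_label": labels}

batch = pad_collate([{"deg_cmd": [3, 7], "deg_label": 1},
                     {"deg_cmd": [5], "deg_label": 0}])
# batch["deg_cmd"] == [[3, 7], [5, 0]]
```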
Finally, you can use the `nn.Embedding` layer as the first layer of the net to get from these integers to one-hots (or learned vectors...). For an example, see how they do token embedding with BERT here.
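Mechanically, `nn.Embedding` is just a lookup: each integer id indexes a row of a learned weight matrix. A plain-Python sketch of that behaviour (sizes and weights here are arbitrary placeholders, not trained values):

```python
# Plain-Python sketch of what nn.Embedding(vocab_size, dim) does as a
# first layer: id -> row of a weight matrix. Values are placeholders.
import random

random.seed(0)
vocab_size, dim = 5, 4
weights = [[random.random() for _ in range(dim)] for _ in range(vocab_size)]

def embed(ids):
    # analogous to nn.Embedding(vocab_size, dim)(ids)
    return [weights[i] for i in ids]

vectors = embed([1, 3, 3])  # repeated ids share the same vector
```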