Closed alecristia closed 2 years ago
Is your feature request related to a problem? Please describe.
Datasets can be incremental (for instance, longitudinal datasets), so it makes sense that recordings get processed at different time points. We already provide "--skip-existing" (to skip files that are in raw and have already been processed) and "--recordings" (to select specific recordings, which allows e.g. re-processing of audio that was faulty or addition of other files). Processing files is accompanied by the creation of a time-stamped yaml file and the addition of rows to a recordings.csv file, which can be improved.

Describe the solution you'd like
- recordings.csv should include, for each row, which yaml was used
- in the case of --skip-existing, we don't want to change recordings.csv, but when using the --recordings switch, we may want to remove rows corresponding to recordings that are being reprocessed?
- this one is tricky: it looks like we changed something about the variables that are stored in that file, because the cells don't match up here (other yaml dated 20210717):
converted_filename,original_filename,success,error
...
child40/child40_6.wav,child40/child40_6.wav,True,
,child9/REC001.WAV ,False,b'./recordings/raw/child9/REC001.WAV : No such file or directory\n'
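
To make the first two points concrete, here is a minimal sketch of how rows could be tagged with the yaml that produced them and replaced on reprocessing. It is not the project's actual code: the function name `update_recordings` and the `parameters` column are hypothetical, and rows are assumed to be dictionaries keyed by the column names shown above.

```python
def update_recordings(existing, new_rows, parameters_file):
    """Merge freshly processed rows into the existing recordings table.

    Rows are keyed by original_filename, so a reprocessed recording
    replaces its old row instead of being appended a second time.
    Each new row records which yaml was used (hypothetical column name).
    """
    merged = {row["original_filename"]: row for row in existing}
    for row in new_rows:
        merged[row["original_filename"]] = {**row, "parameters": parameters_file}
    return list(merged.values())
```

With --skip-existing, `new_rows` would simply not contain the recordings that were skipped, so their rows stay untouched.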
converted_filename is empty, because the conversion failed. The reason is that the file does not exist. The extra space character at the end of child9/REC001.WAV seems suspicious. Does it also appear in the metadata?
- This is supposed to be the case (if a file is re-processed, this should update the entry in recordings.csv)
confirmed, this is what happens! it occurred to me, however, that perhaps if a file is reprocessed, and the reprocessing leads to an error, whereas the original was NOT an error, we may want to be nice with the human, and (a) avoid deleting the audio file as well as (b) avoid deleting the row but still (c) let them know something went wrong? I'm not totally sure about this -- it may create more problems than it solves.
the space issue... good point! YES that's it, the metadata/recordings.csv has a space -- which is a human error, so nothing to worry about on your end.
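
For anyone hitting the same thing, a stray space like this can be spotted before processing. The sketch below uses only the standard library; the function name `find_padded_filenames` is hypothetical, and the column name `recording_filename` is an assumption about how metadata/recordings.csv is laid out.

```python
import csv
import io


def find_padded_filenames(csv_text, column="recording_filename"):
    """Return the values of `column` that have leading or trailing whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if row[column] != row[column].strip()]
```

The `csv` module keeps trailing spaces as part of the field, which is why the padded filename ends up in the path that the conversion tries to open.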
I think an error should not overwrite anything but instead display meaningful error messages. The commit I just posted should do that.
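
As an illustration of that policy (not the code from the commit), here is a minimal sketch in which a failed reprocessing never replaces a previously successful row and a message is reported instead. The function name `merge_keeping_successes` is hypothetical, and `success` is assumed to hold the strings "True" or "False" as in the recordings.csv excerpt above.

```python
def merge_keeping_successes(existing, new_rows):
    """Merge new rows into existing ones without letting errors overwrite successes.

    Returns the merged rows and a list of human-readable messages for the
    recordings whose reprocessing failed after an earlier success.
    """
    merged = {row["original_filename"]: row for row in existing}
    messages = []
    for row in new_rows:
        key = row["original_filename"]
        previous = merged.get(key)
        failed = row["success"] != "True"
        if failed and previous is not None and previous["success"] == "True":
            # Keep the earlier successful entry and tell the user what went wrong.
            messages.append(
                f"{key}: reprocessing failed ({row['error']}); "
                "keeping the previous successful entry"
            )
            continue
        merged[key] = row
    return list(merged.values()), messages
```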