keep track of which recordings were processed how

alecristia commented 2 years ago

Is your feature request related to a problem? Please describe.

Datasets can be incremental (for instance, longitudinal datasets), so it makes sense for recordings to be processed at different points in time. We already provide "--skip-existing" (to skip files that are in raw and have already been processed) and "--recordings" (to select specific recordings, which allows e.g. re-processing audio that was faulty, or adding other files). Processing files is accompanied by the creation of a time-stamped yaml file and the addition of rows to a recordings.csv file, and this record-keeping can be improved.

Describe the solution you'd like

1. recordings.csv should include, for each row, which yaml was used
2. in the case of --skip-existing, we don't want to change recordings.csv, but when using the --recordings switch, we may want to remove rows corresponding to recordings that are being reprocessed?
3. this one is tricky: it looks like we changed something about the variables that are stored in that file, because the cells don't match up here (other yaml dated 20210717):

converted_filename,original_filename,success,error
...
child40/child40_6.wav,child40/child40_6.wav,True,
,child9/REC001.WAV ,False,b'./recordings/raw/child9/REC001.WAV : No such file or directory\n'
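To make the first two requests above concrete, here is a minimal sketch in plain Python. It is not ChildProject's actual implementation: the `parameters` column, the `merge_results` and `to_csv` helpers, and the yaml file names in the usage example are all hypothetical. The idea is that each row records which parameters file produced it, and a reprocessed recording replaces its old row instead of adding a second one.

```python
import csv
import io

# Columns from the example above, plus a hypothetical "parameters" column
# naming the yaml file that was used to process each recording.
FIELDS = ["converted_filename", "original_filename", "success", "error", "parameters"]


def merge_results(existing_rows, new_rows, parameters_file):
    """Return existing_rows updated with new_rows.

    Rows are keyed by original_filename, so a reprocessed recording
    replaces its previous row. Every new row is tagged with the name of
    the parameters file that produced it.
    """
    merged = {row["original_filename"]: dict(row) for row in existing_rows}
    for row in new_rows:
        updated = dict(row)
        updated["parameters"] = parameters_file
        merged[updated["original_filename"]] = updated
    return list(merged.values())


def to_csv(rows):
    """Serialize rows to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Usage: child40_6.wav is reprocessed with a newer (hypothetical) yaml.
existing = [{
    "converted_filename": "child40/child40_6.wav",
    "original_filename": "child40/child40_6.wav",
    "success": "True",
    "error": "",
    "parameters": "parameters_old.yml",
}]
new = [{
    "converted_filename": "child40/child40_6.wav",
    "original_filename": "child40/child40_6.wav",
    "success": "True",
    "error": "",
}]
rows = merge_results(existing, new, "parameters_new.yml")
print(to_csv(rows))
```

With --skip-existing, `new_rows` would only contain recordings that have no row yet, so existing rows would be left untouched.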

lucasgautheron commented 2 years ago

1. Good idea, will do!
2. This is supposed to be the case already: if a file is re-processed, its entry in recordings.csv should be updated.
3. This example seems normal to me: converted_filename is empty because the conversion failed, and it failed because the file does not exist. The extra space character at the end of child9/REC001.WAV seems suspicious, though. Does it also appear in the metadata?
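A quick way to check for this kind of stray whitespace is sketched below. It only uses the standard library and is not part of ChildProject; `find_padded_filenames` is a hypothetical helper, and the column name `recording_filename` and the sample rows are assumptions made for illustration.

```python
import csv
import io


def find_padded_filenames(csv_text, column="recording_filename"):
    """Return the values of `column` that have leading or trailing whitespace."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[column]
        for row in reader
        if row[column] != row[column].strip()
    ]


# Hypothetical metadata excerpt; note the space after "REC001.WAV".
sample = (
    "recording_filename,child_id\n"
    "child9/REC001.WAV ,9\n"
    "child40/child40_6.wav,40\n"
)
padded = find_padded_filenames(sample)
print(padded)
```

Any value reported here would not match the file on disk, which would explain a "No such file or directory" error like the one above.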

alecristia commented 2 years ago

This is supposed to be the case (if a file is re-processed, this should update the entry in recordings.csv)

confirmed, this is what happens! it occurred to me, however, that if a file is reprocessed and the reprocessing leads to an error, whereas the original processing did NOT, we may want to be nice to the human and (a) avoid deleting the audio file, (b) avoid deleting the row, but still (c) let them know something went wrong? I'm not totally sure about this -- it may create more problems than it solves.

the space issue... good point! YES that's it, the metadata/recordings.csv has a space -- which is a human error, so nothing to worry about on your end.

lucasgautheron commented 2 years ago

https://github.com/LAAC-LSCP/ChildProject/commit/996dd19fed0c116dba322635f049f68aadf120fb

lucasgautheron commented 2 years ago

I think a failed reprocessing should not overwrite anything, and should instead display a meaningful error message. The commit I just posted should do that.
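As an illustration of that policy (a sketch only, not taken from the commit; `apply_result` and the row layout are hypothetical), a failed reprocessing can leave the previous successful entry in place and warn the user instead:

```python
import warnings


def apply_result(entries, new_row):
    """Store new_row in entries unless it would replace a success with a failure.

    entries maps original_filename to the latest row for that recording.
    Returns True if the entry was written, False if the old one was kept.
    """
    key = new_row["original_filename"]
    previous = entries.get(key)
    if not new_row["success"] and previous is not None and previous["success"]:
        # Keep the earlier successful row and tell the user what went wrong.
        warnings.warn(
            f"reprocessing of {key} failed ({new_row['error']}); "
            "keeping the previous successful entry"
        )
        return False
    entries[key] = new_row
    return True


# Usage: a previously converted recording fails on the second attempt.
entries = {
    "child40/child40_6.wav": {
        "original_filename": "child40/child40_6.wav",
        "converted_filename": "child40/child40_6.wav",
        "success": True,
        "error": "",
    }
}
failed = {
    "original_filename": "child40/child40_6.wav",
    "converted_filename": "",
    "success": False,
    "error": "No such file or directory",
}
written = apply_result(entries, failed)
```

The same guard would apply to the converted audio file: it would only be replaced once the new conversion has succeeded.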

LAAC-LSCP/ChildProject, issue #276