LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

keep track of which recordings were processed how #276

Closed alecristia closed 2 years ago

alecristia commented 2 years ago

Is your feature request related to a problem? Please describe.

Datasets can be incremental (longitudinal datasets, for instance), so it makes sense for recordings to be processed at different time points. We already provide "--skip-existing" (to skip files that are in raw and have already been processed) and "--recordings" (to select specific recordings, e.g. to re-process audio that was faulty or to add other files). Processing is accompanied by the creation of a time-stamped yaml file and the addition of rows to a recordings.csv file, and this record-keeping could be improved.

Describe the solution you'd like

converted_filename,original_filename,success,error
...
child40/child40_6.wav,child40/child40_6.wav,True,
,child9/REC001.WAV ,False,b'./recordings/raw/child9/REC001.WAV : No such file or directory\n'

lucasgautheron commented 2 years ago

Describe the solution you'd like

  • recordings.csv should include, for each row, which yaml was used
  • in the case of --skip-existing, we don't want to change recordings.csv, but when using the --recordings switch, we may want to remove rows corresponding to recordings that are being reprocessed?
  • this one is tricky: it looks like we changed something about the variables that are stored in that file, because the cells don't match up here (other yaml dated 20210717):

converted_filename,original_filename,success,error
...
child40/child40_6.wav,child40/child40_6.wav,True,
,child9/REC001.WAV ,False,b'./recordings/raw/child9/REC001.WAV : No such file or directory\n'
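
The first request in the list above (recording which yaml produced each row) could look roughly like the sketch below. It is only an illustration, not ChildProject's implementation: the `parameters` column name, the `tag_rows` helper and the yaml filename are all hypothetical.

```python
import csv
import io


def tag_rows(rows, parameters_file):
    """Return copies of the result rows, each annotated with the yaml that produced it.

    `parameters` is a hypothetical column name, not one defined by ChildProject.
    """
    return [{**row, "parameters": parameters_file} for row in rows]


# One row shaped like the recordings.csv excerpt quoted in this thread.
rows = [
    {
        "converted_filename": "child40/child40_6.wav",
        "original_filename": "child40/child40_6.wav",
        "success": "True",
        "error": "",
    }
]

# Hypothetical yaml filename, used for illustration only.
tagged = tag_rows(rows, "parameters_20210717.yml")

# Serialize the annotated rows back to CSV text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(tagged[0].keys()))
writer.writeheader()
writer.writerows(tagged)
csv_text = buffer.getvalue()
```

With a column like this, each row stays traceable to one processing run even when a dataset is processed incrementally.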

alecristia commented 2 years ago
  • This is supposed to be the case (if a file is re-processed, this should update the entry in recordings.csv)

Confirmed, this is what happens! It occurred to me, however, that if a file is reprocessed and the reprocessing leads to an error, whereas the original was NOT an error, we may want to be nice to the human and (a) avoid deleting the audio file, (b) avoid deleting the row, but still (c) let them know something went wrong? I'm not totally sure about this -- it may create more problems than it solves.

the space issue... good point! YES that's it, the metadata/recordings.csv has a space -- which is a human error, so nothing to worry about on your end.
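
A stray space like this one is easy to miss by eye. The sketch below shows one way to flag it before processing; `find_padded_filenames` is a hypothetical helper, not part of ChildProject, and its default column name is taken from the excerpt quoted in this thread, so it may need changing for other files.

```python
import csv
import io


def find_padded_filenames(csv_text, column="original_filename"):
    """Return (line_number, value) pairs whose value has leading or trailing whitespace.

    Line numbers assume one record per line (no multi-line quoted fields).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        value = row.get(column) or ""
        if value != value.strip():
            problems.append((line_number, value))
    return problems


# Sample shaped like the excerpt above; note the space after "REC001.WAV".
sample = (
    "converted_filename,original_filename,success,error\n"
    "child40/child40_6.wav,child40/child40_6.wav,True,\n"
    ",child9/REC001.WAV ,False,not found\n"
)
padded = find_padded_filenames(sample)
```

Here `padded` contains the single offending entry, with the trailing space preserved so it can be shown to the user.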

lucasgautheron commented 2 years ago

https://github.com/LAAC-LSCP/ChildProject/commit/996dd19fed0c116dba322635f049f68aadf120fb

lucasgautheron commented 2 years ago

I think an error should not overwrite anything but instead display meaningful error messages. The commit I just posted should do that.
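
To make that policy concrete, here is a minimal sketch of "a failure never overwrites a previous success". It is not taken from the linked commit; `merge_results` and the row layout (dicts with `original_filename`, `success` and `error` keys) are assumptions made for illustration.

```python
def merge_results(existing, new_results):
    """Merge new processing results into existing rows, keyed by original_filename.

    A failed run never replaces a previously successful row; it is reported instead.
    Returns (merged_rows, warnings).
    """
    merged = {row["original_filename"]: row for row in existing}
    warnings = []
    for row in new_results:
        key = row["original_filename"]
        previous = merged.get(key)
        if not row["success"] and previous is not None and previous["success"]:
            # Keep the earlier successful row and tell the user what went wrong.
            warnings.append(
                f"{key}: reprocessing failed ({row['error']}); keeping previous result"
            )
            continue
        merged[key] = row
    return list(merged.values()), warnings


existing = [{"original_filename": "a.wav", "success": True, "error": ""}]
new_results = [
    {"original_filename": "a.wav", "success": False, "error": "No such file or directory"},
    {"original_filename": "b.wav", "success": True, "error": ""},
]
merged_rows, warnings = merge_results(existing, new_results)
```

In this example the successful row for `a.wav` is kept, `b.wav` is added, and one warning is returned for the failed reprocessing attempt.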