LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

`merge_sets` creates duplicates in `annotations.csv` #384

Closed William-N-Havard closed 1 year ago

William-N-Havard commented 1 year ago

When merge_sets is run several times to merge sets together (see below), duplicate lines are created in annotations.csv.

am.merge_sets(
    left_set="vtc",
    right_set="alice",
    left_columns=["speaker_type"],
    right_columns=["phonemes", "syllables", "words"],
    output_set="alice_vtc",
)

Duplicate lines in annotations.csv after running merge_sets twice:

alice_vtc,14T_0_20220422_151000.wav,0,0,444797,"all.rttm,ALICE_output_utterances.txt",,14T_0_20220422_151000,14T_0_20220422_151000_0_444797.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,234T_0_20220425_135300.wav,0,0,413106,"all.rttm,ALICE_output_utterances.txt",,234T_0_20220425_135300,234T_0_20220425_135300_0_413106.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220321_095400.wav,0,0,120857,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220321_095400,2_FG_20220321_095400_0_120857.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220325_074500.wav,0,0,239157,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220325_074500,2_FG_20220325_074500_0_239157.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195000.wav,0,0,187617,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195000,2_FG_20220419_195000_0_187617.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195059.wav,0,0,344957,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195059,2_FG_20220419_195059_0_344957.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,6T_0_20220425_135300.wav,0,0,383117,"all.rttm,ALICE_output_utterances.txt",,6T_0_20220425_135300,6T_0_20220425_135300_0_383117.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,12T_0_20220425_185200.wav,0,0,145197,"all.rttm,ALICE_output_utterances.txt",,12T_0_20220425_185200,12T_0_20220425_185200_0_145197.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,23T_0_20220425_135300.wav,0,0,101437,"all.rttm,ALICE_output_utterances.txt",,23T_0_20220425_135300,23T_0_20220425_135300_0_101437.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,14T_0_20220422_151000.wav,0,0,444797,"all.rttm,ALICE_output_utterances.txt",,14T_0_20220422_151000,14T_0_20220422_151000_0_444797.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,234T_0_20220425_135300.wav,0,0,413106,"all.rttm,ALICE_output_utterances.txt",,234T_0_20220425_135300,234T_0_20220425_135300_0_413106.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220321_095400.wav,0,0,120857,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220321_095400,2_FG_20220321_095400_0_120857.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220325_074500.wav,0,0,239157,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220325_074500,2_FG_20220325_074500_0_239157.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195000.wav,0,0,187617,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195000,2_FG_20220419_195000_0_187617.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,2_FG_20220419_195059.wav,0,0,344957,"all.rttm,ALICE_output_utterances.txt",,2_FG_20220419_195059,2_FG_20220419_195059_0_344957.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,6T_0_20220425_135300.wav,0,0,383117,"all.rttm,ALICE_output_utterances.txt",,6T_0_20220425_135300,6T_0_20220425_135300_0_383117.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,12T_0_20220425_185200.wav,0,0,145197,"all.rttm,ALICE_output_utterances.txt",,12T_0_20220425_185200,12T_0_20220425_185200_0_145197.csv,2022-07-28 11:58:41,0.0.5,
alice_vtc,23T_0_20220425_135300.wav,0,0,101437,"all.rttm,ALICE_output_utterances.txt",,23T_0_20220425_135300,23T_0_20220425_135300_0_101437.csv,2022-07-28 11:58:41,0.0.5,

These lines should be dropped before (re)merging the sets and adding the resulting new annotation lines.
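To make the suggestion concrete, here is a minimal sketch of dropping the stale lines before appending the new ones. It is not ChildProject's actual code: the helper `replace_set_rows` is hypothetical, the index is modelled as a plain list of dicts, and the column name `set` is an assumption based on the first field of the lines above.

```python
def replace_set_rows(index_rows, new_rows, output_set):
    """Return the index with every old row of `output_set` replaced by `new_rows`.

    Hypothetical helper: `index_rows` and `new_rows` are lists of dicts,
    and the "set" key is an assumed column name.
    """
    # Keep only the rows that belong to other sets.
    kept = [row for row in index_rows if row["set"] != output_set]
    # Append the freshly merged rows, so the output set appears only once.
    return kept + list(new_rows)


# Toy index: one row left over from a previous merge into "alice_vtc".
index_rows = [
    {"set": "vtc", "recording_filename": "rec1.wav"},
    {"set": "alice", "recording_filename": "rec1.wav"},
    {"set": "alice_vtc", "recording_filename": "rec1.wav"},
]
new_rows = [{"set": "alice_vtc", "recording_filename": "rec1.wav"}]

updated = replace_set_rows(index_rows, new_rows, "alice_vtc")
```

With this approach, rerunning the merge leaves a single line per recording for the output set instead of one line per run.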

LoannPeurey commented 1 year ago

The merge overwrites the previously generated set files without requiring an 'overwrite' argument and without any warning, which can easily cause problems if the user is not careful. We could enforce that the output_set must not already exist. Is it possible for a set to be made up of multiple different merges? (It sounds to me like each merge should have its own set.) In this case, rerunning your merge would require removing the previously generated set first, or we could add a 'replace-set' argument to the merging function that performs the removal prior to merging.

What do you think?

William-N-Havard commented 1 year ago

Yes, I think the best option would be to raise an error if the set already exists, so that the user has to delete it first and then re-merge. Another problem with set merging is that the resulting set can become outdated if new files are added to one of the sets that was used for the merge.
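A rough sketch of such a check, under stated assumptions: `check_output_set`, `SetAlreadyExistsError` and the `replace` flag are all hypothetical names used for illustration (the flag stands in for the 'replace-set' idea), and none of them is part of the existing ChildProject API.

```python
class SetAlreadyExistsError(ValueError):
    """Hypothetical error raised when the output set is already in the index."""


def check_output_set(existing_sets, output_set, replace=False):
    """Decide whether a merge into `output_set` may proceed.

    Returns True when an existing set has to be removed before merging,
    False when the set does not exist yet. Raises if the set exists and
    `replace` was not requested.
    """
    exists = output_set in existing_sets
    if exists and not replace:
        raise SetAlreadyExistsError(
            f"set '{output_set}' already exists; delete it or pass replace=True"
        )
    return exists


existing_sets = {"vtc", "alice", "alice_vtc"}

# A set that does not exist yet can be merged into directly.
needs_removal_new = check_output_set(existing_sets, "alice_vtc_v2")

# An explicit replace request signals that the old set must be removed first.
needs_removal_old = check_output_set(existing_sets, "alice_vtc", replace=True)

# Without the flag, merging into an existing set is refused.
try:
    check_output_set(existing_sets, "alice_vtc")
    refused = False
except SetAlreadyExistsError:
    refused = True
```

Failing by default keeps the current silent overwrite from happening, while the opt-in flag covers the case where rerunning the merge is intentional.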

LoannPeurey commented 1 year ago

yeah, tracking down outdated sets will be kind of hard to do

LoannPeurey commented 1 year ago

#385