switching to integer segments onset and offset ?

lucasgautheron commented 3 years ago

Is your feature request related to a problem? Please describe.

Currently, segment_onset and segment_offset are treated as floats and expressed in seconds. Sometimes, when a dataframe is overwritten, some of the values change because of the limits of floating-point representation. For instance, 3600 might become 3600.000000001.

There are a number of options:

1. Do not care, since we assume these fluctuations will remain negligible. Cons:
   - these fluctuations appear in metadata/annotations diffs when the index is updated
   - every manipulation that involves exact matching of segments requires casting them to integers beforehand
2. Use integers instead, expressed in milliseconds. Cons:
   - millisecond accuracy is arbitrary. ALICE uses tenths of milliseconds iirc.

Whatever we decide, I think we should always be consistent and express all times in the same unit (seconds or milliseconds).
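
To illustrate the second option, here is a minimal sketch of what converting to integer milliseconds could look like. The helper name `seconds_to_ms` is hypothetical and not part of ChildProject; it only shows why exact matching becomes safe once times are integers.

```python
def seconds_to_ms(seconds: float) -> int:
    """Convert a time in seconds to a whole number of milliseconds."""
    # round() rather than int(): int() truncates, so a value that lands
    # just below a whole millisecond would otherwise lose 1 ms.
    return round(seconds * 1000)


# Float arithmetic in seconds is not exact...
onset_s = 0.1 + 0.2
print(onset_s == 0.3)  # False

# ...whereas the same arithmetic in integer milliseconds is.
onset_ms = seconds_to_ms(0.1) + seconds_to_ms(0.2)
print(onset_ms == seconds_to_ms(0.3))  # True

# A value that drifted after a rewrite maps back to the intended integer.
print(seconds_to_ms(3600.000000001))  # 3600000
```

Rounding at conversion time means the drift only has to be handled once, instead of before every comparison.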

alecristia commented 3 years ago

this could be really problematic in the future, for instance with apparent changes in files when person A versus person B commits, simply due to differences in their OS.

So this is a big reason for integers, leaving the question of milliseconds versus a smaller unit.

I propose we use the smallest unit currently used -- if ALICE uses 10ths of ms, then that. (Though I'm surprised by that -- speech tech systems typically use 10 ms as minimal units, i.e. .01 seconds, and not .0001 seconds -- so perhaps double check ALICE?)

Note, if it turns out ALICE, like VTC, uses 10 ms, let's nonetheless use ms, since human expert annotators often use ms.

lucasgautheron commented 3 years ago

By the way: with its default NumPy-backed dtypes, pandas does not support integer columns with NaN values. They will always be cast to floats, which may be a problem if we decide to switch to integers, though I think these columns should never have NaN values.
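
A small sketch of that behaviour, with made-up values. It also shows the nullable `Int64` extension dtype (available since pandas 0.24), which might be a way around the problem if missing values ever did appear; whether it fits the project is an open question.

```python
import numpy as np
import pandas as pd

# With the default NumPy-backed dtype, one missing value turns the
# whole column into floats.
onsets = pd.Series([0, 1500, np.nan])
print(onsets.dtype)  # float64

# The nullable extension dtype "Int64" (capital I) keeps the integers
# and stores the missing entry as pd.NA instead.
nullable_onsets = pd.Series([0, 1500, None], dtype="Int64")
print(nullable_onsets.dtype)  # Int64
```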

lucasgautheron commented 3 years ago

Quoting ALICE's README:

Timestamps appended to the filenames are of form <onset_time_in_ms x 10>_<offset_time_in_ms x 10>, as measured from the beginning of each audio file. For instance, _00062740_00096150.wav stands for an utterance that started at 6.274 seconds and ended at 9.615 seconds.

So their examples have millisecond accuracy, but are stored as integers in tenths of milliseconds for some reason.
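
For reference, a rough sketch of how such filenames could be turned into integer milliseconds. The function name, the regular expression and the `recording` prefix are my own assumptions based on the README example above, not something ALICE or ChildProject provides.

```python
import re
from typing import Tuple

# Assumed pattern: two digit fields at the end of the name, each a time
# in tenths of a millisecond, as in the README example.
TIMESTAMP_RE = re.compile(r"_(\d+)_(\d+)\.wav$")


def parse_alice_timestamps(filename: str) -> Tuple[int, int]:
    """Return (onset_ms, offset_ms) as integers (hypothetical helper)."""
    match = TIMESTAMP_RE.search(filename)
    if match is None:
        raise ValueError(f"no timestamps found in {filename!r}")
    onset_tenths, offset_tenths = (int(group) for group in match.groups())
    # Integer division drops the last digit, which is 0 in the README example.
    return onset_tenths // 10, offset_tenths // 10


print(parse_alice_timestamps("recording_00062740_00096150.wav"))  # (6274, 9615)
```

If the last digit were ever non-zero, the integer division would silently discard sub-millisecond information, which is exactly the trade-off being discussed here.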

The question is: are we ever going to need to define segments with better than millisecond accuracy?

LAAC-LSCP / ChildProject

switching to integer segments onset and offset ? #149