LAAC-LSCP / solomon-analysis


Questions about report #1

Open alecristia opened 3 years ago

alecristia commented 3 years ago

These Qs refer to this version of Voice type classifier stability.pdf

Data inclusion

In order to test this hypothesis, we calculate the shift between the two audios of each pair at every hour, computing their cross-correlation in 5-minute blocks. Let s_ij be the shift between the audios of the i-th pair at the j-th hour. We exclude all pairs for which σ(s_i), the standard deviation of s_ij over j for a fixed i, is higher than a certain threshold (2 seconds, a little higher than the observed audio drift of the recordings). [...] This excludes all pairs of audios that do not perfectly match during their whole duration, amounting to 50 out of 175.
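The block-wise exclusion criterion quoted above could be sketched roughly as below. This is a hedged reconstruction, not the report's actual code: the sample rate, block size, helper names, and lag sign convention are all assumptions (only the spread of the shifts matters for the criterion).

```python
import numpy as np

def block_shift(a, b, sr):
    """Relative shift (in seconds) between two blocks, taken from the
    cross-correlation peak. Sign convention is arbitrary here; only the
    spread of shifts across blocks matters for the exclusion rule."""
    n = min(len(a), len(b))
    a0 = a[:n] - np.mean(a[:n])
    b0 = b[:n] - np.mean(b[:n])
    lag = np.argmax(np.correlate(a0, b0, mode="full")) - (n - 1)
    return lag / sr

def pair_is_synced(audio1, audio2, sr, block_s=300, threshold_s=2.0):
    """Keep a pair only if the standard deviation of its per-block
    shifts (s_ij for fixed i) stays below the 2-second threshold."""
    block = block_s * sr
    n_blocks = min(len(audio1), len(audio2)) // block
    shifts = [block_shift(audio1[k * block:(k + 1) * block],
                          audio2[k * block:(k + 1) * block], sr)
              for k in range(n_blocks)]
    return float(np.std(shifts)) <= threshold_s
```

A pair with a constant offset passes (σ(s_i) ≈ 0); a pair whose shift jumps around between blocks, as after a bad merge, fails.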

This has changed from a previous version, in which files that had been 'glued together' were considered suspect (rightly so), yet we end up with 50 pairs removed in both cases. So are these the same files? (i.e., do desyncing and gluing together raise the alarm on the same pairs?)

Also, we wouldn't want to lose the 50 babies, so how can we decide which USB is more trustworthy? Can we simply use the file with the longest duration? Right now, rec 1 and rec 2 were assigned to the two recordings randomly, but perhaps the patterns will be clearer if rec 1 is the "primary" (longer duration) and rec 2 the "secondary" (shorter duration and/or glued together).
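The "primary = longest" relabeling could be as simple as the following pandas sketch; `child_id`, `duration`, and `rec` are made-up column names, not the project's actual metadata schema.

```python
import pandas as pd

def assign_primary(recs: pd.DataFrame) -> pd.DataFrame:
    """For each child, rank recordings by duration:
    rec 1 = primary (longest), rec 2 = secondary."""
    out = recs.copy()
    out["rec"] = (out.groupby("child_id")["duration"]
                     .rank(ascending=False, method="first")
                     .astype(int))
    return out
```

`method="first"` breaks duration ties deterministically by row order, so each child always gets exactly one rec 1.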

If that option doesn't feel right, a more complicated approach is to use the shift patterns to determine which rec is more trustworthy, exploiting the fact that if a rec is more complete, it will have sections that the other rec lacks.
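One crude proxy for "has sections the other rec doesn't" is to count blocks where one recording has signal while the other is near-silent. This energy-based heuristic is my own substitution for the shift-pattern idea above, and the sample rate, block size, and silence threshold are all assumptions.

```python
import numpy as np

def exclusive_blocks(a, b, sr, block_s=300, silence_rms=1e-3):
    """Count blocks where one recording is near-silent but the other
    is not; the rec with fewer 'missing' blocks is more complete."""
    block = block_s * sr
    n_blocks = min(len(a), len(b)) // block
    missing_a = missing_b = 0
    for k in range(n_blocks):
        ra = np.sqrt(np.mean(a[k * block:(k + 1) * block] ** 2))
        rb = np.sqrt(np.mean(b[k * block:(k + 1) * block] ** 2))
        if ra < silence_rms <= rb:
            missing_a += 1
        elif rb < silence_rms <= ra:
            missing_b += 1
    return missing_a, missing_b
```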

(My main aim here is to have data from as many children as possible, and to minimize inclusion of messy data.)

Details on descriptives (section 5)

lucasgautheron commented 3 years ago

Regarding data inclusion

I think we should consider quality flags for recordings. In this case, the flag would be `might_feature_gaps`, and recordings for which this flag is set to 1 would be discarded when relevant.
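Mechanically, the filter would just be a boolean mask on the metadata table; `might_feature_gaps` is the flag name proposed above, while the table layout and `recording` column are assumptions for illustration.

```python
import pandas as pd

def usable_recordings(recs: pd.DataFrame) -> pd.DataFrame:
    """Keep only recordings whose gap flag is not set."""
    return recs[recs["might_feature_gaps"] != 1]
```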

More generally (especially when we don't have redundant recordings), we could flag all audios that have undergone merges. But before doing so, I would like to hear from the authors about the following questions:

In any case, I am keeping the list of audios for which I had to perform merges. If you look at the commented bits of code in the notebook, you'll see I used this selection criterion to filter out mismatching pairs initially.

Regarding descriptives

alecristia commented 3 years ago

I can provide partial replies to this, and I don't think it'll be useful to ask the SolIs team about all of these -- so apologies for not forwarding all Qs.

RE: why were some audios originally split:

RE the descriptives, let's talk about it on Tue!

alecristia commented 3 years ago

one more thought:

Imagine a file gets split, and you're a tired RA, who's adding the prefixes to the files. You could make a mistake:

Can you think of ways of telling these mistakes apart?

I also thought that mistakes are probably 'grouped' (in that a tired person makes more mistakes). So another way to flag files as problematic is based on the typos we saw in the naming. Did you keep a record of those, and would that help us choose?

alecristia commented 3 years ago

one more thought on that: if I mix up the order of segments within a recording, VTC should still return a similar total quantity of speech -- but not if I mix segments across kids
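That check could be a simple tolerance comparison on total VTC speech per rec; the function name and the 20% tolerance below are assumptions, not a calibrated cutoff.

```python
def totals_consistent(speech_s_rec1: float, speech_s_rec2: float,
                      rel_tol: float = 0.2) -> bool:
    """True if the two recs' total speech durations (seconds) agree
    within rel_tol. Within-rec reordering should pass; a cross-kid
    mixup would likely fail."""
    longer = max(speech_s_rec1, speech_s_rec2)
    if longer == 0:
        return True
    return abs(speech_s_rec1 - speech_s_rec2) / longer <= rel_tol
```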