alecristia opened this issue 3 years ago
I think we should consider quality flags for recordings. In this case, the flag would be `might_feature_gaps`, and recordings for which this flag is set to 1 would be discarded when relevant.
In this case, and more generally (especially when we don't have redundant recordings), we could flag all audios that have undergone merges. But before doing so, I would like to hear from the authors about the following questions:
In any case, I am keeping the list of audios for which I had to perform merges. If you look at the commented bits of code in the notebook, you'll see I used this selection criterion to filter out mismatching pairs initially.
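For concreteness, the filtering could look something like this: a minimal sketch, assuming the recordings metadata lives in a CSV with a `might_feature_gaps` column (the path and column name are assumptions, not the notebook's actual code):

```python
import pandas as pd

recordings = pd.read_csv("metadata/recordings.csv")

# Keep a record of the merged audios before discarding them
merged_audios = recordings[recordings["might_feature_gaps"] == 1]
merged_audios.to_csv("merged_audios.csv", index=False)

# Discard flagged recordings for analyses where gaps would be a problem
clean = recordings[recordings["might_feature_gaps"] != 1]
```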
I can provide partial replies to this, and I don't think it'll be useful to ask the SolIs team about all of these - so apologies for not forwarding all the Qs.
RE: why were some audios originally split:
RE: the descriptives, let's talk about it on Tuesday!
one more thought:
Imagine a file gets split, and you're a tired RA who's adding the prefixes to the files. You could make a mistake:
Can you think of ways of telling these mistakes apart?
I also thought that mistakes are probably 'grouped' (in that a tired person makes more mistakes). So another way to flag files as problematic would be based on the typos we saw in the naming. Did you keep a record of those, and would that help us choose?
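To illustrate the kind of check I have in mind, here is a hedged sketch that flags filenames deviating from an expected naming convention; the pattern below (child ID, recorder ID, date) is purely hypothetical and the real convention may differ:

```python
import re

# Hypothetical convention: childID_recorderID_date.wav, e.g. AB001_USB1_20190314.wav
EXPECTED = re.compile(r"^[A-Z]{2}\d{3}_USB[12]_\d{8}\.wav$")

def flag_suspect_names(filenames):
    """Return the filenames that deviate from the expected convention."""
    return [name for name in filenames if not EXPECTED.match(name)]

# Example: the second name has a lowercase recorder ID, so it gets flagged
print(flag_suspect_names(["AB001_USB1_20190314.wav", "AB002_usb2_20190314.wav"]))
```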
one more thought on that: if I mix up the order within a recording, VTC should still return a similar quantity of speech -- but not if I mix across kids.
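A sketch of that sanity check, assuming per-recording VTC outputs are available as tables with a `duration` column (all names here are hypothetical, not the pipeline's actual API):

```python
import pandas as pd

def total_speech_seconds(vtc_segments: pd.DataFrame) -> float:
    """Sum the durations of all speech segments detected by VTC."""
    return vtc_segments["duration"].sum()

def pair_mismatch_ratio(rec1: pd.DataFrame, rec2: pd.DataFrame) -> float:
    """Relative difference in speech quantity between paired recordings.

    Reordering segments within a recording leaves this near 0; mixing
    segments across children should inflate it.
    """
    t1, t2 = total_speech_seconds(rec1), total_speech_seconds(rec2)
    return abs(t1 - t2) / max(t1, t2)

# Pairs with a large mismatch ratio are candidates for cross-child mix-ups;
# the 0.5 threshold is arbitrary and only for illustration:
# suspicious = [pid for pid, (r1, r2) in pairs.items()
#               if pair_mismatch_ratio(r1, r2) > 0.5]
```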
These Qs refer to this version of Voice type classifier stability.pdf
Data inclusion
This has changed from a previous version, in which files that had been 'glued together' were considered suspect (right?), but we end up with 50 pairs removed in both cases. So are these the same files? (i.e., does desyncing and gluing together lead to the same pairs raising alarms?)
Also, we wouldn't want to lose the 50 babies, so how can we decide which USB is more trustworthy? Can we simply use the file with the longest duration? Right now, we assigned rec 1 & 2 to the two recs randomly, but perhaps the patterns will be clearer if rec 1 is the "primary" (longer duration) and rec 2 is the "secondary" (shorter duration and/or glued together); there's a sketch of that assignment after the next paragraph.
If that option doesn't feel right, a more complicated thing we could try is to use the shift patterns to determine which rec is more trustworthy, exploiting the fact that if a rec is more complete, it will have sections that the other rec doesn't have.
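Here is a minimal sketch of the duration-based assignment, assuming a pandas DataFrame with one row per recording and hypothetical `child_id` / `duration` columns; this is an illustration of the idea, not the notebook's actual code:

```python
import pandas as pd

def assign_primary_secondary(recordings: pd.DataFrame) -> pd.DataFrame:
    """Within each child's pair, rank the longer recording as rec 1."""
    recordings = recordings.sort_values(
        ["child_id", "duration"], ascending=[True, False]
    ).copy()
    # rank 1 = longest recording for the child (primary), rank 2 = secondary
    recordings["rec_rank"] = recordings.groupby("child_id").cumcount() + 1
    return recordings
```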
(My main aim here is to have data from as many children as possible, and to minimize inclusion of messy data.)
Details on descriptives (section 5)