LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Zooniverse use cases #109

Closed: lucasgautheron closed this issue 3 years ago

lucasgautheron commented 3 years ago
| Project | Dataset | Sampling algorithm | Classification task(s) | Launch date | URL |
|---|---|---|---|---|---|
| Marvin's pilot | LENA/BabyLogger data | i) for each recorded child, extract 10 sections of 30 seconds that are NOT silence (according to a simple loudness detector); ii) cut these sections up into 500 ms clips. NOTE: the loudness detector score should be averaged across LENA and BabyLogger | child voc, female adult speech, male adult speech, junk | ASAP | https://www.zooniverse.org/lab/14957 (if you cannot access it, click "my projects" in the top-left menu of Zooniverse) |
| ERC WP1e/g | "all" | extract 350 vocs from CHI & FEM (VTC) for each recorded child, cut into 500 ms clips (note: it would be good if the "skip/exclude" procedure were in place) | 1) CHI/FEM/junk; 2) if CHI or FEM: crying, laughing, canonical, non-canonical | no rush | doesn't exist yet |
| zoo-phon (gold) | 1 pilot recording | if there are enough, pull out 250 randomly & 250 from high child-volubility 1-minute regions (the latter should be "consecutive vocs", i.e. take the top minute and pull out all the vocs from there, then move to the next minute, etc.) - NOT chunkified! | NONE! will be annotated in the lab | late Feb? | (none) |
| zoo-phon (pilot) | same pilot recording as above | same segments as above, but processed in three different ways: 1. the usual 500 ms chunks; 2. cut at a provided list of boundaries (coded by a human); 3. cut at a provided list of boundaries (coded by a machine) [NOTE: both lists are provided by collaborators] | will be set up by collaborator; involves transcribing using the International Phonetic Alphabet | probably in April (TBD by collaborator) | |
| zoo-phon | 10 randomly selected children from each of 5 corpora | TBD; depends on the results of the pilot, but will probably be based on a user-provided list of segments & boundaries | same as above | probably in September (TBD by collaborator) | |
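For illustration, here is a minimal sketch (outside ChildProject) of the first row's sampler: a simple RMS loudness detector keeps 30-second sections that are not silence, and each kept section is cut into 500 ms clips. The threshold, seed, and single-file handling are placeholder assumptions; in the real setup the loudness score would additionally be averaged across the LENA and BabyLogger recordings.

```python
# Sketch of the "Marvin's pilot" sampler: keep 30 s windows that are not
# silence according to a simple RMS loudness detector, then cut each kept
# window into 500 ms clips. Not ChildProject code; the silence threshold
# and mono downmix are illustrative assumptions.
import numpy as np
import soundfile as sf

WINDOW_S = 30.0    # candidate section length (seconds)
CLIP_S = 0.5       # chunk length uploaded to Zooniverse (seconds)
N_SECTIONS = 10    # sections to keep per recording

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def sample_chunks(wav_path, silence_rms=0.01, seed=0):
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:                      # downmix to mono
        audio = audio.mean(axis=1)
    win = int(WINDOW_S * sr)
    # score every non-overlapping 30 s window with the loudness detector;
    # in practice this score would be averaged across LENA and BabyLogger
    starts = np.arange(0, len(audio) - win, win)
    loud = [s for s in starts if rms(audio[s:s + win]) > silence_rms]
    if not loud:
        return [], sr
    rng = np.random.default_rng(seed)
    picked = rng.choice(loud, size=min(N_SECTIONS, len(loud)), replace=False)
    # cut each kept section into consecutive 500 ms clips
    clip = int(CLIP_S * sr)
    chunks = []
    for s in picked:
        for c in range(s, s + win - clip + 1, clip):
            chunks.append(audio[c:c + clip])
    return chunks, sr
```

Similarly, the zoo-phon (gold) row's high-volubility half could be drawn by ranking minutes by how many CHI vocalizations start in them (e.g. from VTC output) and pulling whole minutes from the top down. A hedged sketch, with the input column names assumed rather than taken from the package:

```python
# Sketch of the zoo-phon (gold) sampler: 250 random CHI vocs plus 250
# drawn from the busiest minutes, top minute first. Column names
# (speaker_type, segment_onset in ms) are assumptions.
import pandas as pd

def volubility_sample(vocs: pd.DataFrame, n=250, seed=0):
    chi = vocs[vocs["speaker_type"] == "CHI"].copy()
    random_half = chi.sample(n=min(n, len(chi)), random_state=seed)
    # rank minutes by how many CHI vocs start in them
    chi["minute"] = chi["segment_onset"] // 60000
    top_minutes = chi["minute"].value_counts().index
    busy_rows = []
    for minute in top_minutes:               # take the top minute, then the next...
        busy_rows.append(chi[chi["minute"] == minute])
        if sum(len(r) for r in busy_rows) >= n:
            break
    busy_half = pd.concat(busy_rows).head(n)
    return random_half, busy_half
```

Per the table, these zoo-phon (gold) segments would then be kept whole rather than chunkified.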
MarvinLvn commented 3 years ago

Updated with my info :)

alecristia commented 3 years ago

Updated the table above.

lucasgautheron commented 3 years ago

This is the roadmap I suggest:

  1. Define a flexible metadata format to store the information about the chunks. We are done (or almost) with that already, but we need to think it through carefully so that it works in the long run.
  2. Implement "custom segments" support in `child-project zooniverse extract-chunks` (see the sketch after this list). That way, users can do their own magic for the sampling and still use our tool for the extraction and upload of chunks.
  3. I'll write scripts to implement the most urgent sampling needs (e.g. @MarvinLvn's way) to be used together with `child-project zooniverse extract-chunks`.
  4. If some of these procedures turn out to be standard / not too arbitrary, we can implement them in `child-project zooniverse extract-chunks`.
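To make step 2 concrete, here is a minimal sketch of what "custom segments" support could look like: users hand the tool a CSV of segments they sampled themselves, and the tool cuts them into 500 ms chunks for upload. The function name and CSV columns are illustrative assumptions, not the package's actual interface (which this thread is still defining).

```python
# Hypothetical "custom segments" input for extract-chunks: the user
# samples however they like, then provides a CSV of segments to cut.
# Columns assumed: recording_filename, segment_onset, segment_offset (ms).
from pathlib import Path

import pandas as pd
import soundfile as sf

def extract_custom_chunks(segments_csv, output_dir, chunk_ms=500):
    segments = pd.read_csv(segments_csv)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for seg in segments.itertuples():
        audio, sr = sf.read(seg.recording_filename)
        step = int(chunk_ms * sr / 1000)
        on = int(seg.segment_onset * sr / 1000)
        off = int(seg.segment_offset * sr / 1000)
        stem = Path(seg.recording_filename).stem
        # cut the user-provided segment into consecutive chunks
        for i, start in enumerate(range(on, off - step + 1, step)):
            sf.write(out / f"{stem}_{seg.segment_onset}_{i}.wav",
                     audio[start:start + step], sr)
```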

Does that sound good to you?

NB: I cannot run the scripts until the data is properly packaged, which might be difficult without the cluster.

alecristia commented 3 years ago

To clarify:

For now, we implement a minimum within the package and leave the sampling scripts outside of it. Then, as these scripts get reused (or not), we decide which ones to work into the package. Did I get that right?

If so, that sounds like an ideal approach -- instead of guessing now which procedures are likely to be used, and then signing up to maintain code for those guesses, we get a period of observation to see which sampling decisions turn out to be the most common.

lucasgautheron commented 3 years ago

Yep, exactly!