LAAC-LSCP / ChildProject

Python package for the management of day-long recordings of children.
https://childproject.readthedocs.io
MIT License

Zooniverse use cases #109

Closed: lucasgautheron closed this issue 3 years ago

lucasgautheron commented 3 years ago
| Project | Dataset | Sampling algorithm | Classification task(s) | Launch date | URL |
|---|---|---|---|---|---|
| Marvin's pilot | LENA/BabyLogger data | i) for each recorded child, extract 10 sections of 30 seconds that are NOT silence (according to a simple loudness detector); ii) cut these sections up into 500 ms clips. NOTE: the loudness detector score should be averaged across LENA and BabyLogger | child voc, female adult speech, male adult speech, junk | ASAP | https://www.zooniverse.org/lab/14957 (if you cannot access it, click "my projects" in the top-left menu of Zooniverse) |
| ERC WP1e/g | "all" | extract 350 vocs from CHI & FEM (VTC) for each recorded child, cut into 500 ms clips (note: it would be good if the "skip/exclude" procedure were in place) | 1) CHI/FEM/junk; 2) if CHI or FEM: crying, laughing, canonical, non-canonical | no rush | doesn't exist yet |
| zoo-phon (gold) | 1 pilot recording | if there are enough, pull out 250 randomly & 250 from high child-volubility 1-minute regions (the latter should be "consecutive vocs", i.e. take the top minute and pull out all the vocs from there, then move to the next minute, etc.) - NOT chunkified! | NONE! will be annotated in the lab | late Feb? | (none) |
| zoo-phon (pilot) | same pilot recording as above | same segments as above, but processed in three different ways: 1. the usual 500 ms chunks; 2. cut at a provided list of boundaries (coded by a human); 3. cut at a provided list of boundaries (coded by a machine) [NOTE: both lists are provided by collaborators] | will be set up by collaborator; involves transcribing using the International Phonetic Alphabet | probably in April (TBD by collaborator) | |
| zoo-phon | 10 randomly selected children from each of 5 corpora | TBD; depends on the results of the pilot, but will probably be based on a user-provided list of segments & boundaries | same as above | probably in September (TBD by collaborator) | |
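For illustration, here is a minimal sketch (outside ChildProject) of the first row's sampler: a simple RMS loudness detector keeps 30-second sections that are not silence, and each kept section is cut into 500 ms clips. The threshold, seed, and single-file handling are placeholder assumptions; in the real setup the loudness score would additionally be averaged across the LENA and BabyLogger recordings.

```python
# Sketch of the "Marvin's pilot" sampler: keep 30 s windows that are not
# silence according to a simple RMS loudness detector, then cut each kept
# window into 500 ms clips. Not ChildProject code; the silence threshold
# and mono downmix are illustrative assumptions.
import numpy as np
import soundfile as sf

WINDOW_S = 30.0    # candidate section length (seconds)
CLIP_S = 0.5       # chunk length uploaded to Zooniverse (seconds)
N_SECTIONS = 10    # sections to keep per recording

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def sample_chunks(wav_path, silence_rms=0.01, seed=0):
    audio, sr = sf.read(wav_path)
    if audio.ndim > 1:                      # downmix to mono
        audio = audio.mean(axis=1)
    win = int(WINDOW_S * sr)
    # score every non-overlapping 30 s window with the loudness detector;
    # in practice this score would be averaged across LENA and BabyLogger
    starts = np.arange(0, len(audio) - win, win)
    loud = [s for s in starts if rms(audio[s:s + win]) > silence_rms]
    if not loud:
        return [], sr
    rng = np.random.default_rng(seed)
    picked = rng.choice(loud, size=min(N_SECTIONS, len(loud)), replace=False)
    # cut each kept section into consecutive 500 ms clips
    clip = int(CLIP_S * sr)
    chunks = []
    for s in picked:
        for c in range(s, s + win - clip + 1, clip):
            chunks.append(audio[c:c + clip])
    return chunks, sr
```

Similarly, the zoo-phon (gold) row's high-volubility half could be drawn by ranking minutes by how many CHI vocalizations start in them (e.g. from VTC output) and pulling whole minutes from the top down. A hedged sketch, with the input column names assumed rather than taken from the package:

```python
# Sketch of the zoo-phon (gold) sampler: 250 random CHI vocs plus 250
# drawn from the busiest minutes, top minute first. Column names
# (speaker_type, segment_onset in ms) are assumptions.
import pandas as pd

def volubility_sample(vocs: pd.DataFrame, n=250, seed=0):
    chi = vocs[vocs["speaker_type"] == "CHI"].copy()
    random_half = chi.sample(n=min(n, len(chi)), random_state=seed)
    # rank minutes by how many CHI vocs start in them
    chi["minute"] = chi["segment_onset"] // 60000
    top_minutes = chi["minute"].value_counts().index
    busy_rows = []
    for minute in top_minutes:               # take the top minute, then the next...
        busy_rows.append(chi[chi["minute"] == minute])
        if sum(len(r) for r in busy_rows) >= n:
            break
    busy_half = pd.concat(busy_rows).head(n)
    return random_half, busy_half
```

Per the table, these zoo-phon (gold) segments would then be kept whole rather than chunkified.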
MarvinLvn commented 3 years ago

Updated with my info :)

alecristia commented 3 years ago

Updated the table above.

lucasgautheron commented 3 years ago

This is the roadmap I suggest:

  1. Define a flexible metadata format to store the information about the chunks. We are done (or almost) with that already, but we need to think it through carefully so that it works in the long run.
  2. Implement "custom segments" support in `child-project zooniverse extract-chunks` (see the sketch after this list). That way, users can do their own magic for the sampling and still use our tool for the extraction and upload of chunks.
  3. I'll write scripts to implement the most urgent sampling needs (e.g. @MarvinLvn's way) to be used together with `child-project zooniverse extract-chunks`.
  4. If some of these procedures turn out to be standard / not too arbitrary, we can implement them in `child-project zooniverse extract-chunks`.
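To make step 2 concrete, here is a minimal sketch of what "custom segments" support could look like: users hand the tool a CSV of segments they sampled themselves, and the tool cuts them into 500 ms chunks for upload. The function name and CSV columns are illustrative assumptions, not the package's actual interface (which this thread is still defining).

```python
# Hypothetical "custom segments" input for extract-chunks: the user
# samples however they like, then provides a CSV of segments to cut.
# Columns assumed: recording_filename, segment_onset, segment_offset (ms).
from pathlib import Path

import pandas as pd
import soundfile as sf

def extract_custom_chunks(segments_csv, output_dir, chunk_ms=500):
    segments = pd.read_csv(segments_csv)
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for seg in segments.itertuples():
        audio, sr = sf.read(seg.recording_filename)
        step = int(chunk_ms * sr / 1000)
        on = int(seg.segment_onset * sr / 1000)
        off = int(seg.segment_offset * sr / 1000)
        stem = Path(seg.recording_filename).stem
        # cut the user-provided segment into consecutive chunks
        for i, start in enumerate(range(on, off - step + 1, step)):
            sf.write(out / f"{stem}_{seg.segment_onset}_{i}.wav",
                     audio[start:start + step], sr)
```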

Does that sound good to you?

NB: I cannot run the scripts until the data is properly packaged, which might be difficult without the cluster.

alecristia commented 3 years ago

To clarify:

For now, we implement a minimum within the package and leave the sampling scripts outside of it. Then, as these scripts get reused (or not), we decide which ones to work into the package. Did I get that right?

If so, that sounds like an ideal approach -- instead of guessing now which procedures are likely to be used, and then signing up to maintain code for those guesses, we get a period of observation to see which sampling decisions turn out to be the most common.

lucasgautheron commented 3 years ago

Yep, exactly!