High-Volubility sampler

lucasgautheron commented 3 years ago

Is your feature request related to a problem? Please describe.

https://github.com/aclew/EAF_builder_scripts/tree/Second-Version implements a sampler that draws segments from the recordings and generates .eaf files based on this selection.

We would like to integrate this code into ChildRecordsData, so that it can be run over any dataset formatted according to our guidelines. The package should make it possible to estimate volubility in a consistent manner regardless of the origin of the input annotations (lena, or vtc/alice).

However, our package aims to decouple the sampling of the segments to annotate from the subsequent creation of the .eaf files or Zooniverse chunk extraction and upload. Indeed, one might want to use the high-volubility sampling algorithm together with Zooniverse rather than ELAN. Similarly, one might want to create .eaf files from segments sampled using another algorithm.

This means that there should be two pipelines:

The sampler
The .eaf creation pipeline
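
To illustrate the decoupling described above, here is a minimal sketch in which a sampler only produces a list of segments and a separate step consumes them. All names here (`Segment`, `periodic_sampler`, `describe_eaf_jobs`) are hypothetical and are not taken from ChildRecordsData or the original scripts; times are assumed to be in milliseconds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    recording: str  # recording file name
    onset: int      # segment start, in milliseconds (assumed unit)
    offset: int     # segment end, in milliseconds (assumed unit)


def periodic_sampler(recording: str, duration_ms: int,
                     length_ms: int = 60_000,
                     skip_ms: int = 240_000) -> List[Segment]:
    """Hypothetical sampler: one segment of length_ms, then a gap of skip_ms."""
    segments = []
    onset = 0
    while onset + length_ms <= duration_ms:
        segments.append(Segment(recording, onset, onset + length_ms))
        onset += length_ms + skip_ms
    return segments


def describe_eaf_jobs(segments: List[Segment],
                      context_onset_ms: int = 0,
                      context_offset_ms: int = 0) -> List[Tuple[str, int, int]]:
    """Hypothetical consumer: widen each segment by its context margins.

    It knows nothing about how the segments were sampled, so any sampler
    that returns Segment objects can feed it.
    """
    return [
        (s.recording,
         max(0, s.onset - context_onset_ms),
         s.offset + context_offset_ms)
        for s in segments
    ]
```

Because the second function only depends on the segment list, the same sampler output could equally be handed to a Zooniverse chunk extractor.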

Describe the solution you'd like

Here are the parameters of the original pipeline, mapped to the ChildRecordsData pipeline they should belong to:

--t is the length of chosen chunks in minutes -> sampler pipeline
--skip is the interstimulus interval for the periodic method in minutes -> sampler pipeline
--n is the number of chunks to be chosen for the random method -> sampler pipeline
--c_on is the difference between code onset and context onset in minutes -> eaf pipeline
--c_off is the difference between code offset and context offset in minutes -> eaf pipeline
--temp is the choice of templates between basic, native or non-native (only vcm for all tiers including CHI) -> eaf pipeline
--its is the parameter that defines if you want to process its files -> sampler pipeline
--n_its number of its information chunks to be chosen -> sampler pipeline
--its_types list of its types needed (AWC, CVC and CTC) -> sampler pipeline
--overlap is the parameter that defines whether its information time segments can overlap with the main method's (random or periodic) time segments -> sampler pipeline

sarpu commented 3 years ago

So a few questions about the original parameters and some other points as they relate to the high-volubility sampler:

The original code's high-volubility calculation is for fixed time windows (5 minutes). I will use the --t parameter to determine this time in seconds as @lucasgautheron set it up, but just thought I'd note it here.
Relatedly, the --skip parameter is not used for calculating the high-volubility windows, but I can also integrate that (though since it is not included in the parameters, I guess it is not going to be used?). However, this would mean that we would need to think a bit deeper: do we want to set/align the high-volubility windows such that we never accidentally place a skip where there might be high volubility? Or are we OK with starting from the beginning of the file and going sequentially (window, skip, window, skip...)? That is doable, but would be more computationally sophisticated than the original. Ofc, there are only 2 variations in such a scheme (we can either start with a skip or a window), but more generally, what do we wish to calculate with high-volubility? If we don't have to be sequential at all, then that is completely different. I will probably just push the original code's naive version, but still wanted to point it out.
--n this is already in the parameter list.
--its this is not in the parameter list, and should not be, since there is no way to calculate volubility without info from its files (or some other annotation file that has such info). I noticed that this needs to be set with the annotation_set parameter (e.g. --annotation_set its) for the high-volubility sampler to be called with the vandam-demo files.
--n_its self-explanatory and already in sampler parameters.
--its_types ditto
--overlap this is not used to calculate high-volubility windows. I am not sure how (if we should at all) to integrate this info, and the original code is somewhat confusing. I believe it simply removes high-volubility windows that might overlap with randomly or periodically sampled windows.

In addition, I ran the original code and checked the results against the segments calculated by the high-volubility sampler we are implementing to compare the CTC, CVC, AWC, etc. There are slight differences in the numbers, and while not 100% sure, from stepping around the annotations.py code, I believe they are due to the available parameter choices (the original code always calculates slightly more, which I guess might be because it also counts other speaker types here?)
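
The fixed-window calculation discussed above can be sketched as follows. This is only an illustration under stated assumptions, not the original code or the ChildProject implementation: annotations are assumed to be dicts with an `onset` in milliseconds and per-segment counts under keys such as `ctc`, and each annotation is assigned to a window by its onset alone. The function name `top_windows` is hypothetical.

```python
from collections import defaultdict


def top_windows(annotations, window_ms, n, metric="ctc"):
    """Return the n fixed windows with the highest total for `metric`.

    annotations: iterable of dicts with an 'onset' key (ms) and a count
    stored under the metric name (assumed layout).
    Result: list of (window_onset, window_offset, total), best first.
    """
    totals = defaultdict(int)
    for row in annotations:
        # Assign the annotation to a window using its onset only.
        index = row["onset"] // window_ms
        totals[index] += row.get(metric, 0)

    # Highest total first; ties are broken by the earliest window.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [(i * window_ms, (i + 1) * window_ms, total)
            for i, total in ranked[:n]]
```

Windows here always start at the beginning of the file and are contiguous, which corresponds to the "naive" variant with no skip.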

lucasgautheron commented 3 years ago

So a few questions about the original parameters and some other points as they relate to the high-volubility sampler:

The original code's high-volubility calculation is for fixed time windows (5 minutes). I will use the --t parameter to determine this time in seconds as @lucasgautheron set it up, but just thought I'd note it here.

noted ! we can rename to something more explanatory later anyway

Relatedly, the --skip parameter is not used for calculating the high-volubility windows, but I can also integrate that (though since it is not included in the parameters, I guess it is not going to be used?). However, this would mean that we would need to think a bit deeper: do we want to set/align the high-volubility windows such that we never accidentally place a skip where there might be high volubility? Or are we OK with starting from the beginning of the file and going sequentially (window, skip, window, skip...)? That is doable, but would be more computationally sophisticated than the original. Ofc, there are only 2 variations in such a scheme (we can either start with a skip or a window), but more generally, what do we wish to calculate with high-volubility? If we don't have to be sequential at all, then that is completely different. I will probably just push the original code's naive version, but still wanted to point it out.

I think you can just ignore this skip parameter for the moment. To be honest, I did not put much thought into this; it is mostly a copy/paste of the README.

--n this is already in the parameter list.

--its this is not in the parameter list, and should not be, since there is no way to calculate volubility without info from its files (or some other annotation file that has such info). I noticed that this needs to be set with the annotation_set parameter (e.g. --annotation_set its) for the high-volubility sampler to be called with the vandam-demo files.

indeed !

--n_its self-explanatory and already in sampler parameters.

--its_types ditto

--overlap this is not used to calculate high-volubility windows. I am not sure how (if we should at all) to integrate this info, and the original code is somewhat confusing. I believe it simply removes high-volubility windows that might overlap with randomly or periodically sampled windows.

I think you are right - again, sorry about that incautious copy/paste
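
For reference, the behaviour attributed to --overlap above (dropping high-volubility windows that overlap randomly or periodically sampled ones) could look like the sketch below. It reflects the interpretation given in this thread rather than a reading of the original code, and the function names are hypothetical. Intervals are treated as half-open, so windows that merely touch do not count as overlapping.

```python
def overlaps(a, b):
    """True if intervals a=(onset, offset) and b=(onset, offset) share time."""
    return a[0] < b[1] and b[0] < a[1]


def drop_overlapping(candidates, reserved):
    """Keep only the candidate windows that overlap none of the reserved ones."""
    return [c for c in candidates
            if not any(overlaps(c, r) for r in reserved)]
```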

In addition, I ran the original code and checked the results against the segments calculated by the high-volubility sampler we are implementing to compare the CTC, CVC, AWC, etc. There are slight differences in the numbers, and while not 100% sure, from stepping around the annotations.py code, I believe they are due to the available parameter choices (the original code always calculates slightly more, which I guess might be because it also counts other speaker types here?)

I'd say, this is probably a good guess ! Feel free to include more speaker types, and see whether that solves/reduces the discrepancy :)

sarpu commented 3 years ago

Thank you for the response, I just pushed the high volubility sampler. Another question: the original code simply outputs csv files for each of cvc, awc and ctc. Do we want to do the same thing? Or, should the high-volubility sampler set the segments to the merge of these 3 tables (if the user requests 2 or all 3)? In that case, would we want the merged chunks to contain info from all the top segments? So for example, if the user requests top 5 chunks (--n_its is set to 5) and its_types are awc and cvc, and the top chunks for awc scores are chunk numbers 1 through 5, while for cvc they are chunks 6 through 10 (so they don't overlap), should we output the cvc scores for the first 5 chunks and the awc for the next? Or should we just have a column that says which chunk comes from which ordering type?
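
The last option in the question above (a single table with a column recording which ordering selected each chunk) might be sketched like this. The function name and the `selected_by` column are hypothetical, chosen only for illustration.

```python
def merge_top_chunks(top_by_metric):
    """Merge per-metric top chunks into one list of rows.

    top_by_metric: dict mapping a metric name ('awc', 'cvc', 'ctc') to a
    list of (onset, offset) chunks.
    Each output row lists every metric that selected the chunk.
    """
    merged = {}
    for metric, chunks in top_by_metric.items():
        for chunk in chunks:
            merged.setdefault(chunk, []).append(metric)

    return [
        {"onset": onset, "offset": offset, "selected_by": sorted(metrics)}
        for (onset, offset), metrics in sorted(merged.items())
    ]
```

A chunk chosen by several metrics appears once, with all of them listed, so nothing is duplicated when the top chunks coincide.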

sarpu commented 3 years ago

In addition, I forgot to mention it in my first comment, but --its_type parameter (which chooses the volubility measurement type) is not currently in the parameter list. Would we like to add that?

Also, --target-speaker-type parameter of high-volubility is not in the list of parameters for the original code, and I am slightly confused about its purpose and how it would integrate. For example, if the target speaker type doesn't include FEM or MAL, then do we still calculate awc?

lucasgautheron commented 3 years ago

Thank you for the response, I just pushed the high volubility sampler. Another question: the original code simply outputs csv files for each of cvc, awc and ctc. Do we want to do the same thing? Or, should the high-volubility sampler set the segments to the merge of these 3 tables (if the user requests 2 or all 3)? In that case, would we want the merged chunks to contain info from all the top segments? So for example, if the user requests top 5 chunks (--n_its is set to 5) and its_types are awc and cvc, and the top chunks for awc scores are chunk numbers 1 through 5, while for cvc they are chunks 6 through 10 (so they don't overlap), should we output the cvc scores for the first 5 chunks and the awc for the next? Or should we just have a column that says which chunk comes from which ordering type?

Thank you so much !

Regarding your question : I think the --its-type parameter is needed indeed. But I think it should allow only one value at once, don't you think ?

lucasgautheron commented 3 years ago

Also, --target-speaker-type parameter of high-volubility is not in the list of parameters for the original code, and I am slightly confused about its purpose and how it would integrate. For example, if the target speaker type doesn't include FEM or MAL, then do we still calculate awc?

no, you are right, it should be dropped

sarpu commented 3 years ago

Thank you so much !

Regarding your question : I think the --its-type parameter is needed indeed. But I think it should allow only one value at once, don't you think ?

I agree.

no, you are right, it should be dropped

Oki then, I will implement that. If we want to make it more complicated in the future we can go that route (weighted 3 way score calculation etc.).
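
A single-valued metric option, as agreed above, can be expressed with `argparse` `choices`. This is a sketch only: the program name, option spellings and default are assumptions for illustration and do not describe the actual ChildProject command line.

```python
import argparse


def build_parser():
    """Hypothetical parser for the high-volubility sampler options."""
    parser = argparse.ArgumentParser(prog="high-volubility-sketch")
    parser.add_argument("--n-its", type=int, default=5,
                        help="number of top windows to keep (assumed default)")
    # Exactly one metric may be given; argparse rejects anything else.
    parser.add_argument("--its-type", choices=["awc", "cvc", "ctc"],
                        required=True,
                        help="volubility measure used to rank the windows")
    return parser
```

Since the option takes one value, passing a second metric makes the parser exit with an error instead of silently ranking by several measures.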

lucasgautheron commented 3 years ago

Good! Hang on a second, I am about to commit very minor changes

lucasgautheron commented 3 years ago

Done - you can resume your work!

LAAC-LSCP / ChildProject

High-Volubility sampler #144