LAAC-LSCP / ChildRecordsR

R package for the evaluation of annotations of daylong recordings
https://laac-lscp.github.io/ChildRecordsR

Loading project is unnecessarily long and contains functions that may not be needed each time #57

Closed alecristia closed 3 years ago

alecristia commented 3 years ago

Right now, the first step of every script is to load the ChildProject dataset in its entirety and check it for validity. For Bergelson, this takes over 12 minutes. It also seems to output reliability analyses via repeated recordings.

We don't need to do all of this every time.

We need to split that function into:

- checking for validity
- loading annotations as required by the user
- deriving secondary statistics at the level of the recording

Since the datasets are so large, I'm beginning to wonder whether some of these steps don't conceptually belong in ChildProject (the Python side), which would output dated reports, rather than being done in R. A pro/con list follows in separate comments.
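One way the monolithic loading step could be decoupled; a minimal Python sketch with hypothetical function names and stubbed bodies (none of this is the actual ChildRecordsR/ChildProject API):

```python
# Hypothetical decomposition: each script calls only the steps it needs,
# instead of one function that loads and validates everything up front.

def check_validity(project_path):
    """Cheap structural check of the local copy (no full data load).
    Stubbed: a real version would compare files against the dataset index."""
    return {"path": project_path, "valid": True}

def load_annotations(project_path, sets):
    """Load only the annotation sets the analysis actually needs.
    Stubbed here as empty lists of segments per requested set."""
    return {s: [] for s in sets}

def recording_stats(annotations):
    """Derive summaries from already-loaded annotations
    (here, just a segment count per set)."""
    return {name: len(segments) for name, segments in annotations.items()}
```

A reliability script, for instance, would call `check_validity` and load only the overlapping annotation sets, skipping the rest.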

alecristia commented 3 years ago

Checking for validity

All datasets are checked for validity, yes, but that doesn't mean the user's local copy is also valid. Often it will lack the audio recordings, but it may also lack annotations that are not relevant to the user. This argues for having a validation check within R.

The alternative is that statistics about the validated dataset are kept on the Python side, including:

Then, within R, the user derives the same statistics for the data relevant to them and cross-checks them against the validated set, which reveals whether their local copy is missing children/recordings/annotations (for the analysis they are aiming to do).

Something like this already exists: https://childproject.readthedocs.io/en/latest/tools.html?highlight=overview#dataset-overview
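The cross-check idea could look like this; a hypothetical Python sketch (the function name and the example counts are made up, not real ChildProject output):

```python
# Hypothetical cross-check: the validated dataset publishes reference counts;
# the analysis side recomputes the same counts on the local copy and diffs
# them to spot missing material.

def diff_against_reference(reference, local):
    """Return the categories where the local copy falls short of the
    validated dataset (e.g. missing children, recordings, annotations)."""
    return {key: reference[key] - local.get(key, 0)
            for key in reference
            if local.get(key, 0) < reference[key]}

# Illustrative counts only.
reference = {"children": 40, "recordings": 120, "annotation_files": 480}
local = {"children": 40, "recordings": 118, "annotation_files": 470}
missing = diff_against_reference(reference, local)
# missing == {"recordings": 2, "annotation_files": 10}
```

A non-empty diff would tell the user exactly which material their local copy lacks before they run an analysis.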

alecristia commented 3 years ago

Loading annotations as required by the user

This is unavoidable: the user needs to load into R whatever is relevant. Note, however, that not everything is relevant every time.

If checking for annotator agreement, only sections with overlaps across annotators are relevant.

If checking for re-recording reliability, recording-level statistics are relevant.

The same goes for analyses such as relating adult to child speech, child vocalization quantity by age, etc. Recording-level stats are all that is needed for these.

Other analyses are conceivable, such as splitting the day into hours or half-days, or looking at time-of-day effects. These are rarer, so we can reserve full dataset loading for those cases.
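The case analysis above could be sketched as a simple dispatch; all analysis labels here are illustrative, not actual ChildRecordsR options:

```python
# Hypothetical mapping from analysis type to the minimal data it needs.
# Only the rarer within-day analyses force a full segment-level load.

FULL_LOAD_ANALYSES = {"time_of_day", "hourly", "half_day"}

def loading_plan(analysis):
    """Return the minimal loading strategy for a given analysis type."""
    if analysis == "annotator_agreement":
        return "overlapping segments only"
    if analysis in {"rerecording_reliability", "adult_child_relation",
                    "voc_quantity_by_age"}:
        return "recording-level statistics"
    if analysis in FULL_LOAD_ANALYSES:
        return "full segment-level data"
    raise ValueError(f"unknown analysis: {analysis}")
```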

alecristia commented 3 years ago

Deriving secondary statistics at the level of the recording

This was already referred to above: instead of working at the segment level of granularity (as needed for, e.g., annotator agreement), we work at the recording level.

Decisions at this level are often quite stable:

In view of all this, it would make sense to derive these statistics just once and keep them together with the data, rather than separately with the analyses.

That said, all of the above depend on the annotator. So whatever system we use to store these secondary statistics, we need to compute them for all relevant annotators.
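A minimal sketch of per-annotator recording-level statistics, assuming a toy segment representation of (annotator, recording, duration) tuples; none of this reflects the actual storage format:

```python
# Hypothetical aggregation: recording-level secondary statistics are kept
# per annotator, computed once and stored alongside the data.

def summarize_by_annotator(segments):
    """segments: iterable of (annotator, recording, duration_seconds).
    Returns {annotator: {recording: total_speech_seconds}}."""
    stats = {}
    for annotator, recording, duration in segments:
        per_rec = stats.setdefault(annotator, {})
        per_rec[recording] = per_rec.get(recording, 0) + duration
    return stats

# Toy example: two annotators covering the same recording.
segments = [("vtc", "rec1", 10), ("vtc", "rec1", 5), ("eaf", "rec1", 12)]
stats = summarize_by_annotator(segments)
# stats == {"vtc": {"rec1": 15}, "eaf": {"rec1": 12}}
```

Keying the summaries by annotator keeps the stored statistics valid no matter which annotation set a given analysis relies on.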

alecristia commented 3 years ago

Reading the YODA principles in the DataLad handbook suggests we could take this as inspiration for how to handle secondary statistics: http://handbook.datalad.org/en/latest/_images/dataset_modules.svg