Closed: alecristia closed this issue 3 years ago
All datasets are checked for validity, yes, but that doesn't mean that the user's local copy is also valid. Often it will lack the audio recordings, but it may also lack some annotations that are not relevant to the user. This argues for having a validation check within R.
The alternative is to keep statistics about the validated dataset on the Python side, including:
Then within R, the user will derive stats of the data that is relevant to them, and cross-check them against the validated set, which would reveal whether their local copy is missing children/recordings/annotations (for the analysis they are aiming to do).
Something like this already exists: https://childproject.readthedocs.io/en/latest/tools.html?highlight=overview#dataset-overview
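For concreteness, here is a rough Python sketch of what such a stored snapshot could look like, so that the R side only has to recompute the same numbers on its local copy and compare. The metadata file names follow the documented ChildProject dataset layout, but the snapshot file name, field names, and exact statistics are hypothetical, not an existing ChildProject feature:

```python
# Hypothetical sketch: store a snapshot of dataset-level statistics next to the data,
# so that an R analysis can later cross-check its local copy against it.
# File and column names are assumptions based on the standard ChildProject layout.
import json
from pathlib import Path

import pandas as pd


def snapshot_stats(dataset_path: str) -> dict:
    root = Path(dataset_path)
    children = pd.read_csv(root / "metadata" / "children.csv")
    recordings = pd.read_csv(root / "metadata" / "recordings.csv")
    annotations = pd.read_csv(root / "metadata" / "annotations.csv")

    return {
        "n_children": int(children["child_id"].nunique()),
        "n_recordings": int(recordings["recording_filename"].nunique()),
        # one entry per annotation set, so missing sets are easy to spot from R
        "recordings_per_annotation_set": {
            s: int(n)
            for s, n in annotations.groupby("set")["recording_filename"]
            .nunique()
            .items()
        },
    }


if __name__ == "__main__":
    stats = snapshot_stats(".")
    # keep the snapshot inside the dataset, next to the metadata it describes
    Path("metadata/validation_snapshot.json").write_text(json.dumps(stats, indent=2))
```

The R script would then just read metadata/validation_snapshot.json and flag any mismatch in counts, which directly answers the "is my local copy complete enough for this analysis" question without a full validation pass.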
This is unavoidable: the user needs to load into R whatever is relevant. Note, however, that not everything is relevant every time.
- If checking for annotator agreement, only sections with overlaps across annotators are relevant.
- If checking for re-recording reliability, recording-level statistics are relevant.
- The same goes for analyses such as relating adult to child speech or child vocalization quantity by age: recording-level stats are all that is needed for these (see the sketch after this list).
- Other analyses can be thought of, like splitting the day into hours or half-days, or looking at time-of-day effects. These are rarer, so we can reserve full dataset loading for these rarer cases.
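As a sketch of the recording-level granularity mentioned above, deriving recording-level statistics from segment-level annotations could look like this in Python (column names follow ChildProject's documented segment format, with onsets/offsets in milliseconds, but treat them as assumptions here):

```python
# Sketch of recording-level statistics derived from segment-level annotations.
# Columns assumed: recording_filename, speaker_type, segment_onset, segment_offset (ms).
import pandas as pd


def recording_level_stats(segments: pd.DataFrame) -> pd.DataFrame:
    segments = segments.assign(
        duration=segments["segment_offset"] - segments["segment_onset"]
    )
    # total vocalization time per speaker type, one row per recording
    return (
        segments.groupby(["recording_filename", "speaker_type"])["duration"]
        .sum()
        .unstack(fill_value=0)
    )
```

These per-recording totals are tiny compared to the segment tables, so they could be stored with the dataset and loaded in seconds.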
This was already referred to above -- i.e., instead of looking at the segment level of granularity (as needed for e.g. annotator agreement), we look at the recording level.
Decisions at this level are often quite stable:
In view of all this, it would make sense to compute these statistics just once and keep them together with the data, rather than separately from it alongside the analyses.
That said, all of the above depend on the annotator. So whatever system we settle on to keep these secondary statistics, we need to compute them for all relevant annotators.
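Staying with the same hypothetical sketch, the only change needed to cover this is adding the annotation set as a grouping key, so that one set of secondary statistics is kept per annotator ("set" here refers to the annotation set column in ChildProject's annotation index, again as an assumption):

```python
# Same idea as above, but keeping one table of statistics per annotator / annotation set.
import pandas as pd


def per_set_recording_stats(segments: pd.DataFrame) -> pd.DataFrame:
    segments = segments.assign(
        duration=segments["segment_offset"] - segments["segment_onset"]
    )
    return (
        segments.groupby(["set", "recording_filename", "speaker_type"])["duration"]
        .sum()
        .unstack(fill_value=0)
    )
```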
Reading the YODA principles in the DataLad handbook, it sounds like perhaps we should use this as inspiration regarding secondary statistics: http://handbook.datalad.org/en/latest/_images/dataset_modules.svg
Right now, an initial function of every script loads the ChildProject dataset in its entirety and checks it for validity. For Bergelson, this takes over 12 minutes. It also seems to output reliability analyses based on repeated recordings.
We don't need to do all of this every time.
We need to split that function into:
Since the datasets are so large, I'm beginning to wonder whether some of these steps conceptually belong in ChildProject, which would output dated reports, rather than being done in R. A pro/con list follows in separate comments.
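To make the "dated reports" idea concrete, here is a minimal sketch of what the ChildProject-side output could look like, reusing the hypothetical snapshot helper from earlier in this thread (module, file names, and layout are all assumptions, not an existing ChildProject feature):

```python
# Hypothetical sketch of a dated report: run the (expensive) validation/statistics pass
# once on the Python side and store the result under a dated filename inside the dataset.
import datetime
import json
from pathlib import Path

# snapshot_stats() is the hypothetical helper sketched earlier in this thread,
# assumed here to be saved as stats_snapshot.py
from stats_snapshot import snapshot_stats


def write_dated_report(dataset_path: str) -> Path:
    stats = snapshot_stats(dataset_path)
    today = datetime.date.today().isoformat()
    report = Path(dataset_path) / "reports" / f"overview_{today}.json"
    report.parent.mkdir(exist_ok=True)
    report.write_text(json.dumps(stats, indent=2))
    return report
```

The R scripts would then only read the most recent report instead of re-running a 12-minute validation on every analysis.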