Closed: alecristia closed this issue 3 years ago
All datasets are checked for validity, yes, but that doesn't mean that the user's local copy is also valid. Often it will lack the audio recordings, but it may also lack some annotations that are not relevant to the user. This argues for having a validation check within R.
The alternative is to keep statistics about the validated dataset on the Python side, including:
Then within R, the user will derive stats of the data that is relevant to them, and cross-check them against the validated set, which would reveal whether their local copy is missing children/recordings/annotations (for the analysis they are aiming to do).
Something like this already exists: https://childproject.readthedocs.io/en/latest/tools.html?highlight=overview#dataset-overview
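For concreteness, here is a rough Python sketch of what such a stored snapshot could look like, so that the R side only has to recompute the same numbers on its local copy and compare. The metadata file names follow the documented ChildProject dataset layout, but the snapshot file name, field names, and exact statistics are hypothetical, not an existing ChildProject feature:

```python
# Hypothetical sketch: store a snapshot of dataset-level statistics next to the data,
# so that an R analysis can later cross-check its local copy against it.
# File and column names are assumptions based on the standard ChildProject layout.
import json
from pathlib import Path

import pandas as pd


def snapshot_stats(dataset_path: str) -> dict:
    root = Path(dataset_path)
    children = pd.read_csv(root / "metadata" / "children.csv")
    recordings = pd.read_csv(root / "metadata" / "recordings.csv")
    annotations = pd.read_csv(root / "metadata" / "annotations.csv")

    return {
        "n_children": int(children["child_id"].nunique()),
        "n_recordings": int(recordings["recording_filename"].nunique()),
        # one entry per annotation set, so missing sets are easy to spot from R
        "recordings_per_annotation_set": {
            s: int(n)
            for s, n in annotations.groupby("set")["recording_filename"]
            .nunique()
            .items()
        },
    }


if __name__ == "__main__":
    stats = snapshot_stats(".")
    # keep the snapshot inside the dataset, next to the metadata it describes
    Path("metadata/validation_snapshot.json").write_text(json.dumps(stats, indent=2))
```

The R script would then just read metadata/validation_snapshot.json and flag any mismatch in counts, which directly answers the "is my local copy complete enough for this analysis" question without a full validation pass.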
This is unavoidable: the user needs to load into R whatever is relevant. Note, however, that not everything is relevant every time.
- If checking for annotator agreement, only sections with overlaps across annotators are relevant.
- If checking for re-recording reliability, recording-level statistics are relevant.
- The same goes for analyses such as relating adult to child speech or child vocalization quantity by age: recording-level stats are all that is needed for these (see the sketch after this list).
- Other analyses can be thought of, like splitting the day into hours or half-days, or looking at time-of-day effects. These are rarer, so we can reserve full dataset loading for these rarer cases.
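As a sketch of the recording-level granularity mentioned above, deriving recording-level statistics from segment-level annotations could look like this in Python (column names follow ChildProject's documented segment format, with onsets/offsets in milliseconds, but treat them as assumptions here):

```python
# Sketch of recording-level statistics derived from segment-level annotations.
# Columns assumed: recording_filename, speaker_type, segment_onset, segment_offset (ms).
import pandas as pd


def recording_level_stats(segments: pd.DataFrame) -> pd.DataFrame:
    segments = segments.assign(
        duration=segments["segment_offset"] - segments["segment_onset"]
    )
    # total vocalization time per speaker type, one row per recording
    return (
        segments.groupby(["recording_filename", "speaker_type"])["duration"]
        .sum()
        .unstack(fill_value=0)
    )
```

These per-recording totals are tiny compared to the segment tables, so they could be stored with the dataset and loaded in seconds.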
This was already referred to above -- i.e., instead of looking at the segment level of granularity (as needed for e.g. annotator agreement), we look at the recording level.
Decisions at this level are often quite stable:
In view of all this, it would make sense to compute these statistics just once and keep them together with the data, rather than separately from it alongside the analyses.
That said, all of the above depend on the annotator. So whatever system we settle on to keep these secondary statistics, we need to compute them for all relevant annotators.
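Staying with the same hypothetical sketch, the only change needed to cover this is adding the annotation set as a grouping key, so that one set of secondary statistics is kept per annotator ("set" here refers to the annotation set column in ChildProject's annotation index, again as an assumption):

```python
# Same idea as above, but keeping one table of statistics per annotator / annotation set.
import pandas as pd


def per_set_recording_stats(segments: pd.DataFrame) -> pd.DataFrame:
    segments = segments.assign(
        duration=segments["segment_offset"] - segments["segment_onset"]
    )
    return (
        segments.groupby(["set", "recording_filename", "speaker_type"])["duration"]
        .sum()
        .unstack(fill_value=0)
    )
```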
Reading the YODA principles in the DataLad handbook, it sounds like perhaps we should use this as inspiration regarding secondary statistics: http://handbook.datalad.org/en/latest/_images/dataset_modules.svg
Right now, an initial function of every script loads the ChildProject dataset in its entirety and checks it for validity. For Bergelson, this takes over 12 minutes. It also seems to output reliability analyses based on repeated recordings.
We don't need to do all of this every time.
We need to split that function into:
Since the datasets are so large, I'm beginning to wonder whether some of these steps conceptually belong in ChildProject, which would output dated reports, rather than being done in R. A pro/con list follows in separate comments.
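To make the "dated reports" idea concrete, here is a minimal sketch of what the ChildProject-side output could look like, reusing the hypothetical snapshot helper from earlier in this thread (module, file names, and layout are all assumptions, not an existing ChildProject feature):

```python
# Hypothetical sketch of a dated report: run the (expensive) validation/statistics pass
# once on the Python side and store the result under a dated filename inside the dataset.
import datetime
import json
from pathlib import Path

# snapshot_stats() is the hypothetical helper sketched earlier in this thread,
# assumed here to be saved as stats_snapshot.py
from stats_snapshot import snapshot_stats


def write_dated_report(dataset_path: str) -> Path:
    stats = snapshot_stats(dataset_path)
    today = datetime.date.today().isoformat()
    report = Path(dataset_path) / "reports" / f"overview_{today}.json"
    report.parent.mkdir(exist_ok=True)
    report.write_text(json.dumps(stats, indent=2))
    return report
```

The R scripts would then only read the most recent report instead of re-running a 12-minute validation on every analysis.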