LAAC-LSCP / zoo-babble-validation

Apache License 2.0

missing data from dict_4.json #4

Open alecristia opened 3 years ago

alecristia commented 3 years ago

I'm investigating the discrepancy by not using Lucas' pipeline, but instead starting from the raw subject info. This won't fix the 2 chunks that went missing between metadata & dict.

alecristia commented 3 years ago

I think that was a bug in Lucas' pipeline, because data_analyses/files_from_zooniverse/maturity-of-baby-sounds-subjects.csv has all the files in metadata (and more).

When I merge Chiara's metadata & subjects.csv, however, I find too many rows, because some of the file names are reused -- but I cannot tell which are the right one(s). For instance, "0005766928.mp3" was uploaded on 2020-09-09 14:47:21 UTC as subject_id 49578310, and at 2020-09-09 14:20:12 UTC as subject_id 49577130 -- with the SAME subject_set_id = 87385.
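One way to list the reused file names is to group the subject rows by subject set and file name, then keep the groups that carry more than one subject_id. This is a minimal sketch on toy rows, not what the pipeline does: the key names (`filename`, `subject_id`, `subject_set_id`, `created_at`) are assumptions about how the export would look once flattened, and the third row is made up for contrast.

```python
from collections import defaultdict

# Toy rows standing in for maturity-of-baby-sounds-subjects.csv.
# Key names are assumed; the real export may store the file name elsewhere.
rows = [
    {"filename": "0005766928.mp3", "subject_id": 49577130,
     "subject_set_id": 87385, "created_at": "2020-09-09 14:20:12 UTC"},
    {"filename": "0005766928.mp3", "subject_id": 49578310,
     "subject_set_id": 87385, "created_at": "2020-09-09 14:47:21 UTC"},
    # Hypothetical row with a file name that is used only once.
    {"filename": "example_unique.mp3", "subject_id": 1,
     "subject_set_id": 87385, "created_at": "2020-09-09 14:00:00 UTC"},
]


def find_reused(rows):
    """Map (subject_set_id, filename) to its subject_ids when there are several."""
    groups = defaultdict(set)
    for row in rows:
        groups[(row["subject_set_id"], row["filename"])].add(row["subject_id"])
    # Keep only file names that were given more than one subject_id.
    return {key: sorted(ids) for key, ids in groups.items() if len(ids) > 1}


reused = find_reused(rows)
for (set_id, name), ids in reused.items():
    print(set_id, name, ids)
```

This only flags the ambiguous files; it does not say which of the two subject_ids is the right one.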

alecristia commented 3 years ago

In that one subject set ID, there are 150 files that received two subject IDs, and thus were coded twice. This happened again for one file in another batch. That explains all 151 of the extra rows found when merging those two files.
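The row count can be checked with a small sketch: if each file name appears once in the metadata, a file with k subject IDs yields k rows after the merge, so it adds k - 1 extra rows. The file names and the total of 200 files below are placeholders; only the counts 150 and 1 come from this thread.

```python
from collections import Counter


def extra_rows_after_merge(metadata_files, subject_files):
    """Extra rows produced by joining unique metadata file names to subjects."""
    per_file = Counter(subject_files)
    matched = [f for f in metadata_files if per_file[f] > 0]
    merged_rows = sum(per_file[f] for f in matched)
    # Each matched file should give one row; anything beyond that is extra.
    return merged_rows - len(matched)


# Placeholder names: 200 files in the metadata, each listed once.
metadata_files = [f"file_{i}.mp3" for i in range(200)]
# 150 files uploaded twice in one subject set, plus 1 file in another batch.
uploaded_twice = metadata_files[:150] + metadata_files[150:151]
subject_files = metadata_files + uploaded_twice

extra = extra_rows_after_merge(metadata_files, subject_files)
print(extra)
```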