Open alecristia opened 3 years ago
I think that was a bug in Lucas' pipeline, because data_analyses/files_from_zooniverse/maturity-of-baby-sounds-subjects.csv has all the files in the metadata (and more).
When I merge Chiara's metadata & subjects.csv, however, I find too many rows, because some of the file names are reused -- and I cannot tell which is the right one. For instance, "0005766928.mp3" was uploaded on 2020-09-09 14:47:21 UTC as subject_id 49578310, and on 2020-09-09 14:20:12 UTC as subject_id 49577130 -- with the SAME subject_set_id = 87385.
In that one subject_set_id, there are 150 files that received two subject_ids, and thus were coded twice. This happened again for one file in another batch. That explains 151 out of the 151 extra rows found when merging those two files.
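A minimal sketch of how such duplicates can be flagged, using a few synthetic rows standing in for subjects.csv (column names and values here are illustrative, not the actual export schema):

```python
from collections import defaultdict

# Synthetic stand-in for maturity-of-baby-sounds-subjects.csv:
# (filename, subject_id, subject_set_id); only the duplicated example
# from the issue is real, the third row is made up for contrast
subjects = [
    ("0005766928.mp3", 49578310, 87385),
    ("0005766928.mp3", 49577130, 87385),  # same file, second subject_id
    ("0001234567.mp3", 49500001, 87385),
]

# Group subject_ids by (subject_set_id, filename); more than one id
# means the file was uploaded twice into the same set and coded twice
groups = defaultdict(list)
for fname, subj_id, set_id in subjects:
    groups[(set_id, fname)].append(subj_id)

duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)  # {(87385, '0005766928.mp3'): [49578310, 49577130]}
```

Each entry in `duplicates` is one file that will produce an extra row in the merge, so the total number of extra rows should equal the number of duplicated (set, filename) pairs.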
- data_analyses/files_from_elsewhere/metadata_all_PU.csv, composed by Chiara by putting together all metadata files from PU, has 33730 lines = 33730 chunks, with 33730 unique names according to the AudioData column
- data_analyses/files_from_elsewhere/dict_4.json lists 33728 chunks -- 2 fewer than the metadata above
- data_analyses/output/metadata_all_PU.csv (created by Lucas) lists 19692 chunks, with 19691 different names -- so one of the chunks here got an ambiguous name AND 16k are missing

I'm investigating the discrepancy by not using Lucas' pipeline, but instead starting from the raw subject info. This won't fix the 2 chunks that went missing between metadata & dict.
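The metadata-vs-dict comparison above can be sketched like this, with small synthetic stand-ins for the two files (assuming chunk names appear in the AudioData column of the CSV and as top-level keys of dict_4.json -- the key layout is an assumption):

```python
import csv
import io
import json

# Synthetic stand-ins; the real files live under data_analyses/
metadata_csv = io.StringIO("AudioData\nchunk_a.mp3\nchunk_b.mp3\nchunk_c.mp3\n")
dict_json = '{"chunk_a.mp3": {}, "chunk_b.mp3": {}}'

# Unique chunk names on each side
csv_names = {row["AudioData"] for row in csv.DictReader(metadata_csv)}
dict_names = set(json.loads(dict_json).keys())

# Chunks present in the metadata but absent from the dict
missing = csv_names - dict_names
print(len(csv_names), len(dict_names), sorted(missing))
```

Run against the real files, `missing` should name the 2 chunks that went missing between metadata and dict, and the same set arithmetic against Lucas' output would isolate the ~16k chunks his pipeline dropped.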