Open alecristia opened 3 years ago
I think that was a bug in Lucas' pipeline, because data_analyses/files_from_zooniverse/maturity-of-baby-sounds-subjects.csv has all the files in the metadata (and more).
When I merge Chiara's metadata & subjects.csv, however, I find too many rows, because some of the file names are reused -- and I cannot tell which is the right one. For instance, "0005766928.mp3" was uploaded on 2020-09-09 14:47:21 UTC as subject_id 49578310, and on 2020-09-09 14:20:12 UTC as subject_id 49577130 -- with the SAME subject_set_id = 87385.
In that one subject_set_id, there are 150 files that received two subject_ids, and thus were coded twice. This happened again for one file in another batch. That explains 151 out of the 151 extra rows found when merging those two files.
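A minimal sketch of how such duplicates can be flagged, using a few synthetic rows standing in for subjects.csv (column names and values here are illustrative, not the actual export schema):

```python
from collections import defaultdict

# Synthetic stand-in for maturity-of-baby-sounds-subjects.csv:
# (filename, subject_id, subject_set_id); only the duplicated example
# from the issue is real, the third row is made up for contrast
subjects = [
    ("0005766928.mp3", 49578310, 87385),
    ("0005766928.mp3", 49577130, 87385),  # same file, second subject_id
    ("0001234567.mp3", 49500001, 87385),
]

# Group subject_ids by (subject_set_id, filename); more than one id
# means the file was uploaded twice into the same set and coded twice
groups = defaultdict(list)
for fname, subj_id, set_id in subjects:
    groups[(set_id, fname)].append(subj_id)

duplicates = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicates)  # {(87385, '0005766928.mp3'): [49578310, 49577130]}
```

Each entry in `duplicates` is one file that will produce an extra row in the merge, so the total number of extra rows should equal the number of duplicated (set, filename) pairs.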
- data_analyses/files_from_elsewhere/metadata_all_PU.csv, composed by Chiara by putting together all metadata files from PU, has 33730 lines = 33730 chunks, with 33730 unique names according to the AudioData column
- data_analyses/files_from_elsewhere/dict_4.json lists 33728 chunks -- 2 fewer than the metadata above
- data_analyses/output/metadata_all_PU.csv (created by Lucas) lists 19692 chunks, with 19691 different names -- so one of the chunks here got an ambiguous name AND 16k are missing

I'm investigating the discrepancy by not using Lucas' pipeline, but instead starting from the raw subject info. This won't fix the 2 chunks that went missing between metadata & dict.
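The metadata-vs-dict comparison above can be sketched like this, with small synthetic stand-ins for the two files (assuming chunk names appear in the AudioData column of the CSV and as top-level keys of dict_4.json -- the key layout is an assumption):

```python
import csv
import io
import json

# Synthetic stand-ins; the real files live under data_analyses/
metadata_csv = io.StringIO("AudioData\nchunk_a.mp3\nchunk_b.mp3\nchunk_c.mp3\n")
dict_json = '{"chunk_a.mp3": {}, "chunk_b.mp3": {}}'

# Unique chunk names on each side
csv_names = {row["AudioData"] for row in csv.DictReader(metadata_csv)}
dict_names = set(json.loads(dict_json).keys())

# Chunks present in the metadata but absent from the dict
missing = csv_names - dict_names
print(len(csv_names), len(dict_names), sorted(missing))
```

Run against the real files, `missing` should name the 2 chunks that went missing between metadata and dict, and the same set arithmetic against Lucas' output would isolate the ~16k chunks his pipeline dropped.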