digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

Merging Twitter datasets #408

Closed eeftychiou closed 8 months ago

eeftychiou commented 8 months ago

Describe the bug: I am trying to merge three twitterv2 datasets. The processor goes slowly through the first dataset's records, and once it attempts to start merging the second dataset it fails with the following error:

Tue Feb 6 15:50:50 2024: Merged 7,393,300 of 16,165,328 items
Tue Feb 6 15:50:50 2024: Merged 7,393,400 of 16,165,328 items
Tue Feb 6 15:50:50 2024: Merged 7,393,500 of 16,165,328 items
Tue Feb 6 15:50:50 2024: Merged 7,393,600 of 16,165,328 items
Tue Feb 6 15:50:50 2024: Merged 7,393,700 of 16,165,328 items
Tue Feb 6 15:50:50 2024: Cannot merge datasets - not the same set of attributes per item (are they not the same type or has one been altered by a processor?)

Expected behavior: As far as I recall, the datasets have not been modified, and in the preview they seem to match.

4CAT Environment

Screenshots, links to datasets, and any additional context: The first 1000 lines of each dataset's ndjson file are attached below.

file1.json file2.json file3.json
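
A rough way to check whether samples like these really carry the same fields is to compare the metric keys found in each file. The sketch below assumes each NDJSON line is one tweet object with its metrics under a `public_metrics` key; the function name `collect_metric_keys` and the sample lines are made up for illustration. It only inspects the raw NDJSON, whereas the merge processor compares mapped items, so treat the result as a hint rather than proof.

```python
import json
from typing import Iterable, Set


def collect_metric_keys(lines: Iterable[str], field: str = "public_metrics") -> Set[str]:
    """Return every key seen under `field` across the given NDJSON lines."""
    keys: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # Treat a missing or null metrics object as empty.
        keys.update((item.get(field) or {}).keys())
    return keys


# Made-up sample lines standing in for two of the attached files.
sample_a = ['{"id": "1", "public_metrics": {"like_count": 2, "reply_count": 0}}']
sample_b = ['{"id": "2", "public_metrics": {"like_count": 5, "reply_count": 1, "impression_count": 40}}']

# Keys present in the second sample but absent from the first.
only_in_b = collect_metric_keys(sample_b) - collect_metric_keys(sample_a)
print(sorted(only_in_b))  # ['impression_count']
```

For real files, an open file handle can be passed directly, for example `collect_metric_keys(open("file1.json", encoding="utf-8"))`.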

stijn-uva commented 8 months ago

Thanks for reporting! It looks like one of the datasets was captured earlier or later than the others and lacks some metrics that are present in the others. Commit ce2b2d5674881d850470ff49bcad22f87e7a45a0 ensures that all metrics are always included in a mapped item, with a value of 0 if the metric is not present, which fixes this bug.
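
The idea behind that fix can be sketched as follows. This is not the code from the commit: the function name `map_metrics`, the `EXPECTED_METRICS` list, and the sample tweets are assumptions for illustration only. The point is that every mapped item gets the same set of metric attributes, with 0 standing in for any metric the captured tweet lacks.

```python
# Hypothetical list of metrics; the actual list is defined in 4CAT's own mapping code.
EXPECTED_METRICS = (
    "retweet_count",
    "reply_count",
    "like_count",
    "quote_count",
    "impression_count",
)


def map_metrics(tweet: dict) -> dict:
    """Return all expected metrics, defaulting to 0 when one is absent."""
    metrics = tweet.get("public_metrics") or {}
    return {name: metrics.get(name, 0) for name in EXPECTED_METRICS}


# Made-up tweets: the first lacks a metric that the second has.
older = {"id": "1", "public_metrics": {"retweet_count": 3, "reply_count": 1,
                                       "like_count": 7, "quote_count": 0}}
newer = {"id": "2", "public_metrics": {"retweet_count": 0, "reply_count": 2,
                                       "like_count": 4, "quote_count": 1,
                                       "impression_count": 120}}

# Both mapped items now expose the same attributes, so they can be merged.
assert map_metrics(older).keys() == map_metrics(newer).keys()
```

One consequence of this design is that a 0 in the merged data can mean either "the metric was zero" or "the metric was not captured", which is worth keeping in mind during analysis.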

We will make a new release later this week; after updating to it, merging should work.