DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!
MIT License

ids in jsonl file don't match original tweet ids #126

Closed rssaketh closed 2 years ago

rssaketh commented 2 years ago

Hi, I'm trying to hydrate some COVID-related tweets from a list of ids provided by the dataset creators. When I hydrate the tweets and read the resulting jsonl file, the tweet ids don't match the original ids I supplied for hydration, and I'm not sure why.

To narrow it down, I created a reduced set of 10 tweet ids and hydrated them:

1409530436481687559,1420581355176480770,1415615378546466819,1425871126014615558,1409480196944760833,1396357926990667784,1397088197054697473,1415037360706834438,1422531324997521408, 1424781156554186757

The generated jsonl file has the following id values:

1415615378546466800,1397088197054697500,1409480196944760800,1415037360706834400,1425871126014615600,1424781156554186800,1409530436481687600,1420581355176480800

and the following id_str values, in the same order:

1415615378546466816,1397088197054697472,1409480196944760832,1415037360706834432,1425871126014615552,1424781156554186752,1409530436481687552,1420581355176480768

I have a couple of questions:

  1. From what I understand, id_str is the string version of id, provided so that the long integer never has to be parsed as a number. I didn't open the jsonl file in Excel or any similar tool (no typecasting done). Why don't id_str and id match?
  2. I need to match each extracted tweet (from the jsonl file) to the original list of ids provided for hydration, but because the ids changed during hydration I cannot map them back. To give you an idea, I have 880k tweet ids, of which only 13k matched. Why are the tweet ids changing during hydration, and how can I avoid that?

Any help is greatly appreciated. Thanks

edsu commented 2 years ago

Duplicate of #25
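
For readers who land here, the numbers reported above are consistent with the ids having passed through a 64-bit floating-point value at some point. That is an assumption based only on the arithmetic, not a statement about what the hydrator's code actually does. Doubles in this range can only represent multiples of 256, so 1415615378546466819 becomes 1415615378546466816 (the reported id_str), and the shortest decimal that identifies that double is 1415615378546466800 (the reported id). The following is a minimal Python sketch of that arithmetic; the function names `through_float64`, `shortest_decimal`, and `match_rounded` are hypothetical helpers written for this illustration.

```python
from decimal import Decimal


def through_float64(tweet_id: int) -> int:
    """Round an id to the nearest 64-bit float and return the exact value stored."""
    return int(float(tweet_id))


def shortest_decimal(tweet_id: int) -> int:
    """Return the shortest decimal that still identifies the same 64-bit float.

    Python's repr(float) yields the shortest round-tripping digits; Decimal
    expands the scientific notation back into a plain integer.
    """
    return int(Decimal(repr(float(tweet_id))))


def match_rounded(original_ids, rounded_ids):
    """Map each rounded id back to the original ids that round to it.

    Assumption: the rounded ids were produced by float64 rounding. Several
    originals may share one rounded value, so each entry is a list.
    """
    by_rounded = {}
    for original in original_ids:
        by_rounded.setdefault(through_float64(original), []).append(original)
    return {rounded: by_rounded.get(rounded, []) for rounded in rounded_ids}


if __name__ == "__main__":
    # Three of the original ids quoted in the issue.
    originals = [1409530436481687559, 1415615378546466819, 1397088197054697473]
    for tweet_id in originals:
        print(tweet_id, through_float64(tweet_id), shortest_decimal(tweet_id))
```

If that assumption holds for a given file, `match_rounded` gives one possible way to map rounded ids back to the originals. Two originals that fall within the same 256-wide window would be indistinguishable, so ambiguous matches still need manual checking.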