DocNow / hydrator

Turn Tweet IDs into Twitter JSON & CSV from your desktop!

Tweet IDs Read/Hydrated is greater than Total Tweet IDs and Hydrator keeps fetching tweets #61

Open leslie-huang opened 3 years ago

leslie-huang commented 3 years ago

Hi, I'm trying to hydrate a subset of the 115th Congress tweets from https://catalog.docnow.io/datasets/20190222-115th-us-congress-tweet-ids/ (I already have a partly overlapping dataset).

Hydrator is still fetching tweets even after Total Tweet Ids Read has exceeded total tweet IDs. The "Stop" button has been replaced with the "CSV" button (which makes it seem like it's done?) but the number of tweets read keeps going up.

Screen Shot 2020-10-01 at 7 19 36 PM

And the dataset looks like this:

Screen Shot 2020-10-01 at 7 26 55 PM

(5% was roughly the percentage throughout hydration but doesn't make sense with the other numbers)

Should I hydrate this file again? Or use twarc? Thanks for any advice!

edsu commented 3 years ago

Oh no, that is not good! 5% is very low. If you are on a Mac I would be interested to know if you could count how many lines are in your JSON file.

wc -l tweets.jsonl

Were you hydrating two datasets at the same time? That is something I can test with too. Did you leave the computer on continuously or did you close the lid of your computer? I'm just trying to understand what might have happened so I can try to test & fix it.

I have used Hydrator with datasets of this size before. But it could be that there is some kind of network or authentication problem that it is not handling properly.

edsu commented 3 years ago

I downloaded the same dataset and started hydrating the representatives-1.txt and senators-1.txt files concurrently. One thing I noticed is that the total number of tweet ids (prior to hydration) is different from yours.

Are you sure this is the dataset you are using? If you can share your tweet id files with me I can test with them.

Screenshot from 2020-10-01 20-58-22

leslie-huang commented 3 years ago

> I downloaded the same dataset and started hydrating the representatives-1.txt and senators-1.txt files concurrently. One thing I noticed is that the total number of tweet ids (prior to hydration) is different from yours.
>
> * representatives-1.txt: 1,522,189
> * senators-1.txt: 519,210
>
> Are you sure this is the dataset you are using? If you can share your tweet id files with me I can test with them.
>
> Screenshot from 2020-10-01 20-58-22

Hi Ed!

I'm using this txt file of tweet ids: https://github.com/leslie-huang/congress_tweetdata_prelim/blob/master/tweet_ids_to_hydrate.txt I combined the reps + senators ids from the 115th Congress dataverse files, and then removed the ids that are already in another dataset of tweets that I have (a set difference), so that's why the total number of tweet ids is smaller. I did the same thing for the 116th Congress too (checking for duplicates against both my existing data and the 115th).
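The ID list described above (union of the source files, minus IDs already held) could be built roughly like this. It is only a sketch: the function names and file layout (one tweet ID per line) are assumptions for illustration, not the script that was actually used.

```python
from typing import Iterable, Set


def read_ids(path: str) -> Set[str]:
    """Read one tweet ID per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def ids_to_hydrate(source_files: Iterable[str], already_have: Set[str]) -> Set[str]:
    """Union the IDs from all source files, then drop IDs already held."""
    wanted: Set[str] = set()
    for path in source_files:
        wanted |= read_ids(path)
    return wanted - already_have


def write_ids(ids: Set[str], path: str) -> None:
    """Write IDs one per line, sorted numerically so the output is stable."""
    with open(path, "w", encoding="utf-8") as fh:
        for tweet_id in sorted(ids, key=int):
            fh.write(tweet_id + "\n")
```

Keeping the IDs as strings avoids any numeric rounding, and because the result is a set, an ID that appears in both source files is only requested once.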

leslie-huang commented 3 years ago

> Oh no, that is not good! 5% is very low. If you are on a Mac I would be interested to know if you could count how many lines are in your JSON file.
>
> wc -l tweets.jsonl
>
> Were you hydrating two datasets at the same time? That is something I can test with too. Did you leave the computer on continuously or did you close the lid of your computer? I'm just trying to understand what might have happened so I can try to test & fix it.
>
> I have used Hydrator with datasets of this size before. But it could be that there is some kind of network or authentication problem that it is not handling properly.

The number of lines in the jsonl is 2003053 🤔

I started downloading the 115th dataset first, then started downloading the 116th dataset concurrently after 400k+ items in the 115th dataset were done. I haven't closed the lid on my computer and I set it to not sleep for about 4 hours while I was away from my desk. I didn't have to reauthenticate or anything like that at any point. Hope this info helps!

I'm writing a quick script to check the ids from the jsonl against the tweet ids that I initially requested... will report results soon! Thanks for your help!

edsu commented 3 years ago

One thing you can try doing is starting the hydrator again but from a terminal so you can see the log messages. If you are on a Mac I think you can open a terminal and then start the Hydrator like this:

/Applications/Hydrator.app/Contents/MacOS/Hydrator

It would be interesting to see if those lines in the file you counted have JSON on them or not. I would be interested in the end of the file. The tweet ID file you directed me to looks fine. I thought it was strange that there were old/short tweet ids in there. But I guess they pulled from users' timelines and some of them haven't used Twitter a lot!
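One way to check whether every counted line actually holds JSON is a small scan like the one below. This is a sketch under the assumption that the output file is newline-delimited JSON; `scan_jsonl` is a hypothetical helper name, not part of Hydrator or twarc.

```python
import json
from typing import List, Tuple


def scan_jsonl(path: str) -> Tuple[int, List[int]]:
    """Return (total_lines, line_numbers_that_failed_to_parse)."""
    total = 0
    bad: List[int] = []
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, start=1):
            total += 1
            try:
                json.loads(line)
            except json.JSONDecodeError:
                # Blank lines and non-JSON text both end up here.
                bad.append(lineno)
    return total, bad
```

Calling `scan_jsonl("tweets.jsonl")` gives the same total as `wc -l` for a file that ends with a newline, plus the line numbers worth inspecting by hand.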

edsu commented 3 years ago

Also, twarc is always an option if Hydrator is giving you trouble. The only thing you will need to have are developer keys. But that is probably the hardest part. I'm happy to help you use twarc if Hydrator continues to be a problem.

leslie-huang commented 3 years ago

Hi Ed, I've finished scanning through the jsonl file and things are looking a little wonky.

I requested: 1671651 tweets

Tweets collected = 2003052 (skipped one line in the json, details below)

After checking for duplicates... collected tweets = 1188229

requested.difference(collected) = 483422 (items requested but did not collect)

collected.difference(requested) = 0 (items collected but did not request)

tldr: So about 1.2 million unique requested tweets were collected, with ~800k duplicate lines, and there are ~480k requested tweets that haven't been collected (some of these may be tweets that have since been deleted).
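The comparison above can be reproduced with a short script along these lines. It is a sketch, not the script that produced the numbers: `summarize` and its result keys are made-up names. It keys on `id_str` rather than `id`, since the numeric `id` values in the tweet JSON pasted in this thread appear rounded relative to `id_str`.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Set


def summarize(requested: Set[str], jsonl_lines: Iterable[str]) -> Dict[str, int]:
    """Compare requested tweet IDs with the IDs found in hydrated JSONL lines."""
    seen: Counter = Counter()
    unparsable = 0
    for line in jsonl_lines:
        try:
            tweet = json.loads(line)
            seen[tweet["id_str"]] += 1
        except (json.JSONDecodeError, KeyError, TypeError):
            # Not valid JSON, or valid JSON that is not a tweet object.
            unparsable += 1
    collected = set(seen)
    parsed = sum(seen.values())
    return {
        "requested": len(requested),
        "lines_parsed": parsed,
        "unique_collected": len(collected),
        "duplicate_lines": parsed - len(collected),
        "missing": len(requested - collected),      # requested but not collected
        "unrequested": len(collected - requested),  # collected but not requested
        "unparsable": unparsable,
    }
```

Passing an open file handle as `jsonl_lines` streams the file one line at a time, so only the IDs are held in memory.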

It doesn't look like I can restart this specific dataset in Hydrator again but it's all good! I'll just generate a new list of tweet ids that I don't already have and delete the duplicates from my json. I just got a developer account with Twitter so I'll check out twarc if this doesn't work.
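A rough sketch of that cleanup, assuming each line of the file is one tweet object with an `id_str` field; `dedupe_jsonl` and `still_missing` are hypothetical helper names used only for illustration:

```python
import json
from typing import Iterable, Iterator, Set


def dedupe_jsonl(lines: Iterable[str], seen: Set[str]) -> Iterator[str]:
    """Yield the first line for each id_str, recording the IDs in `seen`."""
    for line in lines:
        try:
            tweet_id = json.loads(line)["id_str"]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # skip lines that are not tweet objects
        if tweet_id not in seen:
            seen.add(tweet_id)
            # Normalize the line ending so the output stays one tweet per line.
            yield line if line.endswith("\n") else line + "\n"


def still_missing(requested: Set[str], seen: Set[str]) -> Set[str]:
    """IDs that were requested but never appeared in the hydrated file."""
    return requested - seen
```

After writing the deduplicated lines to a new file, `still_missing(requested, seen)` gives the IDs to put in the next hydration list. The generator has to be fully consumed first, because `seen` is only filled in as lines are read.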

Thanks a lot for your help! Let me know if I can provide any other info to help with debugging whatever the root problem is here.

There was an exception for just one tweet in the jsonl (it was somewhere in the middle):

raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

The line it was trying to parse was

{'created_at': 'Mon Dec 16 23:19:38 +0000 2013', 'id': 412723723255283700, 'id_str': '412723723255283712', 'full_text': 'Sen. Franken presses for increasing investments in clean energy to support 21st-century economy & grow MN industry. http://t.co/kbKBAtPVxT', 'truncated': False, 'display_text_range': [0, 142], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'http://t.co/kbKBAtPVxT', 'expanded_url': 'http://1.usa.gov/19ObGZ6', 'display_url': '1.usa.gov/19ObGZ6', 'indices': [120, 142]}]}, 'source': 'Twitter Web Client', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 171968009, 'id_str': '171968009', 'name': 'U.S. Senator Al Franken', 'screen_name': 'SenFranken', 'location': '', 'description': 'Former U.S. Senator Al Franken of Minnesota.', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 446872, 'friends_count': 1107, 'listed_count': 2255, 'created_at': 'Wed Jul 28 16:11:35 +0000 2010', 'favourites_count': 221, 'utc_offset': None, 'time_zone': None, 'geo_enabled': True, 'verified': True, 'statuses_count': 3991, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': '022241', 'profile_background_image_url': 'http://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_image_url_https': 'https://abs.twimg.com/images/themes/theme1/bg.png', 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/885344348518449152/Fm9yfbNh_normal.jpg', 'profile_image_url_https': 'https://pbs.twimg.com/profile_images/885344348518449152/Fm9yfbNh_normal.jpg', 'profile_banner_url': 'https://pbs.twimg.com/profile_banners/171968009/1508181962', 'profile_link_color': '0084B4', 'profile_sidebar_border_color': 'C0DEED', 'profile_sidebar_fill_color': 'DDEEF6', 
'profile_text_color': '333333', 'profile_use_background_image': True, 'has_extended_profile': False, 'default_profile': False, 'default_profile_image': False, 'following': False, 'follow_request_sent': False, 'notifications': False, 'translator_type': 'none'}, 'geo': None, 'coordinates': None, 'place': None, 'contributors': None, 'is_quote_status': False, 'retweet_count': 0, 'favorite_count': 0, 'favorited': False, 'retweeted': False, 'possibly_sensitive': False, 'lang': 'en'}

leslie-huang commented 3 years ago

> One thing you can try doing is starting the hydrator again but from a terminal so you can see the log messages. If you are on a Mac I think you can open a terminal and then start the Hydrator like this:
>
> /Applications/Hydrator.app/Contents/MacOS/Hydrator
>
> It would be interesting to see if those lines in the file you counted have JSON on them or not. I would be interested in the end of the file. The tweet ID file you directed me to looks fine. I thought it was strange that there were old/short tweet ids in there. But I guess they pulled from users' timelines and some of them haven't used Twitter a lot!

here's the tail of the .jsonl:

{"created_at":"Fri Jul 08 17:15:07 +0000 2016","id":751464646222057500,"id_str":"751464646222057472","full_text":"Our Suicide Prevention Act was signed into law. We must help those who bravely served our nation https://t.co/EQdhktOcss","truncated":false,"display_text_range":[0,120],"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[{"url":"https://t.co/EQdhktOcss","expanded_url":"http://www.ernst.senate.gov/public/index.cfm/press-releases?ID=CAE893B3-A52A-4C9E-977B-31118E7F2EB6","display_url":"ernst.senate.gov/public/index.c…","indices":[97,120]}]},"source":"<a href=\"https://about.twitter.com/products/tweetdeck\" rel=\"nofollow\">TweetDeck</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2856787757,"id_str":"2856787757","name":"Joni Ernst","screen_name":"SenJoniErnst","location":"Iowa","description":"Follow for news and updates from the Office of Iowa Senator Joni K. Ernst. 
Tweets from Joni are signed JKE.","url":"http://t.co/zUYK5PQEOd","entities":{"url":{"urls":[{"url":"http://t.co/zUYK5PQEOd","expanded_url":"http://Ernst.Senate.Gov","display_url":"Ernst.Senate.Gov","indices":[0,22]}]},"description":{"urls":[]}},"protected":false,"followers_count":102484,"friends_count":158,"listed_count":1513,"created_at":"Sun Nov 02 13:32:54 +0000 2014","favourites_count":275,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":5155,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/1191837105045098501/bXu9BM8A_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/1191837105045098501/bXu9BM8A_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/2856787757/1568067950","profile_link_color":"4A913C","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":false,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":9,"favorite_count":20,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"}
{"created_at":"Sat Nov 26 16:16:55 +0000 2016","id":802546686350528500,"id_str":"802546686350528517","full_text":"RT @HouseCommerce: .@RepFredUpton &amp; @SenAlexander: #CuresNow up for House vote next week. More on agreement &gt;&gt; https://t.co/6XUIuoGinT","truncated":false,"display_text_range":[0,144],"entities":{"hashtags":[{"text":"CuresNow","indices":[55,64]}],"symbols":[],"user_mentions":[{"screen_name":"HouseCommerce","name":"Energy & Commerce GOP","id":114756202,"id_str":"114756202","indices":[3,17]},{"screen_name":"RepFredUpton","name":"Fred Upton #WearYourMask","id":124224165,"id_str":"124224165","indices":[20,33]},{"screen_name":"SenAlexander","name":"Sen. Lamar Alexander","id":76649729,"id_str":"76649729","indices":[40,53]}],"urls":[{"url":"https://t.co/6XUIuoGinT","expanded_url":"http://bit.ly/2gsjJbU","display_url":"bit.ly/2gsjJbU","indices":[121,144]}]},"source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":15600527,"id_str":"15600527","name":"John Shimkus","screen_name":"RepShimkus","location":"Illinois, USA","description":"I represent the 15th Congressional District of Illinois and Chair the @HouseCommerce Environment Subcommittee.","url":"https://t.co/mFzyexaRq4","entities":{"url":{"urls":[{"url":"https://t.co/mFzyexaRq4","expanded_url":"http://shimkus.house.gov","display_url":"shimkus.house.gov","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":31068,"friends_count":1082,"listed_count":1352,"created_at":"Fri Jul 25 17:01:16 +0000 
2008","favourites_count":182,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":10422,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"709397","profile_background_image_url":"http://abs.twimg.com/images/themes/theme6/bg.gif","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme6/bg.gif","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/451465105838833664/CB7M3wqZ_normal.jpeg","profile_image_url_https":"https://pbs.twimg.com/profile_images/451465105838833664/CB7M3wqZ_normal.jpeg","profile_banner_url":"https://pbs.twimg.com/profile_banners/15600527/1497488884","profile_link_color":"FF3300","profile_sidebar_border_color":"86A4A6","profile_sidebar_fill_color":"A0C5C7","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"created_at":"Sat Nov 26 15:24:15 +0000 2016","id":802533432068833300,"id_str":"802533432068833280","full_text":".@RepFredUpton &amp; @SenAlexander: #CuresNow up for House vote next week. More on agreement &gt;&gt; https://t.co/6XUIuoGinT","truncated":false,"display_text_range":[0,125],"entities":{"hashtags":[{"text":"CuresNow","indices":[36,45]}],"symbols":[],"user_mentions":[{"screen_name":"RepFredUpton","name":"Fred Upton #WearYourMask","id":124224165,"id_str":"124224165","indices":[1,14]},{"screen_name":"SenAlexander","name":"Sen. 
Lamar Alexander","id":76649729,"id_str":"76649729","indices":[21,34]}],"urls":[{"url":"https://t.co/6XUIuoGinT","expanded_url":"http://bit.ly/2gsjJbU","display_url":"bit.ly/2gsjJbU","indices":[102,125]}]},"source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":114756202,"id_str":"114756202","name":"Energy & Commerce GOP","screen_name":"HouseCommerce","location":"Washington, D.C.","description":"House Energy & Commerce Committee Republicans | Republican Leader @repgregwalden","url":"https://t.co/N1b9OueDJW","entities":{"url":{"urls":[{"url":"https://t.co/N1b9OueDJW","expanded_url":"http://republicans-energycommerce.house.gov/","display_url":"republicans-energycommerce.house.gov","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":32689,"friends_count":1090,"listed_count":1254,"created_at":"Tue Feb 16 14:18:06 +0000 
2010","favourites_count":682,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":16194,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"050000","profile_background_image_url":"http://abs.twimg.com/images/themes/theme7/bg.gif","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme7/bg.gif","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/1080261658612850689/ZyZO-MCx_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/1080261658612850689/ZyZO-MCx_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/114756202/1504288315","profile_link_color":"A79777","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"C7C5C5","profile_text_color":"28539E","profile_use_background_image":true,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":16,"favorite_count":8,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"},"is_quote_status":false,"retweet_count":16,"favorite_count":0,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"}
{"created_at":"Tue Jan 03 23:43:22 +0000 2017","id":816429779691810800,"id_str":"816429779691810816","full_text":"As the 115th Congress begins, remember: most Americans did not vote for @realDonaldTrump. Despite GOP's claims, he has no mandate to lead. https://t.co/auj3aNuixe","truncated":false,"display_text_range":[0,138],"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"realDonaldTrump","name":"Donald J. Trump","id":25073877,"id_str":"25073877","indices":[72,88]}],"urls":[],"media":[{"id":816428909378027500,"id_str":"816428909378027521","indices":[139,162],"media_url":"http://pbs.twimg.com/amplify_video_thumb/816428909378027521/img/c_eLha3YIrOaI8_E.jpg","media_url_https":"https://pbs.twimg.com/amplify_video_thumb/816428909378027521/img/c_eLha3YIrOaI8_E.jpg","url":"https://t.co/auj3aNuixe","display_url":"pic.twitter.com/auj3aNuixe","expanded_url":"https://twitter.com/SenJeffMerkley/status/816429779691810816/video/1","type":"photo","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":640,"h":360,"resize":"fit"},"medium":{"w":640,"h":360,"resize":"fit"},"small":{"w":640,"h":360,"resize":"fit"}}}]},"extended_entities":{"media":[{"id":816428909378027500,"id_str":"816428909378027521","indices":[139,162],"media_url":"http://pbs.twimg.com/amplify_video_thumb/816428909378027521/img/c_eLha3YIrOaI8_E.jpg","media_url_https":"https://pbs.twimg.com/amplify_video_thumb/816428909378027521/img/c_eLha3YIrOaI8_E.jpg","url":"https://t.co/auj3aNuixe","display_url":"pic.twitter.com/auj3aNuixe","expanded_url":"https://twitter.com/SenJeffMerkley/status/816429779691810816/video/1","type":"video","sizes":{"thumb":{"w":150,"h":150,"resize":"crop"},"large":{"w":640,"h":360,"resize":"fit"},"medium":{"w":640,"h":360,"resize":"fit"},"small":{"w":640,"h":360,"resize":"fit"}},"video_info":{"aspect_ratio":[16,9],"duration_millis":76026,"variants":[{"bitrate":832000,"content_type":"video/mp4","url":"https://video.twimg.com/amplify_video/816428909378027521/vid/640x36
0/HASi1AQ8hhRUQEk6.mp4"},{"content_type":"application/x-mpegURL","url":"https://video.twimg.com/amplify_video/816428909378027521/pl/oittV30KVoegL_Xx.m3u8"},{"bitrate":320000,"content_type":"video/mp4","url":"https://video.twimg.com/amplify_video/816428909378027521/vid/320x180/YAOPWwS9e0oilzdY.mp4"}]},"additional_media_info":{"title":"","description":"","embeddable":true,"monetizable":false}}]},"source":"<a href=\"https://studio.twitter.com\" rel=\"nofollow\">Twitter Media Studio</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":29201047,"id_str":"29201047","name":"Senator Jeff Merkley","screen_name":"SenJeffMerkley","location":"Oregon","description":"U.S. Senator from the State of Oregon. Instagram: https://t.co/pECsFBm1G1","url":"https://t.co/uoOTYAkx5J","entities":{"url":{"urls":[{"url":"https://t.co/uoOTYAkx5J","expanded_url":"http://merkley.senate.gov","display_url":"merkley.senate.gov","indices":[0,23]}]},"description":{"urls":[{"url":"https://t.co/pECsFBm1G1","expanded_url":"http://instagram.com/senjeffmerkley","display_url":"instagram.com/senjeffmerkley","indices":[50,73]}]}},"protected":false,"followers_count":486813,"friends_count":1008,"listed_count":4494,"created_at":"Mon Apr 06 13:38:39 +0000 
2009","favourites_count":1146,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":12018,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"457292","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/873324219630854144/-7ZzOONo_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/873324219630854144/-7ZzOONo_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/29201047/1473699074","profile_link_color":"93A427","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"D9E8F2","profile_text_color":"333333","profile_use_background_image":false,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":410,"favorite_count":754,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"}
{"created_at":"Sun May 08 17:12:40 +0000 2011","id":67275736951685120,"id_str":"67275736951685121","full_text":"I'll be on \"Inspector America\" focusing on the San Bruno pipeline explosion, History Channel, 10p ET/11p PT. http://bit.ly/mvFJIH","truncated":false,"display_text_range":[0,129],"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[]},"source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":24913074,"id_str":"24913074","name":"Jackie Speier","screen_name":"RepSpeier","location":"","description":"Fighter for women’s equality, our troops, LGBTQ rights & the disenfranchised. Proud mom of 2 kids & puppy Emma, wife & Rep of CA-14-biotech and social media hub","url":"https://t.co/kqVDVfirna","entities":{"url":{"urls":[{"url":"https://t.co/kqVDVfirna","expanded_url":"http://speier.house.gov","display_url":"speier.house.gov","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":172940,"friends_count":21377,"listed_count":2198,"created_at":"Tue Mar 17 17:02:38 +0000 
2009","favourites_count":600,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":7098,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/1216755169163202560/7Y1JaC3s_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/1216755169163202560/7Y1JaC3s_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/24913074/1600290037","profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":true,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":1,"favorite_count":0,"favorited":false,"retweeted":false,"lang":"en"}
{"created_at":"Tue Nov 27 00:05:25 +0000 2018","id":1067207732711907300,"id_str":"1067207732711907329","full_text":"@MattLaslo As Sartre would have surely said, \"Hell is other Twitter people.\"","truncated":false,"display_text_range":[11,76],"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"MattLaslo","name":"Matt Laslo","id":26607712,"id_str":"26607712","indices":[0,10]}],"urls":[]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","in_reply_to_status_id":1067204571611697200,"in_reply_to_status_id_str":"1067204571611697152","in_reply_to_user_id":26607712,"in_reply_to_user_id_str":"26607712","in_reply_to_screen_name":"MattLaslo","user":{"id":404132211,"id_str":"404132211","name":"Rep. Rick Larsen","screen_name":"RepRickLarsen","location":"Washington's 2nd District","description":"Born and raised in Arlington, Wash. Chair, Aviation Subcommittee","url":"https://t.co/E2eWkKRAE2","entities":{"url":{"urls":[{"url":"https://t.co/E2eWkKRAE2","expanded_url":"http://larsen.house.gov","display_url":"larsen.house.gov","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":20895,"friends_count":5749,"listed_count":865,"created_at":"Thu Nov 03 13:59:42 +0000 
2011","favourites_count":2978,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":11408,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/804085552064720896/HGM6mA6m_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/804085552064720896/HGM6mA6m_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/404132211/1596643911","profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":true,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":4,"favorited":false,"retweeted":false,"lang":"en"}
{"created_at":"Mon May 14 20:39:31 +0000 2012","id":202136077958512640,"id_str":"202136077958512640","full_text":"Congrats to Elizabeth City State University, Chowan University and Wilson Community College on continued funding... http://t.co/BKdXi9B5","truncated":false,"display_text_range":[0,136],"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[{"url":"http://t.co/BKdXi9B5","expanded_url":"http://fb.me/NiWDXsov","display_url":"fb.me/NiWDXsov","indices":[116,136]}]},"source":"<a href=\"http://www.facebook.com/twitter\" rel=\"nofollow\">Facebook</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":432676344,"id_str":"432676344","name":"G. K. Butterfield","screen_name":"GKButterfield","location":"","description":"U.S. Representative, proudly serving North Carolina's 1st Congressional District. Follow my work in Washington and #NC01.","url":"https://t.co/TqwPgzUt7E","entities":{"url":{"urls":[{"url":"https://t.co/TqwPgzUt7E","expanded_url":"http://www.facebook.com/CongressmanGKButterfield","display_url":"facebook.com/CongressmanGKB…","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":21493,"friends_count":1061,"listed_count":893,"created_at":"Fri Dec 09 17:17:22 +0000 
2011","favourites_count":1885,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":6472,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"022330","profile_background_image_url":"http://abs.twimg.com/images/themes/theme15/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme15/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/1108574414382264323/HQrhf_eC_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/1108574414382264323/HQrhf_eC_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/432676344/1597760036","profile_link_color":"005FB3","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"C0DFEC","profile_text_color":"333333","profile_use_background_image":false,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"}
{"created_at":"Thu Apr 26 13:30:23 +0000 2018","id":989496918035390500,"id_str":"989496918035390465","full_text":"Mike Pompeo will make a great Secretary of State and should be overwhelmingly confirmed. That the vast majority of Dem senators will oppose him just demonstrates that their default policy reflex is knee-jerk opposition to the Trump administration.","truncated":false,"display_text_range":[0,247],"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[]},"source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":1058807868,"id_str":"1058807868","name":"Ron DeSantis","screen_name":"GovRonDeSantis","location":"","description":"46th Governor of the great state of Florida.","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":575693,"friends_count":1314,"listed_count":2485,"created_at":"Thu Jan 03 21:20:17 +0000 
2013","favourites_count":48,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":4328,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"B31616","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/1138060146662354946/6jR-b4Yy_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/1138060146662354946/6jR-b4Yy_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/1058807868/1560169684","profile_link_color":"0084B4","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":false,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":1004,"favorite_count":4280,"favorited":false,"retweeted":false,"lang":"en"}
{"created_at":"Thu Feb 04 13:21:28 +0000 2016","id":695235725869125600,"id_str":"695235725869125633","full_text":"RT @NBCNews: Contraception fell, Medicaid births rose when Texas defunded Planned Parenthood https://t.co/IFMBev6MF5 https://t.co/heg5rWkE23","truncated":false,"display_text_range":[0,140],"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"NBCNews","name":"NBC News","id":14173315,"id_str":"14173315","indices":[3,11]}],"urls":[{"url":"https://t.co/IFMBev6MF5","expanded_url":"http://nbcnews.to/1KrQBeZ","display_url":"nbcnews.to/1KrQBeZ","indices":[93,116]}],"media":[{"id":695235425410142200,"id_str":"695235425410142208","indices":[117,140],"media_url":"http://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","media_url_https":"https://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","url":"https://t.co/heg5rWkE23","display_url":"pic.twitter.com/heg5rWkE23","expanded_url":"https://twitter.com/NBCNews/status/695235425791778817/photo/1","type":"photo","sizes":{"small":{"w":560,"h":320,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":560,"h":320,"resize":"fit"},"large":{"w":560,"h":320,"resize":"fit"}},"source_status_id":695235425791778800,"source_status_id_str":"695235425791778817","source_user_id":14173315,"source_user_id_str":"14173315"}]},"extended_entities":{"media":[{"id":695235425410142200,"id_str":"695235425410142208","indices":[117,140],"media_url":"http://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","media_url_https":"https://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","url":"https://t.co/heg5rWkE23","display_url":"pic.twitter.com/heg5rWkE23","expanded_url":"https://twitter.com/NBCNews/status/695235425791778817/photo/1","type":"photo","sizes":{"small":{"w":560,"h":320,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":560,"h":320,"resize":"fit"},"large":{"w":560,"h":320,"resize":"fit"}},"source_status_id":695235425791778800,"source_status_id_str":"695235425791778817","source_user_id":14173315,"source_user_id_s
tr":"14173315"}]},"source":"<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":404132211,"id_str":"404132211","name":"Rep. Rick Larsen","screen_name":"RepRickLarsen","location":"Washington's 2nd District","description":"Born and raised in Arlington, Wash. Chair, Aviation Subcommittee","url":"https://t.co/E2eWkKRAE2","entities":{"url":{"urls":[{"url":"https://t.co/E2eWkKRAE2","expanded_url":"http://larsen.house.gov","display_url":"larsen.house.gov","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":20895,"friends_count":5749,"listed_count":865,"created_at":"Thu Nov 03 13:59:42 +0000 2011","favourites_count":2978,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":11408,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/804085552064720896/HGM6mA6m_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/804085552064720896/HGM6mA6m_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/404132211/1596643911","profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":true,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweeted_status":{"cre
ated_at":"Thu Feb 04 13:20:17 +0000 2016","id":695235425791778800,"id_str":"695235425791778817","full_text":"Contraception fell, Medicaid births rose when Texas defunded Planned Parenthood https://t.co/IFMBev6MF5 https://t.co/heg5rWkE23","truncated":false,"display_text_range":[0,127],"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[{"url":"https://t.co/IFMBev6MF5","expanded_url":"http://nbcnews.to/1KrQBeZ","display_url":"nbcnews.to/1KrQBeZ","indices":[80,103]}],"media":[{"id":695235425410142200,"id_str":"695235425410142208","indices":[104,127],"media_url":"http://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","media_url_https":"https://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","url":"https://t.co/heg5rWkE23","display_url":"pic.twitter.com/heg5rWkE23","expanded_url":"https://twitter.com/NBCNews/status/695235425791778817/photo/1","type":"photo","sizes":{"small":{"w":560,"h":320,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":560,"h":320,"resize":"fit"},"large":{"w":560,"h":320,"resize":"fit"}}}]},"extended_entities":{"media":[{"id":695235425410142200,"id_str":"695235425410142208","indices":[104,127],"media_url":"http://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","media_url_https":"https://pbs.twimg.com/media/CaX475bW0AAmPIv.jpg","url":"https://t.co/heg5rWkE23","display_url":"pic.twitter.com/heg5rWkE23","expanded_url":"https://twitter.com/NBCNews/status/695235425791778817/photo/1","type":"photo","sizes":{"small":{"w":560,"h":320,"resize":"fit"},"thumb":{"w":150,"h":150,"resize":"crop"},"medium":{"w":560,"h":320,"resize":"fit"},"large":{"w":560,"h":320,"resize":"fit"}}}]},"source":"<a href=\"http://www.hootsuite.com\" rel=\"nofollow\">Hootsuite</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":14173315,"id_str":"14173315","name":"NBC News","screen_name":"NBCNews","location":"New York, NY","description":"News updates from 
around the 🌎, all day, every day. Home of @NBCBLK, @NBCLatino, @NBCAsianAmerica, @NBCOUT & more.","url":"https://t.co/HOBYJP3M3J","entities":{"url":{"urls":[{"url":"https://t.co/HOBYJP3M3J","expanded_url":"http://NBCNews.com/PlanYourVote","display_url":"NBCNews.com/PlanYourVote","indices":[0,23]}]},"description":{"urls":[]}},"protected":false,"followers_count":7788584,"friends_count":1840,"listed_count":43913,"created_at":"Tue Mar 18 23:19:17 +0000 2008","favourites_count":784,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":274670,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"062131","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":true,"profile_image_url":"http://pbs.twimg.com/profile_images/1108426393287868423/CyLn5GVQ_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/1108426393287868423/CyLn5GVQ_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/14173315/1593479928","profile_link_color":"5172A0","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"000000","profile_use_background_image":true,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":159,"favorite_count":101,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"},"is_quote_status":false,"retweet_count":159,"favorite_count":0,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"}
{"created_at":"Mon Dec 16 23:19:38 +0000 2013","id":412723723255283700,"id_str":"412723723255283712","full_text":"Sen. Franken presses for increasing investments in clean energy to support 21st-century economy &amp; grow MN industry. http://t.co/kbKBAtPVxT","truncated":false,"display_text_range":[0,142],"entities":{"hashtags":[],"symbols":[],"user_mentions":[],"urls":[{"url":"http://t.co/kbKBAtPVxT","expanded_url":"http://1.usa.gov/19ObGZ6","display_url":"1.usa.gov/19ObGZ6","indices":[120,142]}]},"source":"<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</a>","in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":171968009,"id_str":"171968009","name":"U.S. Senator Al Franken","screen_name":"SenFranken","location":"","description":"Former U.S. Senator Al Franken of Minnesota.","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":446872,"friends_count":1107,"listed_count":2255,"created_at":"Wed Jul 28 16:11:35 +0000 
2010","favourites_count":221,"utc_offset":null,"time_zone":null,"geo_enabled":true,"verified":true,"statuses_count":3991,"lang":null,"contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"022241","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://pbs.twimg.com/profile_images/885344348518449152/Fm9yfbNh_normal.jpg","profile_image_url_https":"https://pbs.twimg.com/profile_images/885344348518449152/Fm9yfbNh_normal.jpg","profile_banner_url":"https://pbs.twimg.com/profile_banners/171968009/1508181962","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"has_extended_profile":false,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false,"translator_type":"none"},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"en"}

And yes, I noticed the short tweet ids too. I haven't checked yet whether they were successfully fetched, but I can follow up if that would be of interest to you.

edsu commented 3 years ago

@leslie-huang interesting! So am I reading your note correctly that the same ids were requested over and over?

I didn't see anything wrong with the line of JSON you pasted. But it was printed out as a Python object ("false" was False, null was None, etc).
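
For anyone comparing the two forms, here is a minimal sketch of the difference, using a made-up three-field line rather than a real tweet. Printing a parsed object shows Python's `False`/`None`, while `json.dumps` turns it back into valid JSON:

```python
import json

# A made-up, shortened JSONL line (not a real tweet from the dataset).
line = '{"truncated": false, "geo": null, "retweet_count": 1004}'

tweet = json.loads(line)       # parse one JSONL line into a Python dict
as_python = repr(tweet)        # what print(tweet) shows: False / None, single quotes
as_json = json.dumps(tweet)    # serialized back to valid JSON: false / null

print(as_python)
print(as_json)
```

So if a pasted line shows `False` or `None`, it was most likely printed from Python after parsing, and the file on disk may still be fine.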

I was able to hydrate both datasets concurrently. But it feels like perhaps your Hydrator ran into a network or authentication problem that wasn't handled properly. This might be related to work that needs to happen on #57.

Screenshot from 2020-10-04 06-44-27

leslie-huang commented 3 years ago

Yes, when I looked at the unique tweet ids in the jsonl file of ~2 million tweets, there were only ~1.2 million unique tweet ids, so about 800k duplicates. I didn't look closely at whether it was 800k copies of one tweet (for example) or a different breakdown, but since then I've hydrated a few more lists of tweets without any issues!
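
In case it helps anyone else, a check like this can be sketched in a few lines of Python. This is a rough sketch, not necessarily the way the numbers above were produced. The function name `count_tweet_ids` is made up for illustration, and it assumes one JSON object per line with an `id_str` field. It keys on `id_str` rather than `id` because the numeric `id` looks rounded in the sample lines pasted above.

```python
import json
from collections import Counter


def count_tweet_ids(lines):
    """Return (total, unique, duplicates) for an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)["id_str"]] += 1
    total = sum(counts.values())
    unique = len(counts)
    return total, unique, total - unique


# Usage (assumes the hydrated file is named tweets.jsonl):
# with open("tweets.jsonl", encoding="utf-8") as f:
#     print(count_tweet_ids(f))
```

The `Counter` it builds could also answer the open question of whether the duplicates are many copies of a few tweets or spread evenly (for example with `counts.most_common(10)`, if the function were changed to return it).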

This tool has saved me a lot of time in putting together a dataset, thank you for maintaining it!

edsu commented 3 years ago

Thanks for noticing the repeated tweet identifiers in the jsonl. It will help me diagnose what might be going on here. I'm glad to hear that it is working again!

Gautamshahi commented 3 years ago

Hi,

I don't see any solution for this issue. I counted the lines, and there are more of them than the actual number of tweet IDs.

Regards,

edsu commented 3 years ago

Yes this issue is still open. I think that the bug is related to some kind of unhandled network error, or perhaps an API error during hydration. If you can easily share your ids and your jsonl file with me at ehs@pobox.com it might help me diagnose what is going on.

Gautamshahi commented 3 years ago

Okay, thank you. Yes, I can share the ids. Actually, this error happens for only a few files. One more point: sometimes the green colour overflows the progress bar, while sometimes it stays in the middle of the progress bar even though it has already crawled all the tweets.

edsu commented 3 years ago

Thanks. It's good to know that it is working sometimes. So does the same tweet ID file repeatedly cause a problem? If so that would be very helpful for me to test with.

Gautamshahi commented 3 years ago

I sent the file; it has around 30 million tweet IDs. Please update if you find the bug. Yes, the same file. I tried 10-12 big files. Only 2 had the issue, and I shared one of them with you.

edsu commented 3 years ago

Thanks! So just to be clear: you have attempted to hydrate this file more than once and it has created the same problem?

Gautamshahi commented 3 years ago

Welcome :) Yes, I tried 2 times.

edsu commented 3 years ago

Experimenting with a small set of ids while flipping my wifi connection on and off resulted in getting the progress bar to overflow (see Short Test 4 in the screenshot below). But this didn't happen reliably: sometimes the hydration finished ok. So this error seems to be related to a timing issue. I'm guessing it is Promise related, down in utils.twitter. I noticed this error on my console (I was running in development mode).

{"errno":"EAI_AGAIN","code":null,"syscall":"getaddrinfo","hostname":"api.twitter.com","statusCode":null,"allErrors":[],"twitterReply":""}
unexpected error during hydration: Error: getaddrinfo EAI_AGAIN api.twitter.com, sleeping 10000

When I took a look at the hydrated data for Short Test 4 I could see that 157 of the tweet ids were fetched twice. I think there must be some kind of error condition where multiple asynchronous requests are being made for the same set of tweet ids. This results in the fetched tweet ids overrunning the total number of tweet ids in the dataset being hydrated.

I'll keep investigating but wanted to drop some notes in here so I remember them.

Screenshot from 2020-10-29 09-17-05

Gautamshahi commented 3 years ago

Thanks a lot for your effort. Do you still need the JSON file from my side?

edsu commented 3 years ago

No, I don't think I need the jsonl file @Gautamshahi. If you are on a Unix system and want to see if your jsonl contains duplicates you can do this (assuming you have jq installed).

jq -r .id_str tweets.jsonl | sort | uniq -c | sort -n

If you see lines at the end that start with a number other than 1 that means you got duplicates too.
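
If you do have duplicates and just want a clean file while this is being fixed, something like the following might work. It is a minimal sketch, not part of Hydrator: `dedupe_jsonl` and the output filename are made-up names, and it assumes every line is a JSON object with an `id_str` field. It keeps the first copy of each tweet and drops later ones.

```python
import json


def dedupe_jsonl(lines):
    """Yield each JSONL line whose id_str has not been seen before."""
    seen = set()
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        tweet_id = json.loads(stripped)["id_str"]
        if tweet_id in seen:
            continue  # a later copy of an already written tweet
        seen.add(tweet_id)
        yield stripped


# Usage (tweets.dedup.jsonl is a hypothetical output name):
# with open("tweets.jsonl", encoding="utf-8") as src, \
#         open("tweets.dedup.jsonl", "w", encoding="utf-8") as dst:
#     for line in dedupe_jsonl(src):
#         dst.write(line + "\n")
```

It holds every distinct id in memory, so for a file with tens of millions of tweets expect it to need a fair amount of RAM.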

edsu commented 3 years ago

I can replicate it reliably now by:

  1. starting hydration
  2. turning off wifi
  3. stopping hydration (clicking stop button)
  4. restarting hydration (clicking start button)
  5. re-enabling wifi
  6. letting hydration complete

This makes me happy, because now it's possible to fix it!
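
To make the suspicion concrete, here is a toy model of one way the steps above could produce duplicates. It is only an illustration under assumptions, not Hydrator's actual code (which is not Python): `hydration_loop`, `stop_and_restart` and `active_run` are all made-up names. The idea is that stopping and restarting while the first loop is still waiting leaves two loops working through the same remaining ids, and that a simple "am I still the active run?" check would prevent it.

```python
import asyncio


async def hydration_loop(ids, fetched, state, run_id, use_guard):
    """Toy loop: 'fetch' one id per step and record it in the shared list."""
    for tweet_id in ids:
        # Stands in for a network request or an error back-off sleep.
        await asyncio.sleep(0)
        # Guard (assumption): a loop that is no longer the active run
        # stops instead of recording results.
        if use_guard and state["active_run"] != run_id:
            return
        fetched.append(tweet_id)


async def stop_and_restart(ids, use_guard):
    """Start a run, then 'stop' and 'restart' while the first loop is pending."""
    fetched = []
    state = {"active_run": 1}
    first = asyncio.create_task(hydration_loop(ids, fetched, state, 1, use_guard))
    for _ in range(3):
        await asyncio.sleep(0)  # let the first loop make a little progress

    # Stop + start: the new run resumes after the ids already recorded,
    # but nothing cancels the first loop.
    state["active_run"] = 2
    remaining = ids[len(fetched):]
    second = asyncio.create_task(
        hydration_loop(remaining, fetched, state, 2, use_guard)
    )
    await asyncio.gather(first, second)
    return fetched


ids = [str(n) for n in range(10)]
unguarded = asyncio.run(stop_and_restart(ids, use_guard=False))
guarded = asyncio.run(stop_and_restart(ids, use_guard=True))
print(len(ids), len(unguarded), len(guarded))
```

Without the guard the recorded count overruns the number of ids, which matches the symptom in this issue. With the guard each id is recorded once. Whether the real bug works this way still needs to be confirmed in the code.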

Gautamshahi commented 3 years ago

Hi, did you find a solution for it?

Any lead will be a great help.