JustAnotherArchivist / snscrape

A social networking service scraper in Python
GNU General Public License v3.0

Id mismatch? #972

Closed (chmod closed this issue 1 year ago)

chmod commented 1 year ago

I am running the command snscrape --jsonl --progress --max-results 1 twitter-search "@Eikonikos_HQ filter:replies -from:Eikonikos_HQ", which currently returns the following (other fields removed for brevity):

{
  "_type": "snscrape.modules.twitter.Tweet",
  "url": "https://twitter.com/estfino/status/1669700650467246080",
  "id": 1669700650467246000,
  "conversationId": 1669700100971536400,
  "inReplyToTweetId": 1669700100971536400,
   .....
}

My understanding is that the last part of a tweet URL is the tweet ID. The IDs currently returned lead to an error page.

Edit: I am using the version from GitHub.

JustAnotherArchivist commented 1 year ago

snscrape returns the correct IDs:

$ snscrape --jsonl --progress --max-results 1 twitter-search "@Eikonikos_HQ filter:replies -from:Eikonikos_HQ"
{"_type": "snscrape.modules.twitter.Tweet", "url": "https://twitter.com/estfino/status/1669700650467246080", "date": "2023-06-16T13:37:11+00:00", "rawContent": "<snip>", "renderedContent": "<snip>", "id": 1669700650467246080, "user": { <snip>

You are likely passing the output through jq or similar software that doesn't parse large integers correctly. (jq fixed that bug some time ago, but the fix hasn't been released yet.) That's what mangles the IDs.
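
For readers who want to see the effect without jq, here is a minimal Python sketch. It assumes a parser that stores every number as a 64-bit float and prints it with the shortest digits that round-trip; parse_int=float is only a stand-in for such a parser, not jq's actual code path, and the one-line record is a trimmed-down copy of the output above.

```python
import json
from decimal import Decimal

# Trimmed-down copy of the record above (only the fields needed here).
line = '{"url": "https://twitter.com/estfino/status/1669700650467246080", "id": 1669700650467246080}'

# Python's json module keeps integers exact, so the ID survives.
exact = json.loads(line)["id"]

# Stand-in for a parser that stores every number as a double.
as_double = json.loads(line, parse_int=float)["id"]

# Printing the double with the shortest round-trip digits and padding
# with zeros reproduces the value from the original report.
printed = int(Decimal(repr(as_double)))

# The ID embedded in the URL is a string, so it is never affected.
url_id = int(json.loads(line)["url"].rsplit("/", 1)[1])

print(exact)    # 1669700650467246080
print(printed)  # 1669700650467246000
print(exact == url_id, exact > 2**53)
```

The ID is larger than 2**53, the point beyond which a double can no longer represent every integer, which is why the trailing digits are lost once the number goes through a float-based parser and printer.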

If you can't switch to a JSON parser that isn't broken, you can use --jsonl-for-buggy-int-parser to emit JSONL with additional string fields (id.str etc.) for each field holding an integer that exceeds float precision.
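
A consumer of that output could prefer the string field whenever it is present. The sketch below assumes the extra key is named exactly as described above (the field name followed by .str). The helper exact_int and the sample record are hypothetical and are not part of snscrape.

```python
import json

def exact_int(record, field):
    """Return the integer for `field`, preferring the '<field>.str' copy if present."""
    text = record.get(field + ".str")
    return int(text) if text is not None else record[field]

# Hypothetical record shaped like the description above.
sample = '{"id": 1669700650467246080, "id.str": "1669700650467246080"}'

# parse_int=float simulates a parser that cannot hold large integers exactly.
record = json.loads(sample, parse_int=float)

print(exact_int(record, "id"))  # 1669700650467246080
```

Fields without a .str companion fall back to the plain value, so the same helper works for small integers as well.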

chmod commented 1 year ago

Thank you for the reply. I'll update jq.