Mincka / DMArchiver

A tool to archive the direct messages, images and videos from your private conversations on Twitter
GNU General Public License v3.0

DMArchiver is broken #83

Open cajuncooks opened 4 years ago

cajuncooks commented 4 years ago

I'm really just kind of hoping to open a dialogue here; I have no idea if there's anything we can reasonably do to solve this issue, but I mostly just want to hear that this is actually affecting somebody else. Earlier this week, this started happening to me:

$ dmarchiver
DMArchiver 0.2.6
Running on Python 3.7.6 (default, Jan  8 2020, 19:59:22) 
[GCC 7.3.0]

[...]

Press Ctrl+C at anytime to write the current conversation and skip to the next one.
 Keep it pressed to exit the script.

Conversation ID not specified. Retrieving all the threads.
Expecting value: line 1 column 1 (char 0)

The last line there is the JSON parser failing, because the POST request doesn't return a valid response. This happens whether or not you specify a conversation ID; it seems that all of the message URLs that were in use now fail. Authentication still works, but ~nothing else related to DMs does, as far as I can tell. I tried several different changes to the headers passed into the request, but nothing produced fruitful results.
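For what it's worth, a defensive wrapper around the decode step at least makes the failure mode visible instead of dying on the bare decoder error. Just a sketch; the helper name is made up:

```python
def fetch_json(session, url):
    """GET a URL and decode the body as JSON, surfacing the raw body on
    failure instead of letting the bare decoder error
    ('Expecting value: line 1 column 1 (char 0)') bubble up."""
    response = session.get(url)
    try:
        return response.json()
    except ValueError:
        # An empty body or an HTML error page means the endpoint is gone
        # or the authentication was rejected.
        print("Non-JSON response (%s) from %s: %r"
              % (response.status_code, url, response.text[:200]))
        return None
```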

I went through some of #79 seeking an alternate solution, but the API endpoints mentioned in there no longer seem to exist, or are locked behind some kind of additional authentication, despite my API application having permissions for DM access.

$ twurl -X GET /1.1/dm/conversation/[conversation_id].json
{"errors":[{"message":"Your credentials do not allow access to this resource","code":220}]}

This is not specific to twurl, either... as I noted over in bear/python-twitter#665 (a PR which updates a deprecated DM endpoint in python-twitter), I get useless output there: an empty events array (even though, according to Twitter's own documentation, "[i]n rare cases the events array may be empty"). For example:

In [4]: api.GetDirectMessages(return_json=True)                                                                       
Out[4]: {'events': [], 'next_cursor': 'MTI5NDc2OTE5NTc1MDc3Mjc0Nw'}

This matches my experience using twurl as suggested in that documentation, too.

I'm hypothesizing that this all has something to do with the breach Twitter experienced last month and their development of the v2.0 API. The gut punch is that API access to DMs is listed under "Nesting" (the least-developed column, it seems) on the roadmap, which means we may be months from a solution if the methods used in this application are no longer viable. I'd love to contribute to a solution that doesn't involve an always-running Selenium webdriver or some other related nonsense, but I'm not sure how to approach it.

Mincka commented 4 years ago

Hi,

Indeed, it looks like this is the end of DMArchiver and its HTML parsing method. At some point it was expected that they would drop this interface and use only JavaScript and the API. I tried disabling JavaScript, and https://mobile.twitter.com/ is still accessible but not usable (only a few messages are shown in conversations).

Now, it seems there is a difference between using the official API and using it "through the browser". If we were plain API users, we would face the same limitations and it would not be possible to retrieve all the DMs of a conversation. I just tried to stupidly scroll up in a conversation with thousands of messages and I was still able to retrieve everything. That does not mean there is no limit, however.

When inspecting a GET /1.1/dm/conversation/1001168196991373314.json request, we can see this in the headers:

access-control-expose-headers: X-Rate-Limit-Limit, X-Rate-Limit-Remaining, X-Rate-Limit-Reset
x-rate-limit-limit: 900
x-rate-limit-remaining: 788
x-rate-limit-reset: 1597555966

It looks like 900 requests, at 20 messages each, per 15-minute window.

If the API is called exactly as a browser, I think we could avoid Selenium.

cajuncooks commented 4 years ago

Aha... I see those GET requests now in the inspector -- they must just not be exposed via the normal API protocols. So maybe it's just a matter of passing the right authentication/cookies/headers? I'll look into this tomorrow.

Mincka commented 4 years ago

The basic flow is the following:

  1. Authenticate on POST /sessions
  2. Get conversations on GET /1.1/dm/inbox_initial_state.json or GET /1.1/dm/inbox_timeline/trusted.json for the conversations not loaded at logon. Not sure if looping is necessary for the latter; I don't have enough conversations to tell.
  3. Get messages on GET /1.1/dm/conversation/2350610210-788777070428032928.json. The max_id=1073460340120322053 URL parameter is used to know the position in the thread.
  4. Loop based on the value of status, which can be HAS_MORE or AT_END:
    {"conversation_timeline":{"status":"HAS_MORE","min_entry_id":"1202510056522140100","max_entry_id":"1251082889214502789",
    {"conversation_timeline":{"status":"AT_END","min_entry_id":"952320302367985669","max_entry_id":"952981962960540420",
  5. Monitor the API limits with the headers
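Steps 3 and 4 above could be sketched like this in Python (the host and the JSON field names follow the requests observed in the browser's network inspector, so treat them as assumptions; the helper name is made up):

```python
def fetch_conversation(session, conversation_id):
    """Page backwards through a DM thread until status == 'AT_END',
    following the max_id / min_entry_id scheme described above."""
    url = ("https://api.twitter.com/1.1/dm/conversation/%s.json"
           % conversation_id)
    entries = []
    max_id = None
    while True:
        params = {"max_id": max_id} if max_id else {}
        timeline = session.get(url, params=params).json()["conversation_timeline"]
        entries.extend(timeline.get("entries", []))
        if timeline["status"] == "AT_END":
            break
        # Continue from just below the oldest entry seen so far.
        max_id = timeline["min_entry_id"]
    return entries
```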

The requests must carry the session cookies. This is not the hard part; requests can handle it. E.g. auth_token is the identity of the user and is set by the response of POST /sessions.

The requests must also have a Bearer that looks like this: authorization: Bearer AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzXejRCOuH5E1I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHpKLTvJu4FA65AGWWjCpTnA

It's there to authorize the browser, as a client, to access the API.

I found out that this bearer is returned in the response of GET /responsive-web/client-web/main.05e1f885.js

Somewhere in the middle of the code:

const r="ACTION_FLUSH",i="ACTION_REFRESH",o="3033300",s="Web-12",a="AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzXejRCOuH5E1I8xnZz4puTs%3D1Zv7ttfk8LF81IUq16cHpKLTvJu4FA65AGWWjCpTnA",c="14191373",l="ct0",u="x-csrf-token",d="eu_cn",p="ab_decider",h="fm",m="gt",f="responsive_web",b="_mb_tk",g="night_mode",_="rweb",y="m5",v="LiteNativeWrapper",w="/sw.js",E="_sl",O="tombstone://card",T="twid",I="TwitterAndroidLite",S=new Uint8Array([4,94,104,18,141,49,13,74,96,202,82,131,78,91,29,242,150,101,197,0,53,149,230,8,54,38,62,173,43,28,89,130,191,222,213,128,147,62,21,49,187,95,212,194,196,212,140,157,234,34,8,245,143,158,221,15,83,8,222,111,100,204,213,48,75])

I don't know how frequently this could change. From what I can find on Google, it looks like it has been an almost hardcoded value for some time now: https://www.google.com/search?client=firefox-b-d&q=AAAAAAAAAAAAAAAAAAAAANRILgAAAAAAnNwIzUejRCOuH5E6I8xnZz4puTs%253D1Zv7ttfk8LF81IUq16cHjhLTvJu4FA33AGWWjCpTnA

Anyway, this would require a complete rewrite of DMArchiver, since parsing the JSON is completely different from parsing the HTML. Maybe there are already libraries that do this kind of thing. Text should be OK, but attachments (images, videos, tweets, links) require more work.

Mincka commented 4 years ago

I confirm that you may now be unable to read your own private messages in the browser. The loader will just spin and you will have to wait for the next 15-minute window.

The response will look like this:

HTTP/1.1 429 Too Many Requests
access-control-allow-credentials: true
access-control-allow-origin: https://twitter.com
access-control-expose-headers: X-Rate-Limit-Limit, X-Rate-Limit-Remaining, X-Rate-Limit-Reset
cache-control: no-cache, no-store, max-age=0
connection: close
Content-Length: 56
content-type: application/json;charset=utf-8
date: Sun, 16 Aug 2020 08:24:44 GMT
server: tsa_o
strict-transport-security: max-age=631138519
x-connection-hash: 156a8755395be1da0e6871fdeae75079
x-rate-limit-limit: 900
x-rate-limit-remaining: 0
x-rate-limit-reset: 1597566893
x-response-time: 111

{"errors":[{"message":"Rate limit exceeded","code":88}]}
cajuncooks commented 4 years ago

I tried setting up an app on the Twitter developer site and getting a bearer token using oauthlib, but I got access denied with every OAuth2 token except the public one, no matter what permissions I set. For now, GETting that .js file and parsing it with re.findall('(AAAAAA.*?)\"',str(response.content)) will do, until Twitter changes that, too.
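For reference, a minimal sketch of that extraction (the pattern here is slightly tightened from the findall above, and the helper name is made up; the path of the main.*.js bundle changes with each deploy and would have to be discovered first):

```python
import re

def extract_bearer(js_source):
    """Pull the hardcoded bearer token out of Twitter's main.*.js bundle,
    matching the long AAAAAA... quoted string described above."""
    match = re.search(r'"(AAAAAA[A-Za-z0-9%]+)"', js_source)
    return match.group(1) if match else None
```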

Anyway, using your workflow as an outline, I was eventually able to properly set the headers and iterate through some GET calls to a conversation endpoint. We can randomly generate the x-csrf-token and corresponding ct0 value for the header's cookie field, and everything else in the headers is static or derived from either the requests session state or the request itself.
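The token generation can be as simple as this sketch (the 32-hex-character format is an assumption based on what the browser sends; the server appears to only check that the cookie and header values agree):

```python
import uuid

def make_csrf_pair():
    """Return a matching ct0 cookie and x-csrf-token header value.
    Assumption: only the agreement of the two values is validated."""
    token = uuid.uuid4().hex  # 32 random hex characters
    return {"ct0": token}, {"x-csrf-token": token}
```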

I don't mind working through some of the parser details; I think the general data delivered by the API is quite similar, it just needs to be parsed in JSON rather than through lxml.html, which should make things much simpler in the end. With some fiddling I was also able to login with 2FA enabled, so I can likely address #26 too.

Careful attention should be paid to rate limiting, for sure... the rate limit works out to 1 request per second, but as you note, the worst thing that happens from the user's perspective is that they're locked out of their DMs for 15 minutes, which isn't so bad. I definitely think this is worth pursuing, and I'm hopeful that I'll have a branch to share with you late this week or next weekend.

Mincka commented 4 years ago

We can't go through the standard OAuth2 flow for DMs, I think. It's still best to simulate a user in the browser. The downside is having to enter a login and password. A slightly better solution would be to manually extract and enter the auth_token, but that would be too complicated for end users.

I checked the x-csrf-token and it's session-based, so there is nothing to do if using sessions from requests; it will be transparent.

And sure, dropping lxml can only be good news. I think it was a poor choice for a multi-platform tool. Using the API may also prevent the random parsing breaks caused by HTML updates.

My "calculation" was completely wrong, too. You're right, that's just 20 messages per second in the end. I didn't benchmark DMArchiver, but I am quite certain that the rate was higher, at least 2 requests (20 messages) per second, for generated HTML...

To prevent lockout, I think we need to use something like this with a default rate slightly below the maximum, like 850 per 15 minutes. It should do the trick for the majority of users. Anyway, a warning will be required.
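Something like this simple spacing throttle would do it; just a sketch, with the 850-per-15-minutes default being the value suggested above:

```python
import time

class Throttle:
    """Allow at most `limit` calls per `window` seconds by spacing
    calls evenly; call wait() before each API request."""
    def __init__(self, limit=850, window=900):
        self.interval = window / limit
        self.last = None

    def wait(self):
        if self.last is not None:
            delay = self.interval - (time.monotonic() - self.last)
            if delay > 0:
                time.sleep(delay)
        self.last = time.monotonic()
```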

Prisoner416 commented 4 years ago

Well. Shoot. Accidentally locked myself out running the tool, got the account back only to have the tool broken. Thanks Twitter hackers. Realistically, how long are we looking at for a working rewrite?

cajuncooks commented 4 years ago

Sorry for the delay here; real life gets in the way sometimes. As you'll read in the commit message above, I have a very rough draft of the new interpreter. It hasn't really been extensively tested, but it does work flawlessly on a <24h-old group DM with a few hundred messages. I also tried to implement the 2FA process I've been using, but my account has been locked for several hours now... hoping to be able to try again tomorrow.

Some things that are gone:

The presentation for these, the way the interpreter used to parse them, was done by Twitter on-site, so they are fundamentally missing unless we manage to decode that API through their .js calls, too; I've replaced them with plain longform links in the text output. Listing conversations/grabbing from all conversations is probably still broken? I might be able to figure that one out now. It's much, much easier to identify the conversation ID on the modern Twitter web interface, though.

Some other miscellaneous comments: the twitter_handle option can actually be trivially re-implemented (change 'screen_name' to 'name' depending on the flag), but I have dropped it in favor of defaulting to the handle. I don't think the latest-tweet-ID check works at all; I will need to look into that. In the 'entities' dict, hashtags and user-handle mentions are also highlighted in addition to URLs, but I don't see the advantage of grabbing anything out of those unless we wanted to embed links in the output (I don't think this is a great idea, generally).

I don't know yet how well the rate limiter works; I will need to scrape a larger chat log that would take several hours to grab to see how it behaves. Right now it's set to the maximum (900 calls in 900 seconds). This would also test whether all of the tweet_types have been accounted for, and (for an old enough chat log) whether stickers still work.

Open to feedback on whatever; this is definitely not a finished product.

cajuncooks commented 4 years ago

Turns out that tweets are an embedded type, though only when the tweeter hasn't blocked you. Interesting! Also figured out what I was missing on the latest-ID thing. My branch should have more complete functionality now, as far as I can tell. It still needs more testing and cleanup, likely.

cajuncooks commented 4 years ago

One of the issues I'm encountering is that the image download links (https://ton.twitter.com/1.1/data/dm/[etc]) seem to require the API headers, and the limits on that endpoint both aren't reported in the response headers and appear to be much, much lower than the DM endpoint's. Right now I'm just wrapping it in a while response.status_code == 429: sleep(60)-and-try-again kind of block, but it would be good to find a better solution for this. The video/gif links are anonymously accessible, but the image links are not, for whatever reason.
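The stop-gap described above, as a sketch (the 60-second pause is from the comment; the helper name is made up):

```python
import time

def get_with_backoff(session, url, pause=60):
    """Retry a download whenever the image endpoint answers 429,
    since its real limits are not exposed in the response headers."""
    while True:
        response = session.get(url)
        if response.status_code != 429:
            return response
        time.sleep(pause)
```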

NuLL3rr0r commented 4 years ago

I have been facing the same issue for a few months now. Is there a working version or a work in progress that could be checked out?

cajuncooks commented 4 years ago

Not sure what kind of testing @Mincka would want to do, but I've been running my forked branch (json_overhaul) within an existing larger framework (processing of the txt output into db + web front end) successfully for over a month now. I don't believe the hack I described for the image API rate limiting is in, but I can include it tomorrow. Think it needs some care before it would see an official release, though.

NuLL3rr0r commented 4 years ago

@cajuncooks thank you. Is it publicly available on GitHub?

cajuncooks commented 4 years ago

git clone https://github.com/cajuncooks/DMArchiver --branch json_overhaul, finally updated it with the silent hack around the image API.

Mincka commented 4 years ago

Thanks for the overhaul @cajuncooks. 👍 I merged your branch as a new base for the tool.

Can you confirm that you did not implement the retrieval of all conversations? It looks like I need to specify a conversation ID, otherwise it crashes with:

Conversation ID not specified. Retrieving all the threads.
Expecting value: line 1 column 1 (char 0)

With a conversation ID specified, it worked for about 20,000 processed tweets and then stopped, without saving, with:

An error occured during the parsing of the tweets.

Twitter error details below:
Code 88: Rate limit exceeded

Stopping execution due to parsing error while retrieving the tweets.

I need to check how you handle the rate limiting. I think that's because API_LIMIT assumes the best-case scenario of 900, but if you browsed Twitter a bit before or at the same time, the counter is lower than 900 and the error is hit before the local throttling. You implemented the silent wait on 429 for images only, from what I see; the logic should be the same for all the API calls. Maybe it would be simpler to drop the local rate limiting and only use the 429 response code to wait when necessary.
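Dropping the local limit in favor of reacting to 429 could look like this sketch (the header names are the ones shown earlier in the thread; the helper name is made up):

```python
import time

def api_get(session, url, **kwargs):
    """Single choke point for every API call: on a 429, sleep until the
    x-rate-limit-reset timestamp and retry, instead of throttling locally."""
    while True:
        response = session.get(url, **kwargs)
        if response.status_code != 429:
            return response
        reset_at = int(response.headers.get("x-rate-limit-reset", "0"))
        time.sleep(max(reset_at - time.time(), 1))
```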

Finally, I have two types of parsing error that I need to investigate:

The first looks related to handling of stickers.

Unexpected error ''sticker'' for tweet '814111061578313732', raw JSON will be used for the tweet.
Traceback (most recent call last):
  File "core.py", line 591, in _process_tweets
KeyError: 'sticker'

This second one looks related to a parsing error when the name of the conversation is updated.

Unexpected error '' for tweet '813823287868424192', raw JSON will be used for the tweet.
Traceback (most recent call last):
  File "core.py", line 595, in _process_tweets

Still, it's great work that brings new hope for DMArchiver! 🎉

khawaisnaeem commented 3 years ago

Do something guys you are champ!!

Joshua7896 commented 3 years ago

Did DMArchiver die?

jeffhuang commented 3 years ago

First of all, thanks @Mincka for all your work on this, plus other contributors. We used dmarchiver for a while to export DMs for analysis for a research project. The other option for us is exporting the messages via Twitter's export tool, but that can take a few days to get the email.

Now, unfortunately, it looks like Twitter has disabled crawling/scraping by requiring JavaScript to do anything, even to get the authenticity_token. I'm not sure if there's a way around that without some major rework, so even the recent update by @cajuncooks is broken now. I'm going to look into other options, but it seems like exporting messages older than 30 days (older than the API allows) might be tricky. I wonder if anyone has tried using a headless Chrome browser for something like this.

Mincka commented 3 years ago

Hi @jeffhuang, did you try to change the user agent? https://twitter.com/magusnn/status/1339830611343679490?s=20

Maybe it could help in this case. Thanks for the heads-up anyway. Indeed, it looks like a headless browser is the next hack.

jeffhuang commented 3 years ago

That's an interesting finding @Mincka and thank you for the suggestion. I'll look into it, but have to be cautious since our project is for federally-funded research, so we might not be so comfortable with mimicking the googlebot user agent. But if I try it, I'll post an update here.

scramblr commented 3 years ago

> That's an interesting finding @Mincka and thank you for the suggestion. I'll look into it, but have to be cautious since our project is for federally-funded research, so we might not be so comfortable with mimicking the googlebot user agent. But if I try it, I'll post an update here.

Jeff, could you possibly at least do a proof of concept on this and let other users decide whether it falls within the proper boundaries of use for their programs? I mean no disrespect and fully understand where you're coming from, but this has uses for reporters and others in very specific use cases that supersede the stigma attached to "spoofing" Googlebot, which isn't illegal or even unethical, in my opinion.

Cheers, and thank you for your time.

Bebetternow22 commented 1 year ago

Has anyone found a way to fix this? I am not a coder, but I am trying to learn. I need to pull my own deleted DMs, and I think this would really help if it still works. I will need help with this, though. Anyone willing to help a lady out? Thanks!

NuLL3rr0r commented 1 year ago

I don't think so. This used to work on the old Twitter front end, which has changed a lot since then, so one would need to write a totally new scraper. It definitely cannot archive deleted DMs, though. For that, I guess you could download your Twitter archive and probably parse the XML stuff, from what I remember.

Bebetternow22 commented 1 year ago

Hi, how do you parse XML stuff? See, I am totally confused about all of this. I have downloaded my archive several times, but the deleted DMs don't come over. Are you saying I could use the archive to "parse it" and it may pull the deleted DMs? How would I do this? Do you have any sample code you could share with me? Or do you think I could hire you to pull this information for me? Sorry, but I am desperate to pull these deleted DMs. Thanks for the response.
