Jules-WinnfieldX / CyberDropDownloader

Bulk Gallery Downloader for Cyberdrop.me and Other Sites
GNU General Public License v3.0

[BUG] Simpcity added captcha to login, breaking SC scraping #536

Closed wdos3 closed 1 year ago

wdos3 commented 1 year ago

Version information: Python 3.9.12, Windows 11

As mentioned above, simpcity scraping doesn't work, and bunkrr seems like it's blocked.

Dev Note: SC is failing due to added captcha on the login page. A fix is in the works, but likely won't be quick to materialize.

Jules-WinnfieldX commented 1 year ago

So here's the gist.

  1. Simpcity has turned on captcha verification for login. Nobody should be surprised that this breaks Cyberdrop-DL. I have a way around it but it's going to take a long time to implement.
  2. Bunkrr completely removed the section of the page where I was previously able to get download links from. It's going to take a hot minute before I am able to get a new methodology working.
LuciferxR commented 1 year ago

This can't happen; otherwise we won't be able to download anything else.

the444xg commented 1 year ago

Saving the cookies to a file (or DB) would help minimize these issues in the future. It's not a fix for the captcha, but if the user already has a session cookie, it won't be affected by these changes or by eventual HTML or other simple protections. I know you said you don't want to deal with cookies, but it's not that much work, at least for the forums.
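
Something along these lines would do it (just a sketch; I'm assuming aiohttp's CookieJar here since that's what the downloader uses, and the file path, cookie value, and forum URL are illustrative):

# Sketch: persist the forum session cookie between runs.
# aiohttp's CookieJar can be written to disk with save() and restored with load().
from pathlib import Path

import aiohttp
from yarl import URL

COOKIE_FILE = Path("~/.cyberdrop/cookies.pickle").expanduser()  # illustrative path

def make_jar() -> aiohttp.CookieJar:
    jar = aiohttp.CookieJar()
    if COOKIE_FILE.exists():
        jar.load(COOKIE_FILE)  # reuse the session from a previous run
    else:
        # e.g. a value pasted from the browser after a manual captcha login
        jar.update_cookies({"xf_session": "PASTE_VALUE_HERE"},
                           response_url=URL("https://forum.example"))  # forum URL is illustrative
    return jar

async def scrape():
    jar = make_jar()
    async with aiohttp.ClientSession(cookie_jar=jar) as session:
        ...  # scrape as usual
    COOKIE_FILE.parent.mkdir(parents=True, exist_ok=True)
    jar.save(COOKIE_FILE)  # keep the (possibly refreshed) cookie for next time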

On the Bunkrr issue, it seems like they are using JS to encode the server URL. They also lowered the limits, and for some users they are showing reCAPTCHA. I don't know how it was before, but the player is using an API to get the final video URL.

Jules-WinnfieldX commented 1 year ago

Saving the cookies to a file (or DB) would help minimize these issues in the future. It's not a fix for the captcha, but if the user already has a session cookie, it won't be affected by these changes or by eventual HTML or other simple protections. I know you said you don't want to deal with cookies, but it's not that much work, at least for the forums.

On the Bunkrr issue, it seems like they are using JS to encode the server URL. They also lowered the limits, and for some users they are showing reCAPTCHA. I don't know how it was before, but the player is using an API to get the final video URL.

Yea, I wanted to avoid messing with cookies as long as I could. Another project I'm interested in has browser extraction baked in for the cookies, and I'm going to go through with that. It's going to be part of a full rewrite though.

You're correct on the API for bunkr. I've been playing around with it for the last hour this morning, and it's full of 503 statuses. I also need to figure out how this will apply to images and other filetypes, but I was just focused on video.

the444xg commented 1 year ago

You're correct on the API for bunkr. I've been playing around with it for the last hour this morning, and it's full of 503 statuses. I also need to figure out how this will apply to images and other filetypes, but I was just focused on video.

So far I've only seen videos in the errors; maybe images are not affected.

Yea, I wanted to avoid messing with cookies as long as I could. Another project I'm interested in has browser extraction baked in for the cookies, and I'm going to go through with that. It's going to be part of a full rewrite though.

A temporary fix would be to use a cookie extractor like https://raw.githubusercontent.com/instaloader/instaloader/master/docs/codesnippets/615_import_firefox_session.py

Or just a way to copy it from the browser and use it. Once you have the cookie it works fine; I believe I had one working for months.
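
The core of it is only a few lines, roughly (a sketch; the profile glob and the forum host fragment are illustrative and may need adjusting):

# Sketch: pull the xf_session cookie straight out of a Firefox profile.
import glob
import sqlite3
from pathlib import Path
from typing import Optional

def firefox_cookie(host_fragment: str, cookie_name: str = "xf_session") -> Optional[str]:
    pattern = str(Path("~/.mozilla/firefox/*/cookies.sqlite").expanduser())  # Linux default profiles
    for db_path in glob.glob(pattern):
        con = sqlite3.connect(f"file:{db_path}?immutable=1", uri=True)  # read-only, even if Firefox is open
        try:
            row = con.execute(
                "SELECT value FROM moz_cookies WHERE host LIKE ? AND name = ?",
                (f"%{host_fragment}%", cookie_name),
            ).fetchone()
            if row:
                return row[0]
        finally:
            con.close()
    return None

print(firefox_cookie("simpcity"))  # host fragment is illustrative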

After the login is fixed, something you could do to lower the number of requests normal users make is to implement retrying failed/partial files. I feel like people would scrape everything again because it's hard to go one by one. No idea if this is too much work. For me, if it were possible to add the thread ID to the directory name, it would be much easier to keep track of those partial files with a script.

github-userx commented 1 year ago

@Jules-WinnfieldX looking forward to your rewrite and new implementations.

These site changes have once again been a great reminder to always back up files locally right away and never rely on streaming/cloud-storage sites, as they can restrict content at any time or disappear completely from one day to the next.

github-userx commented 1 year ago

What I don't understand is that bunkrr becomes basically useless for bulk uploads/downloads if they don't allow scrapers that download 100 small videos in a row. It's not like we're downloading hundreds of videos manually.

StashPRs commented 1 year ago

W.r.t. bunkrr.su, it looks like we might not need cookies to accomplish this.

It looks like the API is as follows (using curl to illustrate; I've only tested on pages with single videos, such as https://bunkrr.su/v/Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm, a test upload I made that's SFW):

# getToken returns an HS256 JWT token that lasts for an hour (decode with https://jwt.io/ for expiration):
curl -X POST -H 'Content-Type: application/json' https://api-v2.bunkrr.su/getToken
{"token":"eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VySWQiOiJ1c2VyLWlkIiwiaWF0IjoxNjkxMzU3MTI2LCJleHAiOjE2OTEzNjA3MjZ9.Zg6rNqhabkmF2yqfRZ6psFVgdyek6YjiE-GQh8W9SEY"}

# One-liner to store token in $TOKEN
export TOKEN=$(curl -X POST -H 'Content-Type: application/json' https://api-v2.bunkrr.su/getToken | jq -r .token)

# Make sure to URL encode your filename, as brackets can make your shell act up
FILENAME=Dramatic-Look-%5By8Kyi0WNg40%5D-psx6dJa6.webm

# with this token and knowing the URL of the page (https://bunkrr.su/v/Dramatic-Look-%5By8Kyi0WNg40%5D-psx6dJa6.webm) we can successfully download the video
curl -L -H "Referer: https://bunkrr.su/v/$FILENAME" "https://api-v2.bunkrr.su/getFile?file_name=$FILENAME&tkn=$TOKEN" -o output.webm

Note that the final API call to https://api-v2.bunkrr.su/getFile checks the Referer and uses a redirect to go to the appropriate file, so it needs to be followed with -L in curl.
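
The same flow in Python, roughly (a sketch with requests; the two endpoints and the Referer requirement are as observed above, everything else is illustrative):

# Sketch: getToken, then getFile with the page URL as Referer, following the redirect.
from urllib.parse import quote

import requests

FILE_NAME = "Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm"
referer = "https://bunkrr.su/v/" + quote(FILE_NAME)  # URL-encode the brackets

# 1. Fetch a short-lived token
token = requests.post("https://api-v2.bunkrr.su/getToken",
                      headers={"Content-Type": "application/json"},
                      timeout=30).json()["token"]

# 2. Ask getFile for the file; it redirects to the actual media host
resp = requests.get("https://api-v2.bunkrr.su/getFile",
                    params={"file_name": FILE_NAME, "tkn": token},  # requests percent-encodes the name
                    headers={"Referer": referer},
                    allow_redirects=True,
                    stream=True,
                    timeout=60)
resp.raise_for_status()

with open("output.webm", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=1 << 16):
        fh.write(chunk)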

Now to see if pictures and albums change anything.

Edit: Looks like albums with images still download appropriately. Did only video files break?

Jules-WinnfieldX commented 1 year ago

What I don't understand is that bunkrr becomes basically useless for bulk uploads/downloads if they don't allow scrapers that download 100 small videos in a row. It's not like we're downloading hundreds of videos manually.

I won't trash talk bunkr. But it's more a game of cat and mouse. They have an issue with bot users, we want an easy way to download things. It just is what it is really.

Jules-WinnfieldX commented 1 year ago

Note that the final API call to https://api-v2.bunkrr.su/getFile checks the Referer and uses a redirect to go to the appropriate file, so it needs to be followed with -L in curl. Now to see if pictures and albums change anything.

I have essentially this in my local test branch. Only video files have changed. The issue is that the majority of the requests I make end up as HTTP 503 responses, which is an issue on Bunkrr's side. I need to play around with it more. Another issue is that using download links like this will entirely break the current methodology for download tracking/history. That, to me, is honestly a larger issue than switching everything over to the API off the cuff.

I don't really know how I'm going to go about switching that. The current methodology uses the download link's URL path as the primary key (boiling it down for simplicity). For those who don't know what that is: for this issue, the URL path would be /Jules-WinnfieldX/CyberDropDownloader/issues/536.

On an API download link for bunkr videos right now, it'd be /getFile. The path doesn't include any of the query parameters, and as a second issue beyond that, the token query parameter should never be considered static; it will always change. So with the current history format, a single downloaded video would mark every other potential download as already complete.
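
To illustrate (a quick sketch with yarl, which aiohttp already pulls in; the query values are made up):

# Sketch: why keying history on the URL path breaks with the API links.
from yarl import URL

old_link = URL("https://media-files11.bunkr.ru/Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm")
new_link = URL("https://api-v2.bunkrr.su/getFile?file_name=Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm&tkn=example-token")

print(old_link.path)  # /Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm -> unique per file
print(new_link.path)  # /getFile -> identical for every video, so the first download
                      # would mark all the others as already complete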

Jules-WinnfieldX commented 1 year ago

If it were just swapping over to the API, I'd have had it done this morning, six hours ago. But I need to wrap my head around how to make this work with the current state of the program.

After I get it working (assuming I can), I'll be focusing on the rewrite, which will involve fixing SC using cookie capture. It'll also mean I need to create a better DB for history tracking, while trying to maintain backwards compatibility with the old history, or at the very least creating some method to convert it over on the initial run.

StashPRs commented 1 year ago

The issue is that the majority of the requests I make end up as HTTP 503 responses

Ouch, I haven't played with it enough to see these yet.

On an API download link for bunkr videos right now, it'd be /getFile. The path doesn't include any of the query parameters, and as a second issue beyond that, the token query parameter should never be considered static; it will always change. So with the current history format, a single downloaded video would mark every other potential download as already complete.

The request to /getFile?file_name=Dramatic-Look-%5By8Kyi0WNg40%5D-psx6dJa6.webm&tkn=blahblah redirects to https://media-files11.bunkr.ru/Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm, so if the crawler code returns the redirected URL as opposed to the /getFile url, could this still be functional? (Assuming the hostname change to media-files11.bunkr.ru doesn't affect anything)

Jules-WinnfieldX commented 1 year ago

The issue is that the majority of the requests I make end up as HTTP 503 responses

Ouch, I haven't played with it enough to see these yet.

On an API download link for bunkr videos right now, it'd be /getFile. The path doesn't include any of the query parameters, and as a second issue beyond that, the token query parameter should never be considered static; it will always change. So with the current history format, a single downloaded video would mark every other potential download as already complete.

The request to /getFile?file_name=Dramatic-Look-%5By8Kyi0WNg40%5D-psx6dJa6.webm&tkn=blahblah redirects to https://media-files11.bunkr.ru/Dramatic-Look-[y8Kyi0WNg40]-psx6dJa6.webm, so if the crawler code returns the redirect URL as opposed to the /getFile url, could this still be functional?

Ummm. Actually. If that's true, I can make a HEAD request to the getFile API endpoint and just pull the response URL. That would fix basically everything I'm talking about. I didn't think about that, or pay enough attention to see if it redirected.
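
Something like this, roughly (a sketch with aiohttp; the function name and how the real crawler wraps it are illustrative):

# Sketch: HEAD the getFile endpoint, follow the redirect, and keep the final
# media URL so the existing path-based history keeps working.
import aiohttp

async def resolve_bunkr_video(session: aiohttp.ClientSession,
                              api_link: str, referer: str) -> str:
    async with session.head(api_link,
                            headers={"Referer": referer},
                            allow_redirects=True) as resp:
        resp.raise_for_status()
        return str(resp.url)  # e.g. https://media-files11.bunkr.ru/<file>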

I'm technically sitting on a beach right now typing from my phone though, so I'll have to do that later tonight and push out an update if I can get everything to work.

StashPRs commented 1 year ago

I'm technically sitting on a beach right now typing from my phone though

Enjoy your time :)

Jules-WinnfieldX commented 1 year ago

It could be a lot cleaner, but 4.2.174 is going up now (when the bug strikes I suppose).

Edit: this should fix bunkr

StashPRs commented 1 year ago

Edit: this should fix bunkr

Can confirm with a quick test. Looks like in a bigger batch of URLs, I'm getting 429s (Too Many Requests) on the request to getFile, so I suppose Bunkrr is rate-limiting pretty heavily here.

Edit: Yep, seeing the 503s now too. Thanks, Bunkrr :)


Enjoy your time :)

:stuck_out_tongue_winking_eye: It sounds like you didn't

Thanks for the fix!

StashPRs commented 1 year ago

I noticed after logging the HEAD requests that some were going out with the same token, and those tended to be the ones that give 429s (though I've not explored enough to tell if that's the sole cause). It turns out the getToken route will sometimes return the exact same token as a previous call to it. [Theory] This seems to occur if the getToken call is made within the same second as the previous one.
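
Quick way to check the theory (a throwaway sketch; nothing here is from the codebase):

# Sketch: call getToken twice back-to-back, then again after the next second boundary.
import time

import requests

def get_token() -> str:
    return requests.post("https://api-v2.bunkrr.su/getToken",
                         headers={"Content-Type": "application/json"},
                         timeout=30).json()["token"]

a = get_token()
b = get_token()   # usually lands within the same second as the first call
time.sleep(1.5)
c = get_token()   # a later second

print("same-second tokens identical:", a == b)
print("later token identical:", a == c)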

roweger commented 1 year ago

It could be a lot cleaner, but 4.2.174 is going up now (when the bug strikes I suppose).

Edit: this should fix bunkr

Yep, currently getting "error" from bunkr as well.

Jules-WinnfieldX commented 1 year ago

Alrighty. Back to playing around with rate limits...

Jules-WinnfieldX commented 1 year ago

and those tended to be the ones that give 429s

Same tokens should be expected for some, as there's no caching of it; caching would be pointless if your per-second theory is correct. That being said though, I just ran through scraping around 250 videos and I haven't gotten a single 429.

You may need to provide me a url list to try.

Edit: 503 statuses should be expected, and that's the only thing I've hit.

Jules-WinnfieldX commented 1 year ago

I am getting timeout errors for getToken though. Trying to deal with that.

StashPRs commented 1 year ago

You may need to provide me a url list to try.

Should note I also get some 404s (just like the 503s, where the URL works on a different attempt) :shrug: and random timeouts from getToken that I can't seem to replicate in a shell with curl. So here's the whole shebang with my URL list, haha.

I base64 encoded the URLs so I don't have naughty links clickable in GitHub: aHR0cHM6Ly9idW5rcnIuc3UvYS91ZFhwZkV4NApodHRwczovL2J1bmtyci5zdS9hLzd4dVdycnFZCmh0dHBzOi8vYnVua3JyLnN1L2EvSHZ0dHlsd1UKaHR0cHM6Ly9idW5rcnIuc3UvYS9wSzdEeXh4NQpodHRwczovL2J1bmtyci5zdS9hL3FkT0FjTmc4Cmh0dHBzOi8vYnVua3JyLnN1L2EvQkwxb2ROeU0KaHR0cHM6Ly9idW5rcnIuc3UvYS96Z1hpTzdiVgpodHRwczovL2J1bmtyci5zdS9hL3lpOHBqWlJtCmh0dHBzOi8vYnVua3JyLnN1L2EvS1djTkd2ZWMKaHR0cHM6Ly9idW5rcnIuc3UvYS81MVR0bTBoRgpodHRwczovL2J1bmtyci5zdS9hL0VtTzFYcmpLCmh0dHBzOi8vYnVua3JyLnN1L2EvenB2YzBINjg=

Excerpt from downloader.log for a 429 (while scraping album aHR0cHM6Ly9idW5rcnIuc3UvYS9CTDFvZE55TQ==):

2023-08-06 18:51:59,446:DEBUG:Bunkr_Spider:Bunkr_Spider.py:241:Error encountered while handling aHR0cHM6Ly9idW5rcnIuc3Uvdi8waDlzcjFoOWt6bWJqOXlvbWVvNnRfNzIwcC1FZjFIZzN5WC5tcDQ
Traceback (most recent call last):
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/crawlers/Bunkr_Spider.py", line 236, in get_album
    media = await self.get_video(session, referer)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/crawlers/Bunkr_Spider.py", line 134, in get_video
    headers_resp, link_resp = await session.head(link, {"Referer": str(url)})
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/client/client.py", line 39, in wrapper
    return await func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/client/client.py", line 129, in head
    async with self.client_session.head(url, headers=headers, ssl=self.client.ssl_context, allow_redirects=allow_redirects) as response:
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/aiohttp/client.py", line 643, in _request
    resp.raise_for_status()
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 429, message='Too Many Requests', url=URL('aHR0cHM6Ly9jMTAuYnVua3IucnUvMGg5c3IxaDlrem1iajl5b21lbzZ0XzcyMHAtRWYxSGczeVgubXA0')
2023-08-06 18:51:59,447:DEBUG:base_functions:base_functions.py:60:Error: aHR0cHM6Ly9idW5rcnIuc3Uvdi8waDlzcjFoOWt6bWJqOXlvbWVvNnRfNzIwcC1FZjFIZzN5WC5tcDQ=
2023-08-06 18:51:59,447:DEBUG:base_functions:base_functions.py:180:429, message='Too Many Requests', url=URL('aHR0cHM6Ly9jMTAuYnVua3IucnUvMGg5c3IxaDlrem1iajl5b21lbzZ0XzcyMHAtRWYxSGczeVgubXA0')

The 404 (album: aHR0cHM6Ly9idW5rcnIuc3UvYS83eHVXcnJxWQ==):

2023-08-06 18:51:59,442:DEBUG:Bunkr_Spider:Bunkr_Spider.py:241:Error encountered while handling aHR0cHM6Ly9idW5rcnIuc3Uvdi9UaGF0Q2F0LSgxKS1iV3QyeXdHTi5tcDQ=
Traceback (most recent call last):
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/crawlers/Bunkr_Spider.py", line 236, in get_album
    media = await self.get_video(session, referer)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/crawlers/Bunkr_Spider.py", line 134, in get_video
    headers_resp, link_resp = await session.head(link, {"Referer": str(url)})
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/client/client.py", line 39, in wrapper
    return await func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/cyberdrop_dl/client/client.py", line 129, in head
    async with self.client_session.head(url, headers=headers, ssl=self.client.ssl_context, allow_redirects=allow_redirects) as response:
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/aiohttp/client.py", line 1141, in __aenter__
    self._resp = await self._coro
                 ^^^^^^^^^^^^^^^^
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/aiohttp/client.py", line 643, in _request
    resp.raise_for_status()
  File "/home/Cyberdrop_DL.V4/venv/lib/python3.11/site-packages/aiohttp/client_reqrep.py", line 1005, in raise_for_status
    raise ClientResponseError(
aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found', url=URL('aHR0cHM6Ly9jOS5idW5rci5ydS9UaGF0Q2F0LSgxKS1iV3QyeXdHTi5tcDQ=')
2023-08-06 18:51:59,443:DEBUG:base_functions:base_functions.py:60:Error: aHR0cHM6Ly9idW5rcnIuc3Uvdi9UaGF0Q2F0LSgxKS1iV3QyeXdHTi5tcDQ=
2023-08-06 18:51:59,443:DEBUG:base_functions:base_functions.py:180:404, message='Not Found', url=URL('aHR0cHM6Ly9jOS5idW5rci5ydS9UaGF0Q2F0LSgxKS1iV3QyeXdHTi5tcDQ=')
Jules-WinnfieldX commented 1 year ago

This getToken method is making me mad. It's so inconsistent.

Jules-WinnfieldX commented 1 year ago

Pushing 4.2.175

The 404 is happening because bunkr is redirecting to a c.bunkrr domain instead of a media-files.bunkrr domain. It's a bunkrr issue essentially.

This version is going to be a lot slower for bunkrr scrapes. I've limited the getToken and HEAD requests to 1 req/s.
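
The limiter itself is nothing fancy, roughly like this (a sketch; the actual implementation in the release may differ):

# Sketch: a minimal 1 request/second gate for the getToken and HEAD calls.
import asyncio
import time

class OnePerSecond:
    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._last = 0.0

    async def wait(self) -> None:
        async with self._lock:
            delay = 1.0 - (time.monotonic() - self._last)
            if delay > 0:
                await asyncio.sleep(delay)
            self._last = time.monotonic()

# usage: await limiter.wait() right before each getToken / HEAD request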

Jules-WinnfieldX commented 1 year ago

I'm considering this fixed with 4.2.177

See here for how to get the new value requested in the config/CLI: https://github.com/Jules-WinnfieldX/CyberDropDownloader/wiki/Frequently-Asked-Questions#how-do-i-get-the-xf_session-cookie-value-for-simpcity-forum-scraping

ClonedBoy commented 1 year ago

Hi, thanks for this great program, it has saved me many hours!

I am getting "Errored scraping" messages with SimpCity when using Brave (maybe this happens with other Chromium-based browsers?). I have tried the xf_session cookie value alone, and wrapped in ' and ". I also tried different cookie values (logged out and logged back in multiple times).

When the value started with -, I got a message stating that the argument was invalid.

When the value has - or _ in the string, the program seems to not capture the correct cookie value, so the "Errored scraping" message shows.

But when using Firefox, the generated cookie value was alphanumeric. I was able to get it working with the xf_session value from Firefox but not from Brave. The times I tried a new cookie in Brave, I never got one that was purely alphanumeric. Maybe luck?