fake-name / xA-Scraper

Cloudflare protection breaks cookies #52

Closed: God-damnit-all closed this issue 5 years ago

God-damnit-all commented 5 years ago

This might not be fixable, but I've noticed that every time I run fAGet it has to go through the recaptcha process. I haven't had to constantly log back into the site in my browser, so I'm not sure why xA-Scraper needs to do a new login every time. That may just be a limitation of the automation technology though, I don't know.

herp-a-derp commented 5 years ago

This is basically expected, as the scraper generates a random user agent on every execution. CF cookies are bound to a specific UA, so every UA permutation requires a new CF cookie.

It shouldn't kill your login, though. I guess FA is binding their logins to a UA as well. That's annoying.

God-damnit-all commented 5 years ago

For what it's worth, just about everything is working right now. Just thought you should know, thanks!

herp-a-derp commented 5 years ago

I mean, Pixiv is broken and patreon login fails because of CloudShit, but good, I guess?

God-damnit-all commented 5 years ago

Wait, why was this closed? I think this issue is still relevant, I'd rather it not have to do a new recaptcha every time it runs, that costs money.

Also, regarding Pixiv, have you thought about using some of PixivUtil2 as a dependency? It's also coded in Python.

herp-a-derp commented 5 years ago

Wait, why was this closed? I think this issue is still relevant, I'd rather it not have to do a new recaptcha every time it runs, that costs money.

How often are you running the thing? The scheduler fires maybe 3 times a week, and that's what, $0.006?

Also, regarding Pixiv, have you thought about using some of PixivUtil2 as a dependency? It's also coded in Python.

That's.... really interesting, actually.

https://github.com/Nandaka/PixivUtil2/issues/506

Sadface.

God-damnit-all commented 5 years ago

How often are you running the thing? The scheduler fires maybe 3 times a week, and that's what, $0.006?

I don't use the built-in scheduler. The web interface is wildly inaccurate (more so when I was using SQLite, but it's still not terribly accurate on PostgreSQL), so I only ever start it up when I want to add users to the monitored names list.

And with normal use, yes, it does not cost that much. When I am trying to troubleshoot something, however, it adds up. (There has been a persistent issue with certain swfs, but I don't have any PRs to offer on that front yet.) Plus it makes the troubleshooting process take longer, since it has to go through the recaptcha every single time.

And I just realized, since there is no proxy support, recaptcha is also probably taking notice of the nonstandard activity from my IP address that xA-Scraper causes. That can lead to recaptcha thinking I'm not human much more often when I'm browsing.

The only reason I have a 2captcha API key in the first place was that for the longest time I could not convince recaptcha I was human and had to click on buses every time I came across one. The irony is that I never ran any sort of bots back then. I'd hate to have to start relying on the buggy recaptcha solver extension again.

God-damnit-all commented 5 years ago

Sadface.

Ah... I didn't realize it was still on Python 2; it's always made use of the embedded version of Python. Well, hopefully a conversion to Python 3 is in the works, since Python 2 EOL is coming up.

herp-a-derp commented 5 years ago

I don't use the built-in scheduler. The web interface is wildly inaccurate (more so when I was using SQLite, but it's still not terribly accurate on PostgreSQL), so I only ever start it up when I want to add users to the monitored names list.

The intended design is for the system to just be left running continuously.

And with normal use, yes, it does not cost that much. When I am trying to troubleshoot something, however, it adds up. (There has been a persistent issue with certain swfs, but I don't have any PRs to offer on that front yet.) Plus it makes the troubleshooting process take longer, since it has to go through the recaptcha every single time.

And I just realized, since there is no proxy support, recaptcha is also probably taking notice of the nonstandard activity from my IP address that xA-Scraper causes. That can lead to recaptcha thinking I'm not human much more often when I'm browsing.

Have you considered a cheap VPS? You can rent a box for ~$5 a month and run everything there.

Also, you could probably just call random.seed() with a fixed value as the first thing in your script, and you'd get a deterministic UA.

Ah... I didn't realize it was still on Python 2; it's always made use of the embedded version of Python. Well, hopefully a conversion to Python 3 is in the works, since Python 2 EOL is coming up.

I'm strongly considering forking it and doing a port. It doesn't look like that much work.

God-damnit-all commented 5 years ago

The intended design is for the system to just be left running continuously.

To be fair, it also seems like the intended design is for it to not need to discard the cookies every time.

Have you considered a cheap VPS? You can rent a box for ~$5 a month and run everything there.

I have an unlimited-bandwidth gigabit connection and more storage than you can shake a stick at, so if I wanted to go to that amount of trouble, I'd just run xA-Scraper off an Ubuntu Hyper-V installation. But considering most of the problems I've had have not been Windows-specific, I'm not even sure what brings this up - maybe the ReCaptcha stuff? But that was only really a tertiary gripe.

Also, you could probably just call random.seed() with a fixed value as the first thing in your script, and you'd get a deterministic UA.

ReCaptcha does not like it when you deprive it of its precious telemetry, but you probably don't need me to tell you that.

I'm strongly considering forking it and doing a port. It doesn't look like that much work.

That's good to hear.

fake-name commented 5 years ago

To be fair, it also seems like the intended design is for it to not need to discard the cookies every time.

It doesn't discard any cookies, it just has a non-constant user agent.

Frankly, vendors (including google) pinning shit to the UA is an antipattern, and it's fucking wrong and I refuse to support it. The whole UA itself (and the associated feature sniffing, etc.) is total garbage, and the modern web is fundamentally broken, but that's a rant for another time.

ReCaptcha does not like it when you deprive it of its precious telemetry, but you probably don't need me to tell you that.

No, I'm saying by seeding the python RNG, you'd get the same UA every time, so it shouldn't require a captcha again.

WebRequest generates its UA by using python random.<func> calls. If you make those deterministic, it'll always use the same UA.
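
To make that concrete, here's a rough sketch of the principle - not WebRequest's actual code, just a made-up UA list and a fixed seed:

```python
import random

# Seed the global RNG with any fixed value before anything else runs, so every
# random.choice()/random.randint() call that feeds into the UA string produces
# the same result on each execution.
random.seed(12345)

# Purely illustrative UA pool; WebRequest builds its strings differently.
CANDIDATE_UAS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Gecko/20100101 Firefox/68.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/76.0.3809.100",
]

def pick_user_agent():
    # Deterministic because the module-level RNG was seeded above.
    return random.choice(CANDIDATE_UAS)

print(pick_user_agent())  # same UA on every run
```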

That's good to hear.

Heh: https://github.com/Nandaka/PixivUtil2/pull/532

fake-name commented 5 years ago

I wound up using PixivPy. It's a hell of a lot less comprehensive, but it just worked.

God-damnit-all commented 5 years ago

I wound up using PixivPy. It's a hell of a lot less comprehensive, but it just worked.

That doesn't necessarily mean you can't end up using PixivUtil2 as well. It doesn't look like PixivPy is able to do anything special with the compatibility-challenged ugoira format, for instance.

fake-name commented 5 years ago

the compatibility-challenged ugoira format, for instance.

Well, it's just a zip folder full of images, so it seems pretty straightforward. I'm not converting it to another format at the moment, though.

God-damnit-all commented 5 years ago

Well, it's just a zip folder full of images, so it seems pretty straightforward. I'm not converting it to another format at the moment, though.

I wouldn't call tying essential animation data to the metadata straightforward. I think converting to apng should be a given, since it's lossless and tends to have a smaller file size than other formats when they're configured to be lossless.
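
Something along these lines would probably do it (rough sketch; it assumes the zip holds numbered frame images and that the per-frame delays in milliseconds have already been pulled from Pixiv's ugoira metadata - the filenames and delay list are made up):

```python
import io
import zipfile
from PIL import Image  # Pillow >= 7.0 can write animated PNG (APNG)

def ugoira_to_apng(zip_path, delays_ms, out_path):
    """Assemble an APNG from a ugoira zip plus a per-frame delay list."""
    with zipfile.ZipFile(zip_path) as zf:
        # Frames are numbered (000000.jpg, 000001.jpg, ...), so sorting keeps order.
        names = sorted(zf.namelist())
        frames = [Image.open(io.BytesIO(zf.read(n))).convert("RGBA") for n in names]
    frames[0].save(
        out_path,
        format="PNG",        # save_all=True on PNG produces an APNG
        save_all=True,
        append_images=frames[1:],
        duration=delays_ms,  # one delay per frame, in milliseconds
        loop=0,              # loop forever
    )

# Hypothetical usage; the delay list comes from the ugoira metadata, not the zip.
ugoira_to_apng("12345678_ugoira1920x1080.zip", [80] * 30, "12345678.apng")
```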

fake-name commented 5 years ago

I think I'm saving the metadata, and if not I can make it get saved easily.

God-damnit-all commented 5 years ago

I think I'm saving the metadata, and if not I can make it get saved easily.

Where's that metadata going to be though, the database? Hell, even if it was saved in plain text, you'd still have a hard time getting it into the right format without programming something or doing it manually.

God-damnit-all commented 5 years ago

The only flaws with the apng container that I know of are that it tends to be large (though only slightly larger than the equivalent gif) and that it's not as widely supported as other formats. But because it's lossless, making a more-compatible, lossy file from it is no sweat.

Annoyingly, PixivUtil2 gives it the .png extension instead of the .apng extension, which is valid, but makes them harder to identify.

fake-name commented 5 years ago

Where's that metadata going to be though, the database?

Yep. Basically the idea is later on I can write something that does format conversion on the already downloaded file. The goal is to not have to re-download anything for the conversion.

God-damnit-all commented 5 years ago

Any chance I could convince you to save it into a .metadata file in the zip instead? I really feel such essential information about how the file should be read should be kept, well, with the file.
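
It'd only be a couple of lines with the standard library. Rough sketch - the .metadata name and the frame-info layout are just how I'd imagine it, not something either project actually does:

```python
import json
import zipfile

def embed_metadata(zip_path, frame_info):
    """Append a .metadata JSON entry to an existing ugoira zip."""
    with zipfile.ZipFile(zip_path, "a") as zf:
        if ".metadata" not in zf.namelist():  # avoid duplicate entries on re-runs
            zf.writestr(".metadata", json.dumps(frame_info, indent=2))

# Hypothetical frame list mirroring what Pixiv's ugoira metadata provides.
embed_metadata("12345678_ugoira1920x1080.zip", {
    "frames": [
        {"file": "000000.jpg", "delay": 80},
        {"file": "000001.jpg", "delay": 80},
    ],
})
```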

fake-name commented 5 years ago

Considering how much other metadata is already only stored in the DB (item tags, description, title, etc...) I don't really see the point.

Also, laaaaazy, and I'll probably put together an apng thing in the nearish future anyways.

God-damnit-all commented 5 years ago

The problem is that file instructions shouldn't be metadata in the first place, which is part of what makes the format so infuriating to me.