AlexCSDev / PatreonDownloader

A powerful tool for downloading content posted by creators on patreon.com. Supports content hosted on Patreon itself as well as on external sites (additional plugins may be required).
MIT License

Cookie, mux.com and datadome issues #125

Closed: ReysukeBaka closed this issue 2 years ago

ReysukeBaka commented 2 years ago

Hey, getting an Error recently any idea how to fix it?

2022-05-14 13:52:50.0560 DEBUG [PatreonDownloader.Implementation.PatreonPageCrawler] Page #4: https://www.patreon.com/api/posts?include=user%2Cattachments%2Ccampaign%2Cpoll.choices%2Cpoll.current_user_responses.user%2Cpoll.current_user_responses.choice%2Cpoll.current_user_responses.poll%2Caccess_rules.tier.null%2Cimages.null%2Caudio.null&fields%5Bpost%5D=change_visibility_at%2Ccomment_count%2Ccontent%2Ccurrent_user_can_delete%2Ccurrent_user_can_view%2Ccurrent_user_has_liked%2Cembed%2Cimage%2Cis_paid%2Clike_count%2Cmin_cents_pledged_to_view%2Cpost_file%2Cpost_metadata%2Cpublished_at%2Cpatron_count%2Cpatreon_url%2Cpost_type%2Cpledge_url%2Cthumbnail_url%2Cteaser_text%2Ctitle%2Cupgrade_url%2Curl%2Cwas_posted_by_campaign_owner&fields%5Buser%5D=image_url%2Cfull_name%2Curl&fields%5Bcampaign%5D=show_audio_post_download_links%2Cavatar_photo_url%2Cearnings_visibility%2Cis_nsfw%2Cis_monthly%2Cname%2Curl&fields%5Baccess_rule%5D=access_rule_type%2Camount_cents&fields%5Bmedia%5D=id%2Cimage_urls%2Cdownload_url%2Cmetadata%2Cfile_name&sort=-published_at&filter%5Bis_draft%5D=false&filter%5Bcontains_exclusive_posts%5D=true&json-api-use-default-includes=false&json-api-version=1.0&filter%5Bcampaign_id%5D=3133042&page%5Bcursor%5D=01SUSjbQm6uGXMGHMnHbaLxrQ_
2022-05-14 13:52:50.3300 FATAL [PatreonDownloader.App.Program] Fatal error, application will be closed: UniversalDownloaderPlatform.Common.Exceptions.DownloadException: Error status code returned: BadRequest
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadStringInternal(String url, Int32 retry, Int32 retryTooManyRequests) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 323
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 288
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 55
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 73
   at PatreonDownloader.Implementation.PatreonPageCrawler.Crawl(ICrawlTargetInfo crawlTargetInfo, String downloadDirectory) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonPageCrawler.cs:line 84
   at UniversalDownloaderPlatform.Engine.UniversalDownloader.Download(String url, String downloadDirectory, IUniversalDownloaderPlatformSettings settings) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.Engine\UniversalDownloader.cs:line 198
   at PatreonDownloader.App.Program.RunPatreonDownloader(CommandLineOptions commandLineOptions) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 143
   at PatreonDownloader.App.Program.Main(String[] args) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 69

vincinuge commented 2 years ago

Yeah, I've been getting the same thing, but only on certain artists. Not sure if the fact that the artist has polls affects it in any way, as the only artists that have worked don't have polls on the page.

ReysukeBaka commented 2 years ago

Yeah, it works on some, but I even have an artist without polls where it just won't work.

Also, a bunch of files are failing to get an ID.

Spyridion commented 2 years ago

I've noticed this happening as well. I think I'm seeing it for all artists so far.

AlexCSDev commented 2 years ago

Please try installing Cloudflare WARP. Make sure "1.1.1.1 with WARP" mode is enabled.

vincinuge commented 2 years ago

Hmm, that still doesn't seem to solve the problem.

vincinuge commented 2 years ago

For the affected artists, it finds all the posts, receives errors when encountering polls, and then gives me a fatal error.

2022-05-22 11:53:17.8149 FATAL Fatal error, application will be closed: UniversalDownloaderPlatform.Common.Exceptions.DownloadException: Error status code returned: BadRequest

AlexCSDev commented 2 years ago

I've made a test build which will write error details into the "debug" folder. Please send the contents of this folder to alexcsdev@protonmail.com or post them here.

https://mega.nz/file/TgkHwKCR#ZXN4Qw30tjIHqaCa0PxxnQeSVdAGDb7jC3XPUqDQedY

vincinuge commented 2 years ago

Contents of debug folder:

{"errors":[{"code":3,"code_name":"ParameterInvalid","detail":"Invalid parameter for 'page[cursor]': Invalid or expired cursor.","id":"5e6ab4af-cefb-5c3b-be8f-69d9cb642b50","source":{"parameter":"page[cursor]"},"status":"400","title":"Invalid value for parameter 'page[cursor]'."}]}
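For reference, the debug output above is a JSON:API-style error object. A minimal Python sketch (illustrative only, not PatreonDownloader code) of pulling out the relevant fields:

```python
import json

# The error body posted above, as returned by Patreon's JSON:API-style endpoint.
body = """{"errors":[{"code":3,"code_name":"ParameterInvalid",
"detail":"Invalid parameter for 'page[cursor]': Invalid or expired cursor.",
"id":"5e6ab4af-cefb-5c3b-be8f-69d9cb642b50",
"source":{"parameter":"page[cursor]"},"status":"400",
"title":"Invalid value for parameter 'page[cursor]'."}]}"""

for err in json.loads(body).get("errors", []):
    # Surface the machine-readable name, the offending parameter, and the detail text.
    print(f"{err['code_name']} on {err['source']['parameter']}: {err['detail']}")
```

So the 400 is not a transport problem: Patreon rejects the page[cursor] value itself.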

AlexCSDev commented 2 years ago

Curious error. Run the app with the --verbose option and upload the latest log file from the logs folder to https://pastebin.com/. It will contain the creator name, so if you don't want to post it publicly you can send the file to my email (alexcsdev@protonmail.com) instead.

vincinuge commented 2 years ago

Log file has been sent to your email.

AlexCSDev commented 2 years ago

Ok, the "ERROR [PatreonDownloader.Implementation.PatreonPageCrawler] Verification for XXXXXX: Unknown type for 'included': poll" message can be ignored; it is not related to the issue you guys are having.

I honestly don't know what is going on here. Every single user who sent me their logs has the same issue with the page cursor being invalid or expired, but I haven't experienced this issue myself even on the same creator as vincinuge used.

Maybe this is some kind of internet provider issue? Can you guys share which ISP you are using?

vincinuge commented 2 years ago

Verizon

vincinuge commented 2 years ago

What ISP are you using?

AlexCSDev commented 2 years ago

My current place of living makes my ISP information useless for 99% of the people who are using this app.

ReysukeBaka commented 2 years ago

Telekom in Germany

duracell commented 2 years ago

Same problem here with Telekom and different DNS services (even 1.1.1.1).

SubbyDew commented 2 years ago

Aussie Broadband in Australia, also using 1.1.1.1

Spyridion commented 2 years ago

Hi, I tried using your test build but hit a different sort of error. Before this, I noticed the program would cycle between killing the Chrome processes and saying it was opening the browser for the captcha, but nothing would actually open.


 ---> System.ComponentModel.Win32Exception (299): Only part of a ReadProcessMemory or WriteProcessMemory request was completed.
   at System.Diagnostics.NtProcessManager.EnumProcessModulesUntilSuccess(SafeProcessHandle processHandle, IntPtr[] modules, Int32 size, Int32& needed)
   at System.Diagnostics.NtProcessManager.GetModules(Int32 processId, Boolean firstModuleOnly)
   at System.Diagnostics.NtProcessManager.GetFirstModule(Int32 processId)
   at System.Diagnostics.Process.get_MainModule()
   at PatreonDownloader.PuppeteerEngine.PuppeteerEngine.<>c.<KillChromeIfRunning>b__10_0(Process x) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.PuppeteerEngine\PuppeteerEngine.cs:line 75
   at System.Linq.Enumerable.WhereArrayIterator`1.ToArray()
   at PatreonDownloader.PuppeteerEngine.PuppeteerEngine.KillChromeIfRunning() in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.PuppeteerEngine\PuppeteerEngine.cs:line 74
   at PatreonDownloader.PuppeteerEngine.PuppeteerEngine.Initialize(Uri remoteBrowserAddress, Boolean headless) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.PuppeteerEngine\PuppeteerEngine.cs:line 63
   at PatreonDownloader.PuppeteerEngine.PuppeteerEngine..ctor(Boolean headless) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.PuppeteerEngine\PuppeteerEngine.cs:line 48
   at PatreonDownloader.PuppeteerEngine.PuppeteerCaptchaSolver..ctor() in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.PuppeteerEngine\PuppeteerCaptchaSolver.cs:line 24
   at PatreonDownloader.Implementation.PatreonWebDownloader.SolveCaptchaAndUpdateCookies(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 82
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 63
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 66
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 66
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 66
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 66
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 66
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 66
   at PatreonDownloader.Implementation.PatreonCrawlTargetInfoRetriever.GetCampaignId(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonCrawlTargetInfoRetriever.cs:line 36
   --- End of inner exception stack trace ---
   at PatreonDownloader.Implementation.PatreonCrawlTargetInfoRetriever.GetCampaignId(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonCrawlTargetInfoRetriever.cs:line 49
   at PatreonDownloader.Implementation.PatreonCrawlTargetInfoRetriever.RetrieveCrawlTargetInfo(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonCrawlTargetInfoRetriever.cs:line 24
   at UniversalDownloaderPlatform.Engine.UniversalDownloader.Download(String url, String downloadDirectory, IUniversalDownloaderPlatformSettings settings) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.Engine\UniversalDownloader.cs:line 176
   at PatreonDownloader.App.Program.RunPatreonDownloader(CommandLineOptions commandLineOptions) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 143
   at PatreonDownloader.App.Program.Main(String[] args) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 69

AlexCSDev commented 2 years ago

Please try version 0.10.3.0. I have improved browser mimicking in that version.

@Spyridion "ReadProcessMemory or WriteProcessMemory" issue is being tracked here https://github.com/AlexCSDev/PatreonDownloader/issues/123

SubbyDew commented 2 years ago

Just tried with 0.10.3.0 and got this FATAL error partway through the crawl:

2022-06-01 20:52:31.8821 FATAL Fatal error, application will be closed: UniversalDownloaderPlatform.Common.Exceptions.DownloadException: Error status code returned: BadRequest
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadStringInternal(String url, Int32 retry, Int32 retryTooManyRequests) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 333
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 292
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 55
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 73
   at PatreonDownloader.Implementation.PatreonPageCrawler.Crawl(ICrawlTargetInfo crawlTargetInfo, String downloadDirectory) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonPageCrawler.cs:line 84
   at UniversalDownloaderPlatform.Engine.UniversalDownloader.Download(String url, String downloadDirectory, IUniversalDownloaderPlatformSettings settings) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.Engine\UniversalDownloader.cs:line 198
   at PatreonDownloader.App.Program.RunPatreonDownloader(CommandLineOptions commandLineOptions) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 143
   at PatreonDownloader.App.Program.Main(String[] args) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 69

Spyridion commented 2 years ago

Hi @AlexCSDev, sorry, I didn't realize that was part of that issue. I have tried the newly released version, though, and I got the same error as SubbyDew above me. I am using WARP too.

ReysukeBaka commented 2 years ago

It's working fine for me now; the new version pretty much fixed it. I'm getting a "Can view post" error even though I have access, but that's already covered in another thread.

AlexCSDev commented 2 years ago

Everyone who is still having this issue: please try removing the chromedata directory and try again. Make sure you are running the latest version.

clocklear commented 2 years ago

I'm running the latest version. I tried removing the chromedata directory, still the same error:

2022-06-02 21:05:34.3767 FATAL Fatal error, application will be closed: UniversalDownloaderPlatform.Common.Exceptions.DownloadException: Error status code returned: BadRequest
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadStringInternal(String url, Int32 retry, Int32 retryTooManyRequests) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 333
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 292
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 55
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 73
   at PatreonDownloader.Implementation.PatreonPageCrawler.Crawl(ICrawlTargetInfo crawlTargetInfo, String downloadDirectory) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonPageCrawler.cs:line 84
   at UniversalDownloaderPlatform.Engine.UniversalDownloader.Download(String url, String downloadDirectory, IUniversalDownloaderPlatformSettings settings) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.Engine\UniversalDownloader.cs:line 198
   at PatreonDownloader.App.Program.RunPatreonDownloader(CommandLineOptions commandLineOptions) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 143
   at PatreonDownloader.App.Program.Main(String[] args) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 69

Perhaps there is a single post type on the artist feed that the downloader doesn't like? Is it possible to add a command flag to skip problematic items instead of failing completely?

AlexCSDev commented 2 years ago

The issue is more complicated than that. The app relies on Patreon itself to tell it how to access the next page of posts. For some reason, for some users, the returned URL is not valid. It's impossible to continue going through the creator's posts after that happens.

The issue here is that I don't know why that happens and I can't reproduce it on my side.
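The crawl loop AlexCSDev describes can be sketched as follows (Python, illustrative only; fetch_page and the page payloads are stand-ins for the real Patreon API calls):

```python
# Minimal sketch of cursor-driven pagination; the page data is fabricated
# purely to illustrate why a rejected cursor kills the whole crawl.
PAGES = {
    None:      {"posts": ["post1", "post2"], "next_cursor": "cursorA"},
    "cursorA": {"posts": ["post3"], "next_cursor": None},
}

def fetch_page(cursor):
    # A real implementation would request
    # https://www.patreon.com/api/posts?...&page[cursor]=<cursor> here.
    # If the server answers 400 "Invalid or expired cursor", there is no
    # other way to reach the remaining pages, so the crawl must abort.
    if cursor not in PAGES:
        raise RuntimeError("Invalid or expired cursor")
    return PAGES[cursor]

def crawl():
    posts, cursor = [], None
    while True:
        page = fetch_page(cursor)
        posts.extend(page["posts"])
        cursor = page["next_cursor"]   # the server dictates the next step
        if cursor is None:             # no more pages
            return posts

print(crawl())
```

The key point is that the client never computes a cursor itself; it can only replay what the server handed back, which is why the error is unrecoverable client-side.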

SubbyDew commented 2 years ago

I'm also still getting the same error after deleting chromedata on the latest version:

2022-06-03 14:13:30.3387 FATAL Fatal error, application will be closed: UniversalDownloaderPlatform.Common.Exceptions.DownloadException: Error status code returned: BadRequest
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadStringInternal(String url, Int32 retry, Int32 retryTooManyRequests) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 333
   at UniversalDownloaderPlatform.DefaultImplementations.WebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.DefaultImplementations\WebDownloader.cs:line 292
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 55
   at PatreonDownloader.Implementation.PatreonWebDownloader.DownloadString(String url) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonWebDownloader.cs:line 73
   at PatreonDownloader.Implementation.PatreonPageCrawler.Crawl(ICrawlTargetInfo crawlTargetInfo, String downloadDirectory) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.Implementation\PatreonPageCrawler.cs:line 84
   at UniversalDownloaderPlatform.Engine.UniversalDownloader.Download(String url, String downloadDirectory, IUniversalDownloaderPlatformSettings settings) in F:\Sources\BigProjects\PatreonDownloader\submodules\UniversalDownloaderPlatform\UniversalDownloaderPlatform.Engine\UniversalDownloader.cs:line 198
   at PatreonDownloader.App.Program.RunPatreonDownloader(CommandLineOptions commandLineOptions) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 143
   at PatreonDownloader.App.Program.Main(String[] args) in F:\Sources\BigProjects\PatreonDownloader\PatreonDownloader.App\Program.cs:line 69

clocklear commented 2 years ago

The issue is more complicated than that. The app relies on patreon itself to tell it how to access the next page with the posts. For some reason for some users the returned url is not valid. It's impossible to continue going through creator's posts after that happens.

The issue here is that I don't know why that happens and I can't reproduce it on my side.

Got it. Gonna attempt some local debugging with my artist feed (assuming I can get the project built/running). Will report back.

clocklear commented 2 years ago

It seems that the URL returned by the Patreon API is valid; navigating to it in my normal browser returns a 200 with a proper JSON response, but the WebDownloader DownloadStringInternal function gets a 400 from that same URL. I wonder if there is a request header that the browser impersonation implementation is missing...

AlexCSDev commented 2 years ago

I don't think that's a header issue, but who knows... I would really appreciate it if you could try to figure this out; no matter what I do, I can't replicate this behavior.

clocklear commented 2 years ago

It's not a header issue, per se -- you are correct. However, in attempting to debug, I seem to have gotten myself on CloudFlare's naughty list -- I can't auth any more and keep getting redirected to do captcha checks over and over again.

What I will say is that I was doing debugging with Fiddler comparing the requests from PatreonDownloader to the requests from my browser and the only (meaningful) differences were in the cookie. I think it was the case that I was missing a session_id in the PatreonDownloader request, but I had a working one on the browser side.

AlexCSDev commented 2 years ago

Aha. Yep, it seems like this is the issue. My tests show that the cursor ID is probably tied to the current user's session, and the missing cookie means the user is not logged in.

I wonder why that cookie goes missing... Is it not getting transferred from the browser at all? Or is something overriding it? Really interesting issue...

clocklear commented 2 years ago

I wonder if the site is sending a Set-Cookie header, in response to a previous request, that is missing the session_id, which the shared _httpClient happily obliges. That would cause the next request to fail because there is no session associated with it.

I wonder if something about the activity leading up to that point is tripping some sort of Cloudflare protection that is destroying the session.
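One mechanism consistent with this hypothesis is a response that expires the cookie, which a standards-compliant cookie container honors by dropping it. A Python illustration using the standard library's cookie jar (not PatreonDownloader code; the responses are synthesized for the demo):

```python
import io
import email.message
import http.cookiejar
import urllib.request
import urllib.response

# Hypothetical demo: a server can silently remove a session cookie from a
# shared cookie container by answering with a Set-Cookie that expires it.

def fake_response(set_cookie):
    """Build a minimal response object carrying one Set-Cookie header."""
    headers = email.message.Message()
    headers["Set-Cookie"] = set_cookie
    return urllib.response.addinfourl(io.BytesIO(b""), headers, "https://patreon.com/")

jar = http.cookiejar.CookieJar()
req = urllib.request.Request("https://patreon.com/")

# First response hands out a session cookie...
jar.extract_cookies(fake_response("session_id=abc123; Domain=.patreon.com; Path=/"), req)
print([c.name for c in jar])

# ...a later response expires it (Max-Age=0), and the jar drops it,
# so the next request goes out without any session attached.
jar.extract_cookies(fake_response("session_id=; Domain=.patreon.com; Path=/; Max-Age=0"), req)
print([c.name for c in jar])
```

Whether Patreon/Cloudflare actually sends such an expiring Set-Cookie here is exactly what the logging experiments below try to establish.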

clocklear commented 2 years ago

Success! I was able to scrape the contents of my target feed, though I cannot say for sure which of my changes caused it to work.

EDIT: at least I thought I was good. Seems like subsequent calls to download resources have resulted in lots of 403s; the m3u8 URLs result in Forbidden.

AlexCSDev commented 2 years ago

We can test the cookie thing quite easily I think.

In DownloadStringInternal, before using (var request = new HttpRequestMessage(HttpMethod.Get, url) {Version = _httpVersion}), add the following code; it will dump all of the cookies present before each request is made. This will let us see whether the first request receives the session_id cookie at all, and whether the request prior to the failing one removes the cookies:

                _logger.Info($"New request: {url}");
                CookieCollection cookies = _httpClientHandler.CookieContainer.GetCookies(new Uri("https://patreon.com"));

                foreach (Cookie cookie in cookies)
                {
                    _logger.Info($"Cookie: {cookie.Name}={cookie.Value}");
                }

clocklear commented 2 years ago

Well, now I'm puzzled. I removed my sleep and added the cookie logging bits above. The process completed with no issues, and I see session_id present on every call. I still have to use my pre-auth'ed remote browser session because I can't get through the captcha in the headless browser. I'll keep trying, possibly later this weekend.

AlexCSDev commented 2 years ago

Alright, thank you!

vincinuge commented 2 years ago

Thanks for your hard work, guys. If this bug gets fixed, it's going to save me hours of effort in my archiving process.

clocklear commented 2 years ago

Doing some more testing this morning. Sufficient time has passed such that I am no longer on the CloudFlare naughty list and the headless browser is allowing me to authenticate properly. However, I've just witnessed the 'session ID is missing' problem. Fortunately, I've got the logs to prove it:

Here's the request to the first page of API results: [screenshot]

And here's the request to load the second page: [screenshot]

Note there is no session ID in the second result.

I think there may be a race condition: given the async/await nature of the program, the cookie may be getting clobbered by a new request starting before the first one has completely finished. I modified the code to save the session_id cookie whenever it is found, and I now inject it manually into the CookieContainer on subsequent requests if it is missing; this makes the API scrapes work as expected!

However, I'm now getting Forbidden when the downloader attempts to download the content that may be embedded in the post. For me, this is a link to an external m3u8 file hosted on stream.mux.com; I wonder if the same thing is happening there (where the downloader should be using session_id but it is getting lost somehow).

I will keep testing...
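The save-and-reinject workaround described above might look like this in outline (Python, illustrative only; the actual fix would live in the C# WebDownloader and its CookieContainer):

```python
import http.cookiejar

# Illustrative sketch of clocklear's workaround: remember session_id when it
# is first seen, and put it back into the cookie container whenever it has
# gone missing before the next request goes out.
_saved_session = None

def remember_or_restore(jar):
    global _saved_session
    current = next((c for c in jar if c.name == "session_id"), None)
    if current is not None:
        _saved_session = current           # cache the live cookie
    elif _saved_session is not None:
        jar.set_cookie(_saved_session)     # re-inject it before the request

jar = http.cookiejar.CookieJar()
jar.set_cookie(http.cookiejar.Cookie(
    0, "session_id", "abc123", None, False, ".patreon.com", True, True,
    "/", True, False, None, False, None, None, {}))

remember_or_restore(jar)                          # caches the cookie
jar.clear(".patreon.com", "/", "session_id")      # simulate it getting clobbered
remember_or_restore(jar)                          # restores it
print([c.name for c in jar])
```

This papers over the symptom rather than explaining why the cookie disappears, which is why the thread keeps digging.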

AlexCSDev commented 2 years ago

Hm... I will take a look a bit later at why this might be happening. Page parsing should be single-threaded, so there shouldn't be any kind of race condition there.

As for the stream.mux.com thing: embedded audio/video content is not something I have tested or explicitly implemented, so I'm not sure what is needed for it to work properly. If I were to guess, they might be checking whether the origin and/or referer is set to patreon.com; assuming this is functionality built into Patreon, of course.
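If one wanted to test the Referer/Origin guess, the probe could look like this (Python, illustrative only; the m3u8 URL is a placeholder, and there is no claim these headers actually satisfy mux.com):

```python
import urllib.request

# Hypothetical probe: attach the Referer/Origin a browser on patreon.com
# would send. The URL below is a placeholder, not a real asset.
url = "https://stream.mux.com/example/playlist.m3u8"
req = urllib.request.Request(url, headers={
    "Referer": "https://www.patreon.com/",
    "Origin": "https://www.patreon.com",
})
# Inspect what would be sent (no network request is made here):
print(req.get_header("Referer"), req.get_header("Origin"))
```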

AlexCSDev commented 2 years ago

@clocklear Can I also ask you to do one more thing? I want to see the headers returned by the server in those requests.

  1. Disable your cookie fix
  2. In DownloadStringInternal before if (!responseMessage.IsSuccessStatusCode) add the following code:
                        _logger.Info("Response headers:");
                        foreach (string headerString in responseMessage.Headers
                                     .ToString()
                                     .Split(new[] { "\r\n" }, StringSplitOptions.RemoveEmptyEntries))
                        {
                            _logger.Info(headerString);
                        }

clocklear commented 2 years ago

@AlexCSDev here's the response headers on the last request that succeeded. The very next request logs out the CookieContainer contents before the request is sent and it does not contain session_id. Doesn't look like anything in this response is explicitly requesting the removal of the session though.

Response headers:
Date: Mon, 06 Jun 2022 17:59:10 GMT
cf-ray: 71730a4c4b95588a-IAD
Cache-Control: private
Set-Cookie: datadome=u6IHnZmWnRY9kJPvq8Jr_JVAnChBV.r8fOD4.J7ro-8upz8pM4idxVzPalZiRdOgWYs5GwVHdA5ymH4RXZ3-kyuRifwJUnlkrzwByqelU3YqNRBO~Z-rqUXIvJtJ3jO; Max-Age=31536000; Domain=.patreon.com; Path=/; Secure; SameSite=Lax
Strict-Transport-Security: max-age=2592000
cf-cache-status: DYNAMIC
accept-ch: Sec-CH-UA,Sec-CH-UA-Mobile,Sec-CH-UA-Platform,Sec-CH-UA-Arch,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-Device-Memory
Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
X-Content-Type-Options: nosniff
x-datadome: protected
x-patreon-uuid: b8ccf179-2956-5d95-8f5e-9b0326198c18
x-protected-by: Sqreen
report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=oswafuHAvIRh1laOzbXCwA06NwIV1Qyr73cl%2Bkxzt9EjF8THYhtXN1nKPDrWn76iYvzAdVXv3Tz6ICZ72FUKWGaZX8uMax6nRl6Z8GQMdQb8B4OeuZJSLa838xOs%2BomsOQ%3D%3D"}],"group":"cf-nel","max_age":604800}
nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
Server: cloudflare

Was thinking more about the stream.mux.com thing and yeah, I agree with you -- feels like they've started doing some new referrer checking (because I've used this project for a couple months fine up to this point). I'll see if I can fake the referrer to see if it helps.

AlexCSDev commented 2 years ago

Hm... I wonder if the cookie expires or is forced to be removed by its settings...

Let's try dumping complete cookie data, change the cookie printing code to this:

                _logger.Info($"New request: {url}");
                CookieCollection cookies = _httpClientHandler.CookieContainer.GetCookies(new Uri("https://patreon.com"));

                foreach (Cookie cookie in cookies)
                {
                    _logger.Info("===========");
                    _logger.Info("Cookie:");
                    _logger.Info($"{cookie.Name} = {cookie.Value}");
                    _logger.Info($"Domain: {cookie.Domain}");
                    _logger.Info($"Path: {cookie.Path}");
                    _logger.Info($"Port: {cookie.Port}");
                    _logger.Info($"Secure: {cookie.Secure}");

                    _logger.Info($"When issued: {cookie.TimeStamp}");
                    _logger.Info($"Expires: {cookie.Expires} (expired? {cookie.Expired})");
                    _logger.Info($"Don't save: {cookie.Discard}");
                    _logger.Info($"Comment: {cookie.Comment}");
                    _logger.Info($"Uri for comments: {cookie.CommentUri}");
                    _logger.Info($"Version: RFC {(cookie.Version == 1 ? 2109 : 2965)}");

                    _logger.Info($"String: {cookie}");
                }
clocklear commented 2 years ago

Cookie:
session_id = XXXXXXXXXX
Domain: .patreon.com
Path: /
Port: 
Secure: False
When issued: 6/6/2022 2:17:25 PM
Expires: 1/1/0001 12:00:00 AM (expired? False)
Don't save: False
Comment: 
Uri for comments: 
Version: RFC 2965
String: session_id=XXXXXXXXXX

Value obfuscated for obvious reasons.

FWIW, its settings don't appear to be any different than any other cookie's settings.

AlexCSDev commented 2 years ago

Ok, one last thing to try:

Replace the function with the version below. During normal operation it should print output like this:

2022-06-06 21:25:54.6217 INFO New request: xxxxxxxx
2022-06-06 21:25:54.6217 INFO Session ID exists before the request
2022-06-06 21:25:55.6046 INFO Session ID exists after requesting headers
2022-06-06 21:25:55.6046 INFO Session ID exists after requesting content
        private async Task<string> DownloadStringInternal(string url, int retry = 0, int retryTooManyRequests = 0)
        {
            if (retry > 0)
            {
                if (retry >= _maxRetries)
                {
                    throw new DownloadException("Retries limit reached");
                }

                await Task.Delay(retry * _retryMultiplier * 1000);
            }

            if (retryTooManyRequests > 0)
                await Task.Delay(retryTooManyRequests * _retryMultiplier * 1000);

            try
            {
                _logger.Info($"New request: {url}");
                CookieCollection cookies = _httpClientHandler.CookieContainer.GetCookies(new Uri("https://patreon.com"));

                foreach (Cookie cookie in cookies)
                {
                    if(cookie.Name == "session_id")
                        _logger.Info("Session ID exists before the request");
                }
                using (var request = new HttpRequestMessage(HttpMethod.Get, url) {Version = _httpVersion})
                {
                    //Add some additional headers to better mimic a real browser
                    request.Headers.Add("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8");
                    request.Headers.Add("Accept-Language", "en-US,en;q=0.5");
                    request.Headers.Add("Cache-Control", "no-cache");
                    request.Headers.Add("DNT", "1");

                    using (HttpResponseMessage responseMessage =
                        await _httpClient.SendAsync(request, HttpCompletionOption.ResponseHeadersRead))
                    {
                        cookies = _httpClientHandler.CookieContainer.GetCookies(new Uri("https://patreon.com"));

                        foreach (Cookie cookie in cookies)
                        {
                            if (cookie.Name == "session_id")
                                _logger.Info("Session ID exists after requesting headers");
                        }

                        if (!responseMessage.IsSuccessStatusCode)
                        {
                            switch (responseMessage.StatusCode)
                            {
                                case HttpStatusCode.BadRequest:
                                case HttpStatusCode.Unauthorized:
                                case HttpStatusCode.Forbidden:
                                case HttpStatusCode.NotFound:
                                case HttpStatusCode.MethodNotAllowed:
                                case HttpStatusCode.Gone:
                                    throw new DownloadException($"Error status code returned: {responseMessage.StatusCode}", 
                                        responseMessage.StatusCode, await responseMessage.Content.ReadAsStringAsync());
                                case HttpStatusCode.Moved:
                                case HttpStatusCode.Found:
                                case HttpStatusCode.SeeOther:
                                case HttpStatusCode.TemporaryRedirect:
                                case HttpStatusCode.PermanentRedirect:
                                    string newLocation = responseMessage.Headers.Location.ToString();
                                    _logger.Debug(
                                        $"{url} has been moved to: {newLocation}, retrying using new url");
                                    return await DownloadStringInternal(newLocation);
                                case HttpStatusCode.TooManyRequests:
                                    retryTooManyRequests++;
                                    _logger.Debug(
                                        $"Too many requests for {url}, waiting for {retryTooManyRequests * _retryMultiplier} seconds...");
                                    return await DownloadStringInternal(url, 0, retryTooManyRequests);
                            }

                            retry++;

                            _logger.Debug(
                                $"{url} returned status code {responseMessage.StatusCode}, retrying in {retry * _retryMultiplier} seconds ({_maxRetries - retry} retries left)...");
                            return await DownloadStringInternal(url, retry);
                        }

                        string retVal = await responseMessage.Content.ReadAsStringAsync();

                        cookies = _httpClientHandler.CookieContainer.GetCookies(new Uri("https://patreon.com"));

                        foreach (Cookie cookie in cookies)
                        {
                            if (cookie.Name == "session_id")
                                _logger.Info("Session ID exists after requesting content");
                        }

                        return retVal;
                    }
                }
            }
            catch (TaskCanceledException ex)
            {
                retry++;
                _logger.Debug(ex,
                    $"Encountered timeout error while trying to access {url}, retrying in {retry * _retryMultiplier} seconds ({_maxRetries - retry} retries left)... The error is: {ex}");
                return await DownloadStringInternal(url, retry);
            }
            catch (IOException ex)
            {
                retry++;
                _logger.Debug(ex,
                    $"Encountered IO error while trying to access {url}, retrying in {retry * _retryMultiplier} seconds ({_maxRetries - retry} retries left)... The error is: {ex}");
                return await DownloadStringInternal(url, retry);
            }
            catch (SocketException ex)
            {
                retry++;
                _logger.Debug(ex,
                    $"Encountered connection error while trying to access {url}, retrying in {retry * _retryMultiplier} seconds ({_maxRetries - retry} retries left)... The error is: {ex}");
                return await DownloadStringInternal(url, retry);
            }
            catch (DownloadException)
            {
                throw;
            }
            catch (Exception ex)
            {
                throw new DownloadException($"Unable to retrieve data from {url}: {ex.Message}", ex);
            }
        }
clocklear commented 2 years ago

2022-06-06 15:18:43.2251 DEBUG [PatreonDownloader.Implementation.PatreonPageCrawler] Page #2: xxxxx
2022-06-06 15:18:44.2321 INFO [UniversalDownloaderPlatform.DefaultImplementations.WebDownloader] New request: xxxxx
2022-06-06 15:18:44.2321 INFO [UniversalDownloaderPlatform.DefaultImplementations.WebDownloader] Session ID exists before the request
2022-06-06 15:18:45.0792 INFO [UniversalDownloaderPlatform.DefaultImplementations.WebDownloader] Session ID exists after requesting headers
2022-06-06 15:18:45.0792 INFO [UniversalDownloaderPlatform.DefaultImplementations.WebDownloader] Session ID exists after requesting content
...
...
2022-06-06 15:18:45.7023 DEBUG [PatreonDownloader.Implementation.PatreonPageCrawler] Page #3: xxxxx
2022-06-06 15:18:46.7088 INFO [UniversalDownloaderPlatform.DefaultImplementations.WebDownloader] New request: xxxxx
2022-06-06 15:18:46.7088 INFO [UniversalDownloaderPlatform.DefaultImplementations.WebDownloader] Session ID exists before the request
2022-06-06 15:18:47.5318 DEBUG [PatreonDownloader.Implementation.PatreonPageCrawler] Parsing data entries...

Something is destroying the session_id value. I don't think it's happening based on the server response: if the session were being destroyed server-side, I shouldn't be able to patch the existing value back in and have my request complete (which totally works).

TheQwerty commented 2 years ago

I'm just spit-balling here, so excuse me if this is way off the mark...

I'm seeing cookies set on both .patreon.com and the www.patreon.com subdomain, and I'm not familiar with the CookieContainer class, so are we sure this call is actually retrieving all of them? cookies = _httpClientHandler.CookieContainer.GetCookies(new Uri("https://patreon.com"));

There also appears to be a longstanding .NET bug about retrieving cookies for .domain entries that sounds like it might be relevant.
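To sanity-check that theory, a standalone snippet can exercise CookieContainer's domain matching directly. This is hypothetical test code, not from the repository, and the counts it prints may differ between .NET versions (which is exactly the bug being discussed):

```csharp
using System;
using System.Net;

class CookieDomainCheck
{
    static void Main()
    {
        var container = new CookieContainer();
        // A cookie scoped to the parent domain, like Patreon's session_id.
        container.Add(new Cookie("session_id", "dummy-value", "/", ".patreon.com"));

        // GetCookies matches stored cookies against the host of the given Uri.
        CookieCollection apex = container.GetCookies(new Uri("https://patreon.com"));
        CookieCollection www = container.GetCookies(new Uri("https://www.patreon.com"));

        // Both counts should be 1 if ".domain" matching works as expected.
        Console.WriteLine($"apex: {apex.Count}, www: {www.Count}");
    }
}
```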

clocklear commented 2 years ago

In my logging, I see cookies for the domain .patreon.com as well as patreon.com. These are found by explicitly requesting the cookies for https://patreon.com, so your theory is a good guess @TheQwerty, but I don't (currently) think that's what is going on here.

AlexCSDev commented 2 years ago

I don't think any relevant cookies are being set on the www subdomain. The session_id cookie is set on .patreon.com, and the API itself lives on the root domain as well, so all cookies retrieved for the root domain should apply to API requests.

To me this sounds like some kind of bug somewhere in .NET's HTTP request pipeline.
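If it is a pipeline bug, one way to narrow it down would be to dump the container's full contents between requests instead of querying for a single Uri. A hypothetical diagnostic helper, not from the repository (note that GetAllCookies() is only available on .NET 6 and later):

```csharp
using System;
using System.Net;

static class CookieDiagnostics
{
    // Hypothetical helper: logs every cookie in the container regardless of
    // its domain, which sidesteps GetCookies(Uri) domain matching entirely.
    public static void DumpAllCookies(CookieContainer container)
    {
        foreach (Cookie cookie in container.GetAllCookies())
        {
            Console.WriteLine(
                $"{cookie.Domain}{cookie.Path} {cookie.Name}={cookie.Value} expired={cookie.Expired}");
        }
    }
}
```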

AlexCSDev commented 2 years ago

I don't see yet where it could be removing the cookie. Relevant runtime code: https://github.com/dotnet/runtime/blob/6a984143635bde23e728abaaccbde52f5ea8fa3e/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http2Stream.cs#L889

https://github.com/dotnet/runtime/blob/v5.0.17/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/CookieHelper.cs

AlexCSDev commented 2 years ago

I'm thinking of implementing my own cookie management class instead of relying on HttpClient's built-in cookie management. But that will take some time to implement properly.
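A minimal sketch of what such manual cookie handling could look like, as an illustration under assumed requirements rather than the eventual implementation: disable HttpClient's built-in handling with UseCookies = false, attach the Cookie header on each request, and capture Set-Cookie responses ourselves.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical sketch: with UseCookies = false, HttpClient neither sends
// nor stores cookies, so this class tracks them itself.
class ManualCookieClient
{
    private readonly HttpClient _client =
        new HttpClient(new HttpClientHandler { UseCookies = false });
    private readonly Dictionary<string, string> _cookies = new();

    public async Task<string> GetStringAsync(string url)
    {
        using var request = new HttpRequestMessage(HttpMethod.Get, url);
        if (_cookies.Count > 0)
            request.Headers.Add("Cookie",
                string.Join("; ", _cookies.Select(kv => $"{kv.Key}={kv.Value}")));

        using var response = await _client.SendAsync(request);

        // Capture Set-Cookie headers. A real implementation would also honor
        // the Domain, Path, Expires and Secure attributes (RFC 6265).
        if (response.Headers.TryGetValues("Set-Cookie", out var setCookies))
        {
            foreach (string header in setCookies)
            {
                string[] pair = header.Split(';')[0].Split('=', 2);
                if (pair.Length == 2)
                    _cookies[pair[0]] = pair[1];
            }
        }

        return await response.Content.ReadAsStringAsync();
    }
}
```

The upside of this approach is full visibility into when cookies change; the downside is reimplementing domain/path scoping and expiry that CookieContainer normally provides.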