carsonyl / pypac

Find and use proxy auto-config (PAC) files with Python and Requests.
https://pypac.readthedocs.io
Apache License 2.0
71 stars 18 forks source link

Doesn't handle redirects #67

Closed ZzZombo closed 1 year ago

ZzZombo commented 1 year ago

pypac doesn't handle redirects in PAC server response. Additionally, it should consider both the original URL's response and the target URL's Content-Type for validation. I've discovered certain servers serve them with different content types, so if at least one response's Content-Type satisfies the check, it should be considered valid.

carsonyl commented 1 year ago

Can you clarify what you mean about Content-Type handling? Redirect handling is likely a duplicate of #37, but would be nice to have for sure.

ZzZombo commented 1 year ago

Your code only needs to have resp = sess.get(pac_url, timeout=timeout,allow_redirects = True) in api.py added in the code in order to allow redirects, but given there is now a response for each redirect and the final URL, you need to check each Content-Type for validity, if at least one is acceptable, then the content type check must succeed for the whole redirection chain. And no, it's not a duplicate of the linked issue, it's for the PAC file itself, not for purely a HTTP redirecting handling.

carsonyl commented 1 year ago

Requests defaults allow_redirects to True if not specified, so my understanding is that https://github.com/carsonyl/pypac/blob/master/pypac/api.py#L157 should already follow redirects, and the content type check only happens at the end.

ZzZombo commented 1 year ago

I do not know what's the deal with that, I did check the docs myself beforehand, but adding that absolutely does solve the redirect issue for me, and I'm not really motivated to find out whether that's a case of outdated docs or a bug or something else. And I also do absolutely see now how the content type check is done only once, only for the final response. That's the part that needs to be changed for robustness sake if redirects are allowed.

carsonyl commented 1 year ago

Explicitly setting allow_redirects to true is easy. But checking each redirection for PAC Content-Type match does not sound correct: why would the server return PAC headers and body but also HTTP 301/302 and Location header? Besides, that likely needs allow_redirects to be false, with pypac manually handling redirects in order to implement such logic.

ZzZombo commented 1 year ago

Because it's how just how it is in the wild. You seem to misunderstand the issue at hand, The Antizapret PAC file is first served with the appropriate for a PAC file content type in the redirect response (no body), then the actual, final URL is served with a content type of a regular JavaScript file and the body of the PAC file.

ZzZombo commented 1 year ago

I also think it merits checking only the first and the last responses in the chain for content type.

carsonyl commented 1 year ago

I've never heard of the behaviour you describe. It sounds very strange. I think it's best if you keep using your workaround (I assume you have one). I don't want to make potentially major code changes for this, especially because I don't have a repro to test against.

ZzZombo commented 1 year ago

OK, I'm gonna apologize for false report. The application didn't get a (valid, expected) response from the remote server. By the time I altered the function, the issue was resolved somehow, I think it's because we started to supply an user agent string that wasn't banned per the PAC script policy and made sure to abide to the rate limits.