carsonyl / pypac

Find and use proxy auto-config (PAC) files with Python and Requests.
https://pypac.readthedocs.io
Apache License 2.0
71 stars 18 forks

consider that TLDExtract cache will be used by default when evaluating WPAD #74

Closed KarelChanivecky closed 11 months ago

KarelChanivecky commented 1 year ago

TLDExtract performs an HTTP request to fetch the list of valid top-level domains. This is fine, except that this library will mostly be run within a domain where a proxy is enforced.

Enterprises that enforce proxying are also likely to block requests that are not dispatched per policy. It therefore doesn't make sense to dispatch an HTTP request in order to evaluate the proxy, as the proxy URL is more likely than not needed to dispatch that very request.

For this reason, the library should treat as its base case that TLDExtract will not be able to dispatch this request and will fall back to the file with the TLDs.

Hence this library should:

Some of the options used by TLDExtract are not bad at all; however, they cannot accommodate all cases. One example is a PyInstaller executable, where the package directory is simply the location of the executable. When the application is distributed at scale, it may choose to keep a specific directory for such uses. Applying one of the recommendations would therefore be meaningful, and would spare the implementer a deep dive into foreign code.

KarelChanivecky commented 1 year ago

I have resolved the issue in my project by overriding the TLDExtract DiskCache class with my own cache implementation. My implementation reads from a user-defined file and keeps the data in memory. It also allows for setting the file contents. I also added a feature that checks, at a fixed time interval, whether the file contents have changed.

In my project's case, this will allow us to specify the filename where we want the TLD data stored. The stored data can now be accessed with a privileged-writer, unprivileged-reader pattern, as the readers will never try to write to the file. Fetching and maintaining the data can then be outsourced to a different module.
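
For anyone wanting to try the same idea, here is a minimal sketch of the pattern described above. It is not the code from my project and it does not implement tldextract's DiskCache interface: the names `FileBackedCache` and `publish`, the JSON file format, and the `check_interval` parameter are all hypothetical and chosen only for illustration. The reader keeps the data in memory and re-checks the file only after the interval has elapsed; the writer replaces the file atomically so a reader never sees a partial write.

```python
import json
import os
import tempfile
import time
from pathlib import Path


class FileBackedCache:
    """Hypothetical read-only cache backed by one user-chosen JSON file."""

    def __init__(self, path, check_interval=300.0, clock=time.monotonic):
        self._path = Path(path)
        self._check_interval = check_interval  # seconds between file checks
        self._clock = clock                    # injectable for testing
        self._data = None
        self._signature = None                 # (mtime, inode, size) of last load
        self._last_check = None

    def get(self):
        """Return the cached data, re-reading the file only if it changed."""
        now = self._clock()
        if self._last_check is None or now - self._last_check >= self._check_interval:
            self._last_check = now
            st = self._path.stat()
            signature = (st.st_mtime_ns, st.st_ino, st.st_size)
            if signature != self._signature:
                self._data = json.loads(self._path.read_text(encoding="utf-8"))
                self._signature = signature
        return self._data


def publish(path, data):
    """Writer side: write to a temp file, then atomically swap it into place."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle)
        # mkstemp creates the file owner-only; open it up for other readers.
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, path)
    except Exception:
        os.unlink(tmp_name)
        raise
```

Only the privileged module would call `publish`; every other process constructs a `FileBackedCache` on the same path and calls `get()`, so the readers never need write access to the file or its directory.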

I will try to contribute to tldextract with such an implementation, then I can try to contribute here based off of that.

carsonyl commented 1 year ago

Does this relate to https://github.com/carsonyl/pypac/pull/64? The intent was for PyPAC to use tldextract such that it never goes online and never updates its TLD list.

KarelChanivecky commented 1 year ago

We got a crash in our project because we package everything with PyInstaller, and the resulting directory structure is different. That is why we had to come up with a hack. I only now realize that no request was actually dispatched, although the stack trace suggests one was. There may be other scenarios where this comes into play.