Kaggle / kagglehub

Python library to access Kaggle resources
Apache License 2.0

Support for Proxy Configuration in kagglehub #126

Closed Lostam closed 1 month ago

Lostam commented 2 months ago

I noticed that the KaggleAPI library supports the use of a proxy by specifying it in the kaggle.json file. Since kagglehub also reads the configuration from the same file, I would like to check if there are any plans to support proxy configuration in kagglehub as well.

If so, I am willing to implement this feature myself. If the kagglehub maintainers are open to this addition, I would appreciate it if someone could review my pull request once it is submitted.

neshdev commented 2 months ago

All of the network requests are made via the requests library. You can configure the proxy using environment variables, as shown here: https://requests.readthedocs.io/en/latest/user/advanced/#proxies For example:

export HTTP_PROXY="http://10.10.1.10:3128"
export HTTPS_PROXY="http://10.10.1.10:1080"
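
The same variables can also be set from inside a Python process before any download is started. The snippet below is a minimal sketch, not part of kagglehub: it reuses the example addresses above and only checks, via the standard library's `urllib.request.getproxies()`, that the proxies are visible in the environment. It assumes requests picks up its environment proxies through the same mechanism, which is my understanding of the linked documentation rather than something stated in this thread.

```python
import os
import urllib.request

# Example proxy addresses taken from the comment above; replace with real ones.
os.environ["HTTP_PROXY"] = "http://10.10.1.10:3128"
os.environ["HTTPS_PROXY"] = "http://10.10.1.10:1080"

# getproxies() returns a dict mapping URL scheme to the proxy found in the
# environment. Lowercase variants (http_proxy, https_proxy) take precedence
# if they are also set.
proxies = urllib.request.getproxies()
print(proxies.get("http"))
print(proxies.get("https"))
```

Any kagglehub call made later in the same process should then see these variables, assuming nothing else overrides them.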

Lostam commented 2 months ago

Thanks for the quick response. I understand that setting the proxy via environment variables is a possible solution. However, I have a few concerns with this approach:

Setting the HTTP_PROXY and HTTPS_PROXY environment variables globally affects every library and application running in the same environment. These variables are very general, so they can change the behavior of any other tool that also reads them for network requests.

It can be confusing when using both kagglehub and the kaggle CLI, as the latter reads its proxy from either kaggle.json or the KAGGLE_PROXY environment variable instead of HTTP_PROXY/HTTPS_PROXY.
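
One way to limit the first concern without any change to kagglehub is to set the variables only around the calls that need them. The following is a rough sketch of that idea; `scoped_proxy` is a hypothetical helper written for this illustration, not a kagglehub or requests API. Note that the variables are still process-wide while the block is active, so this is not safe if other threads make network requests at the same time.

```python
import os
from contextlib import contextmanager


@contextmanager
def scoped_proxy(http_proxy, https_proxy):
    """Temporarily set HTTP_PROXY/HTTPS_PROXY, restoring the old values on exit.

    Hypothetical helper for illustration only.
    """
    new_values = {"HTTP_PROXY": http_proxy, "HTTPS_PROXY": https_proxy}
    # Remember what was there before (None means the variable was unset).
    saved = {name: os.environ.get(name) for name in new_values}
    os.environ.update(new_values)
    try:
        yield
    finally:
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


with scoped_proxy("http://10.10.1.10:3128", "http://10.10.1.10:1080"):
    # A kagglehub download call would go here; it is left out so the
    # sketch runs without the library installed.
    print(os.environ["HTTPS_PROXY"])
```

Outside the `with` block the variables return to whatever they were before, so other code in the same process is affected only while the block is running.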

Lostam commented 2 months ago

@neshdev I did find a KAGGLE_API_ENDPOINT environment variable in the codebase, which can act as the proxy in the KaggleAPI code. As far as I understand, it is currently used only for testing and is not documented anywhere. Is it safe to use in production?

neshdev commented 2 months ago

@Lostam - it's unlikely that the HTTP_PROXY/HTTPS_PROXY variables will be used by only a subset of processes / resources. It won't make much sense to use the proxy only for the kagglehub library. Most likely, if a proxy is needed, it's needed for other things as well.

The kagglehub library is separate from the kaggle CLI. There are no dependencies between the two libraries. We will not be following the conventions set by the CLI library. The purpose of this library is different from that of the CLI.

Any use of KAGGLE_API_ENDPOINT other than the way it is currently used is untested for production loads. What do you plan on doing with the variable?