gdcc / pyDataverse

Python module for Dataverse Software (dataverse.org).
http://pydataverse.readthedocs.io/
MIT License
64 stars 45 forks source link

API Key and User-Agent should preferrably be passed as headers #192

Closed shoeffner closed 1 week ago

shoeffner commented 1 month ago

Issue

The API key and User-Agent are currently passed as GET parameters, ideally they should be passed as headers. User-Agents are commonly passed as headers, and for the key this is actually preferred. It also is slightly more secure, as the dataverse API will return the full request URL on error, possibly exposing API keys in logs:

{"status":"ERROR","code":404,"message":"API endpoint does not exist on this server. Please check your code for typos, or consult our API guide at http://guides.dataverse.org.","requestUrl":"http://localhost:8080/api/v1/access/datafile/:persistentId/?persistentId=23&User-Agent=pydataverse&key=<the API token>","requestMethod":"GET"}

(taken from datalad/datalad-dataverse#310, where I found out about this; it also contains a little bit of context)

I am not sure if this request is a bug or feature, so I decided to open a normal issue. The pyDataverse version I use is 0.3.3, and the code doing this is in pyDataverse/api.py:

https://github.com/gdcc/pyDataverse/blob/3b6fe063dfc73d6fa3aa674cc02313504c40204f/pyDataverse/api.py#L103-L138

The same is done for post_request, etc.

A possible change would be to use headers and pass them as headers:

        headers = {}
        headers["User-Agent"] = "pydataverse"
        if self.api_token:
            headers["X-Dataverse-key"] = str(self.api_token)

        if self.client is None:
            return self._sync_request(
                method=httpx.get,
                url=url,
                headers=headers,
            )

Additionally, I noticed that neither the params nor the auth attributes are used.

For the params a fix could be, especially after moving the key and user-agent to the headers:

self._sync_request(
    method=httpx.get,
    url=url,
    headers=headers,
    params=params or {},  # maybe httpx.get even handles params=None, in that case they could simply be passed through
)

For the auth parameter it's a bit more difficult, since it currently does not affect the behavior although it should. One possible way would be to update the if statement before adding the token to the header:

if auth and self.api_token:

However, it might make sense to change the default value to True in this case, to keep the behavior stable. However, I am not 100% sure what is the best way to handle this. Of course, one option would also be to remove the auth argument completely.

Lastly, the docstring of the function mentions the requests.Response, while it's now a httpx.Response. While they do have a similar interface, they have some minor differences; e.g., requests has an .ok property and uses iter_content, while httpx does not have .ok and uses iter_bytes.

I think I could try to "fix" this and draft a PR.

JR-1991 commented 1 month ago

Thanks for pointing these issues out, @shoeffner! I completely agree – the API Key should be included in the headers, and it looks like auth and params are either redundant or unused. Here’s what I’m thinking:

The custom class seems more future-proof, especially if Dataverse introduces new authentication methods. The direct fix is quicker, but I’m leaning towards the custom class and keeping auth.

As for the params argument, we haven’t used it yet since query parameters are added to the URL string directly. However, I think we’re missing out on some great httpx features that could make our code more readable and stable. This would mean refactoring most endpoints, and since we’ve already discussed a refactor in #189, I suggest we include this in the api.py update.

shoeffner commented 1 month ago

Reworking auth to be non-boolean but instead httpx.Auth or None sounds like a great idea! In fact, in this case it might be as simple as adding

A callable, accepting a request and returning an authenticated request instance, so there might not be a class required after all to get the mechanism going.

I can't say much about the refactoring in #189, as I am not familiar with the code base in general. But I agree that using a dict for params instead of manually building up the URLs is a good idea.

Let me know if I can help with anything.

JR-1991 commented 1 month ago

Great, that sounds like a plan! Would you be willing to open a PR for this issue? I’d be happy to help out with it.

By the way, we talked about this in today’s PyDataverse Working Group meeting with @pdurbin and @donsizemore. You can watch the recording here and check out our upcoming meetings on py.gdcc.io if you'd like to join next time. For discussions outside of these meetings, we also have a Zulip channel for everything related to Python and Dataverse.

shoeffner commented 1 month ago

Thanks @JR-1991, I'll take a chance implementing that. I checked the recording (here's my gist from the relevant parts of the recording, for those in a rush and cannot access the notes or do not want to go through the recording):

There was not much talk about the User-Agent or params, but as @JR-1991 already mentioned that the params might become part of #189, I'll leave that part out and focus on the first step for the authentication, that is using the X-Dataverse-key header.

pdurbin commented 1 month ago

focus on the first step for the authentication, that is using the X-Dataverse-key header

Sounds perfect. Thanks, @shoeffner! ❤️

You're welcome to test against https://demo.dataverse.org

Or you can run Dataverse in Docker, as mentioned in the README.

shoeffner commented 1 month ago

Thanks, I'll focus on the docker-setup for the moment, and once it's done and green, I'll go for the live instance :)