ORCID / python-orcid

Python wrapper around ORCID API
BSD 3-Clause "New" or "Revised" License
91 stars 26 forks source link

Save the actual http response in an instance attribute #26

Closed turtle321 closed 6 years ago

turtle321 commented 6 years ago

We have some use cases in our team (Inspire at CERN) where we need some introspection on the actual response (the response coming from the Python HTTP requests library). Introspections about HTTP status codes, request and response content and headers

turtle321 commented 6 years ago

I am about to write tests for this change, but I am wondering how to run the tests: I defined all the required OS vars (CLIENT_KEY, CLIENT_SECRET, USER_PASSWORD, USER_EMAIL, REDIRECT_URL, USER_ORCID) but tests like test_read_record_member_xml fail because there is no work for the user I have just created. Any suggestion? @MSusik

turtle321 commented 6 years ago

Ok I wrote some basic tests

MSusik commented 6 years ago

In case of statuses other than 200, you can catch RequestException as shown in the README and retrieve the request or response.

In case of a 200 code, what kind of information would be required?

turtle321 commented 6 years ago

Yes, in case of exception the client code has access to the requests' exception and thus the actual response and request. I would like to have that access also in case of success.

My current use case is to know the HTTP status code in the response. You suggest to assume that it is a 200 when no exception is raised. But that's not enough: request.raise_for_status() raises for 4xx and 5xx status codes. There is a list of 2xx and 3xx status codes that might result in a successful response, not just 200.

Other use cases I might be interested when debugging is having access to the response headers (for caching, authorization, access control, last modified, ...) or even the request headers/content.

And finally, the code change is just about replacing response to self.response, so having the response part of the public interface of that object. It's a small change that enables the features I mentioned.

MSusik commented 6 years ago

OK, I found a bit of time to review. First of all, thanks for the PR, I agree that such functionality is useful.

However, I see some things to improve. First of all, the _authenticate method performs two requests to ORCID. With this PR, the developer is unable to get the first response. Thus, the feature is not implemented for the whole library.

Moreover, I'm not sure if by default this feature should be turned on. If the response from the server is heavy, the software that uses this wrapper will not be able to release the memory unless another request is performed or the wrapper is released. Usually, API wrappers do not implement this feature at all! (see https://github.com/mobolic/facebook-sdk/blob/5d6e78fe26e2e10ebbf06628c20a9f9eb3b50062/facebook/__init__.py#L259 and https://github.com/bear/python-twitter/blob/b2af803a8354cadcc646fe6507fb744d828ff693/twitter/api.py#L309)

It turns out it is not that easy to make this optional seamlessly. I see two solutions that I demonstrate on a toy class

1).

class WithSavedResponse:

    def __init__(self, save_last_response=False):
        self.save_last_response = save_last_response
        self._last_response = None

    def set_last_response(self, value):
        if self.save_last_response:
            self._last_response = value            

    def get_last_response(self):
        if self.save_last_response:
            return self._last_response
        else:
            raise AttributeError("The save last response option was not chosen. Last response not available")

    last_response = property(get_last_response, set_last_response)

    def function(self):

        response = 'content returned by api'
        self.last_response = response

        return response[1]

w = WithSavedResponse()
try:
    w.last_response
except AttributeError:
    print("caught")
w = WithSavedResponse(save_last_response=True)
print(w.last_response is None)
res = w.function()
print(w.last_response == 'content returned by api')
print(res == 'o')

2).

import inspect

def remove_last_response(fn):

    def wrapper_func(*args, **kwargs):
        retval = func(*args, **kwargs)
        del self.last_response
        return retval

    return wrapper_func

class WithSavedResponse:

    def __init__(self, save_last_response=False):
        if not save_last_response:
            for attr in inspect.getmembers(self, inspect.isroutine):
                function_name = attr[0]
                if (
                        function_name[:2] != '__'
                        and function_name[:-2] != '__'
                        and callable(getattr(self, function_name))
                ):
                    # No builtins allowed here
                    setattr(self, attr[0], remove_last_response(getattr(self, function_name)))
        else:
            self.last_response = None

    def function(self):

        self.last_response = 'content returned by api'
        return self.last_response[1]

w = WithSavedResponse()
try:
    w.last_response
except AttributeError:
    print("caught")
w = WithSavedResponse(save_last_response=True)
print(w.last_response is None)
res = w.function()
print(w.last_response == 'content returned by api')
print(res == 'o')

One of them requires additional line for every response, the other requires using inspect library that is not considered nice. I would say that option 1). is better, but there might be something smarter that would work better.

turtle321 commented 6 years ago

I see your point about the design of popular API that do not offer this feature. Anyway, in case of a library that wraps another library, I prefer the design that gives access to the original library. This gives more flexibility. Or, to return not just the response content, but also the status code and the headers. That's the case of Elasticsearch pyhton client: https://github.com/elastic/elasticsearch-py/blob/a03504cbb2af34490e31c7b912165028a1535381/elasticsearch/connection/http_requests.py#L94 This option though whould change the signature of our responses, thus I prefer the first one I mentioned.

About the memory usage: the only case when a response is heavy, is when response.content is large. Let's look at a typical use case:

api = orcid.PublicAPI(institution_key, institution_secret, sandbox=True)
...
summary1 = api.read_record_public('0000-0001-1111-1111', 'activities', mytoken)
...
summary2 = api.read_record_public('0000-0001-1111-2222', 'record', mytoken)

In this case the value returned by the call (summary1) is large (and under the hood pointing to the same memory space as PublicAPI::self.response). The garbage collector will take care of freeing up memory when api is not used anymore, the same for summary1. If instead api is re-used then PublicAPI::self.response is overridden and the memory freed up. So the memory footprint is not significatly affected by this code change (given also the type of data we are dealing with: text).

I'd love to have this feature optional, but given that this would mean to complicate the code a lot, I'd prefer to have this as default behavior.

And about the _authenticate() method, this method performs 2 operations: asking for the authorization code and logging in using the end-user's credentials. I suggest to either remove self.response in this method entirely or to store only the authorization code call's response.

(the issue arises from the fact that the end-user login should not happen here: the end-user should be redirected to ORCID's website where he should login and be redirected back to our website with an authorization code that later on we exchange for an access token - using get_token_from_authorization_code. We are an untrusted application and we should not know the end-user's ORCID password. For trusted application instead, OAuth supports the password flow - where the third-party knows the end'user's password - but this is not available in ORCID's API because they have no trusted apps)

MSusik commented 6 years ago

I see your point about the design of popular API that do not offer this feature.

Still I believe it should be available. I really like the solution of elasticsearch wrapper. Note that the response is not saved - the relevant fields from request are returned to the user. Unfortunately we can't use this solution due to backcompability.

So the memory footprint is not significatly affected by this code change

I would argue against it. Obviously as we deal with text, the footprint is not measured in GBs. But the library should not be an obstacle if sb wanted to write a software where downloading data from ORCID is a part of a large pipeline spawned in hundreds of processes. Then you can write code like:

class Pipeline:
    def __init__(self, arg1, arg2):
        self.api = orcid.memberAPI(arg1, arg2)

    def run_pipeline(self):
        small_info = self.get_data_from_orcid()
        self.run_time_consuming_code(small_info)

    def get_data_from_orcid(self):
        small_info = orcid.something()['somekey']
        return small_info
    ...

I am not sure what is the behaviour of the API now, but back when version 1.2 of API was used, in case of downtimes, an HTML page with significant amount of JS code in it was returned as content for any request.

I'd love to have this feature optnional, but given that this would mean to complicate the code a lot, I'd prefer to have this as default behavior.

Sometimes API object is released just after it was used (and probably it's the case of INSPIRE) but in many cases it is not the natural way to write the code. Actually, one does not expect from a wrapper to store any data that is not required for the communication. If the feature is optional and documented in README, developer who uses the library will be aware of this behaviour. Otherwise he stumbles upon a weird behaviour he might need/want to debug.

I would say in this case it's better to add few more lines in the code in order to provide the user with expected behaviour. I came up with one more solution (don't know if it's nicer than previous):

from contextlib import contextmanager

@contextmanager
def catch_response(instance):
    instance.save_last_response = True
    yield
    instance.save_last_response = False
    instance._last_response = None

class WithSavedResponse:

    def __init__(self):
        self.save_last_response = False
        self._last_response = None

    def set_last_response(self, value):
        if self.save_last_response:
            self._last_response = value            

    def get_last_response(self):
        if self.save_last_response:
            return self._last_response
        else:
            raise AttributeError("Last response is only availabe within")

    last_response = property(get_last_response, set_last_response)

    def function(self):

        response = 'content returned by api'
        self.last_response = response

        return response[1]

w = WithSavedResponse()
try:
    w.last_response
except AttributeError:
    print("caught")

with catch_response(w):
    res = w.function()
    print(w.last_response == 'content returned by api')
print(res == 'o')
print(w._last_response == None)
try:
    w.last_response
except AttributeError:
    print("caught")

And about the _authenticate() method, this method performs 2 operations: asking for the authorization code and logging in using the end-user's credentials. I suggest to either remove self.response in this method entirely or to store only the authorization code call's response.

OK, I fully agree. Probably the most consistent behaviour is to store only the request from get_token_from_authorization_code that is called in _authenticate. I assume that's what you meant by removing self.response entirely.

As the discussion is getting longer: Is there a chance that what you want to achieve within INSPIRE would be a good functionality for this library? Maybe the response does not have to be saved if the handling can be done within this codebase.

turtle321 commented 6 years ago

All right. I made this feature optional, I found a simple way with a kwarg in the client constructor. So this client would have the feature disabled:

orcid.PublicAPI(institution_key, institution_secret)

while this one would have it enabled:

orcid.PublicAPI(institution_key, institution_secret, do_store_raw_response=True)
MSusik commented 6 years ago

@turtle321 1.0.3 was released. It includes this PR.