dpriskorn / OpenAlexAPI

Python library for the OpenAlex HTTP API
GNU General Public License v3.0
22 stars 11 forks source link

Expanding API #10

Open kochbj opened 2 years ago

kochbj commented 2 years ago

Hi Dennis,

Great API. I made some small changes to make the API more useful to me as a user:

  1. Generalized "get works" methods to function with any type of entity
  2. Added set_email convenience function because I wasn't sure how to do that
  3. Fleshed out author fields and modified Ids and Years a little bit so they work with both authors and works.

    Dunno if this is how you planned to extend things so no worries if this doesn't work for you!

Best, Bernie

dpriskorn commented 2 years ago

I took a peek at the code and really like the abstraction and introduction of a OpenAlexBaseDataType. Thanks 🤩 I have a lot on my plate right now, so it will probably take me a week or two to take a closer look.

kochbj commented 2 years ago

Sounds good. Personally, I would've dropped the two works methods and just made one list of entities method, but wanted to make it backwards compatible with current methods.

Also, no worries. I will probably make a few more minor changes in next week or so as I continue to use before you get around to looking at it.

dpriskorn commented 2 years ago

I'm a fan of https://en.wikipedia.org/wiki/KISS_principle?wprov=sfti1 Let's remove redundant methods and keep general ones.

dpriskorn commented 2 years ago

Which ones do you suggest we remove?

kochbj commented 2 years ago

Well if it were me, I would probably remove all entity-specific methods and just keep get list of entities, not make it a private method, with an argument for entity type. I didn't write it this way because I didn't want to break what you already had.

I would keep the cited,reference, and related works methods--I think they are great. For myself I needed to add the below, which should work with any entity type that supports associated_works (authors, venues, institutions). I needed all the header stuff to make sure it would work for dehydrated objects as well.

I'm also going to fool around with a search method if you give me a day or two...


class OpenAlex(BaseModel):
    """This models the OpenAlex HTTP API
    OpenAlex has 2 pools for clients.
    Supplying your email will get you into the polite pool.
    :parameter=email
    """
    email: Optional[EmailStr]
    _base_url: str = "https://api.openalex.org/"
    _headers: dict = {
                "Accept": "application/json",
                "User-Agent": f"OpenAlexAPI https://github.com/dpriskorn/OpenAlexAPI"
            }
        #Convenience dict because dehydrated entities do not have works_api_urls and annoying inconsistencies in endpoints (institution vs instititions, host_venue vs venue)
    _works_urls: dict = {
            Author: _base_url+"works?filter=author.id:",
            Concept: _base_url+"works?filter=concept.id:",
            Institution: _base_url+"works?filter=institutions.id:",
            Venue: _base_url+"works?filter=host_venue.id:"
        }

```    class Config:
        underscore_attrs_are_private = True

    def set_email(self,email: EmailStr):
        self.email = email
        self._headers = {
                "Accept": "application/json",
                "User-Agent": f"OpenAlexAPI https://github.com/dpriskorn/OpenAlexAPI mailto:{self.email}"
            }`

`    @backoff.on_exception(backoff.expo,
                          (requests.exceptions.Timeout,
                           requests.exceptions.ConnectionError),
                          max_time=60,
                          on_backoff=print(f"Backing off"))
    def get_associated_works(self, entity: OpenAlexBaseType, limit: int = None) -> List[Work]:
        """Fetches all works associated with the entity, up to some limit.

        :parameter work is OpenAlex Institution, Venue, Author
        :parameter limit is the maximum number of works to return
        """
        if self.email is None:
            print("OpenAlex has 2 pools for clients. "
                  "Please be nice and supply your email as the first argument "
                  "when calling this class to get into the polite pool. This way "
                  "OpenAlex can contact you if needed.")
        per_page = PAGE_LIMIT if limit is None else min(PAGE_LIMIT, limit)
        works = []
        cursor = '*'
        while cursor:
            url = f"{self._works_urls[type(entity)]}{entity.id}&per_page={per_page}&cursor={cursor}"
            response = requests.get(url, headers=self._headers)
            if response.status_code == 200:
                works += [Work(**w) for w in response.json()['results']]
                cursor = response.json()['meta']['next_cursor']
            else:
                raise ValueError(f"Got {response.status_code} from OpenAlex")
            if limit and len(works) >= limit:
                break
        return works[:limit]
    `
kochbj commented 2 years ago

Last couple of thoughts:

-Institutions, Venues, and Author fields need to be expanded (but methods have to work with dehydrated values) -There are a couple supplementary objects (i.e., years, ids) that show up on multiple core types. Do you want to just expand these so that they work for every entity or have entity specific entities. -The formatting of Year in OpenAlex is annoying; It would make more sense to have these attributes as lists rather than dicts and fill in 0 years rather than omitting them. Do you want to parse these into lists or do you feel that's beyond the scope of the package?

dpriskorn commented 2 years ago

@hstct you implemented a dehydrated author type in #1 which has not yet been merged. Do you have any reactions to this pull request?

dpriskorn commented 2 years ago

I'm also going to fool around with a search method if you give me a day or two...

Have you seen the search function in R here? https://github.com/KTH-Library/openalex/blob/main/R/open_alex_restclient.R BTW I just created a project where we can work on user stories that make sense to aid in the development of the library https://github.com/users/dpriskorn/projects/9/views/1

kochbj commented 2 years ago

Hi Dennis,

If you give me less than an hour, I'm going to implement another branch. This one breaks current works functionality by generalizing for entities. The full set of changes I am adding:

  1. Remove get_works methods and replace with get_entities. BREAKS TESTS but removes a lot of superfluous methods.
  2. Add all additional slots in the OpenAlex documentation to the interface. This included making a few new objects like Geo, and HostVenue. The interface should now capture everything in the API.
  3. Added independent search method (as in API, not sophisticated filter search which I agree should be added).
  4. Added get_associated_works for all five core entities.

Responses to your points:

  1. I actually fooled around with dehydrated objects and I don't think they should be added as separate classes because all of the additional slots are optional. You can simply hydrate these objects by calling get_entities. If you wanted to add a call to get_entitites method to BaseType you could, but I think its cleaner to keep all the query methods in the init file.
  2. I know nothing about SWE, but don't understand the value of enums.py. Why can't these just be strings?
  3. My generalization of entities breaks the tests that I know someone else worked hard on. I unfortunately don't have more time to dedicate to this. :( However, I think it gives you or someone else an opportunity to go through my changes and see if you like them.
kochbj commented 2 years ago

Hi Dennis,

I'm pretty much done with my rewrite in last major push. This does break existing get_works, but I think it does make API quite robust. I also changed my tune on dehydrated objects--they're added.

https://github.com/dpriskorn/OpenAlexAPI/pull/10/commits/e97491f3588d8ca1bb5d54f501a9249214c030aa

Best, Bernie

dpriskorn commented 2 years ago

Big thanks for your effort. I'll try to make time to review it next week. Since we are still in alpha I don't have a problem with removing get_works() at this point.