Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

GraphQL API #602

Open bandleader opened 3 years ago

bandleader commented 3 years ago

Hello, I'm new as a Sefaria API dev and hope I manage to help out the project a bit as I benefit from it! Thank you to all the devs, contributors, management and backing for an amazing project, literally one of the most important projects in the world.

IMHO, Sefaria would benefit greatly from having a GraphQL API. Of all my suggestions, I think this would be the one with the most wide-ranging benefits for the least cost, as well as arguably the most obvious.

  1. Sefaria APIs are relatively complex, and would benefit from the discoverability of GraphQL and/or GraphiQL, especially for new users.
  2. Sefaria APIs send a huge number of keys for each book/commentary, when usually the requester will end up using one or two. GraphQL lets you specify exactly which fields you need, improving performance as well as server CPU and bandwidth utilization.
  3. Sefaria's data is by nature deeply nested (perfect for a graph), so when designing REST APIs (which are by nature flat), awkward workarounds are used in order to fetch the depth of the graph. As one example: Some consumers just want a text, some also want a list of commentaries, some want the text of the commentaries as well. Currently you either have to make multiple API calls, or ask for much more than you need.
  4. Sefaria APIs sometimes return data in the incorrect shapes (e.g. #547). GraphQL checks responses against the typed schema, so any such mismatch would surface as an explicit error in the log.
  5. Sefaria is constantly evolving. Just as two timely examples, we hope to soon have the Vilna Tzurat HaDaf (#72) available as metadata, and recently on the mailing list there was a discussion about making the Shulchan Aruch siman headings available as metadata, etc. Creating separate APIs for all these things will result in a mess as well as multiple requests being necessary. Likely they would be added on as keys or info to every API call, and that means that every additional piece of data we offer will affect server work and bandwidth in a linear fashion.
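
To make point 2 concrete, here is a minimal Python sketch of what field selection amounts to: walk a selection and return only the requested keys. The payload, its key names and the `select_fields` helper are all hypothetical illustrations; this is not Sefaria's actual response shape, nor any GraphQL library's API.

```python
def select_fields(data, selection):
    """Return only the parts of `data` named in `selection`.

    `selection` maps a key to True (keep the value as-is) or to a nested
    selection dict (recurse into the value).
    """
    if isinstance(data, list):
        # Apply the same selection to every item of a list.
        return [select_fields(item, selection) for item in data]
    result = {}
    for key, sub in selection.items():
        if key not in data:
            continue
        result[key] = data[key] if sub is True else select_fields(data[key], sub)
    return result


# Hypothetical, heavily simplified payload; real responses carry many more keys.
payload = {
    "title": {"en": "Ecclesiastes 5", "he": "קהלת ה"},
    "sectionNames": ["Chapter", "Verse"],
    "textLines": [{"en": "line one", "he": "שורה א"}],
    "versions": ["v1", "v2"],
}

# Roughly what `{ title { en } textLines { he } }` asks for.
selection = {"title": {"en": True}, "textLines": {"he": True}}
trimmed = select_fields(payload, selection)
```

Everything not named in the selection (here `sectionNames`, `versions` and the unused language of each field) never leaves the server.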

Example

Just as an example, asking for commentaries on Genesis 1 returns a 46.7 MB JSON document and takes (for me) 20.5 seconds. This is (a) not really an acceptable UX for most applications, and (b) hard even DX-wise, as Chrome's DevTools choke on the JSON and I can't inspect it properly.

If all an app wanted was the Hebrew text of the Gemara, Rashi and Onkelos (as in #601), that would likely be under 20 KB (1/384th of 7.5 MB). The app would have no need for titles, section names, English translation, Ramban, and the other 383/384 of the JSON returned (as well as the server utilization cost).
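
The ratio can be checked with a few lines of Python, taking the figures in this paragraph as given (20 KB wanted, 7.5 MB returned) and assuming 1 MB = 1024 KB:

```python
wanted_kb = 20             # estimated size of the text actually needed
returned_kb = 7.5 * 1024   # 7.5 MB expressed in KB

ratio = returned_kb / wanted_kb    # how many times larger the response is: 384.0
unused_fraction = 1 - 1 / ratio    # share of the response that goes unused
```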

Way Forward

Although I'm not good with Python, I understand that Python works very well with GraphQL and it's quite easy to write servers/resolvers. I won't really be useful writing code, but if necessary I can בל"נ write GraphQL schemas as well as do testing and perhaps even documentation (which can be mostly generated automatically, and even experimented with live using GraphiQL).

If people are not familiar with GraphQL and the DX advantages, I could perhaps create a model schema and deploy a simple API as a sort of live mockup, so you can see what a difference it would make.

I'd also like to say that GraphQL resolvers generally resemble objects/classes with properties/function calls, so it might be very feasible to implement this on top of the existing Python classes, or as a thin layer on top of them. As well, it is entirely feasible for REST APIs to use GQL resolvers to generate their data, reducing code duplication.
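
A rough sketch of that idea in plain Python, with no GraphQL library involved. The `Text` class, the `RESOLVERS` table and both functions are hypothetical stand-ins; I am not claiming this is how the existing backend classes look.

```python
class Text:
    """Stand-in for an existing backend class (hypothetical)."""

    def __init__(self, ref, en_lines, he_lines):
        self.ref = ref
        self.en_lines = en_lines
        self.he_lines = he_lines


# One resolver per field: a plain function from the backend object to a value.
RESOLVERS = {
    "ref": lambda text: text.ref,
    "en": lambda text: text.en_lines,
    "he": lambda text: text.he_lines,
}


def resolve(text, fields):
    """Resolve only the requested fields, as a graph layer would."""
    return {name: RESOLVERS[name](text) for name in fields}


def rest_text_endpoint(text):
    """A REST-style handler reusing the same resolvers with a fixed field list."""
    return resolve(text, ["ref", "en", "he"])


kohelet = Text("Ecclesiastes 5", ["line one"], ["שורה א"])
only_hebrew = resolve(kohelet, ["he"])   # what a graph query for { he } would get
full = rest_text_endpoint(kohelet)       # what the fixed REST response would get
```

The point of the design is that each field is computed in exactly one place, so the REST handler becomes a fixed selection over the same resolvers.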

EDIT: demo available, see below

bandleader commented 3 years ago

As an example, the URL https://www.sefaria.org/api/texts/Ecclesiastes.5 (with no commentaries) currently returns 54 keys.

bandleader commented 3 years ago

As a demo, I produced a nearly fully-functional GraphQL version of the Text API.

(Currently it is a caching proxy to the official REST API, but the goal would be for you to take the GraphQL schema I wrote and use it in your Python backend.)

You can try it out here: https://enez2.sse.codesandbox.io/

  1. Start by pasting in the query below. Click the ▶ button to run the query.
  2. Start typing 'book' on a blank line. Notice the completion and the API docs context help.
  3. Use the "Docs" tab on the right side for more info.
  4. See if you can figure out how to fetch commentaries in the same request.
  5. See if you can figure out how to request only Rashi and Onkelos. (spoiler: see second query below)

query {
  text(ref: "Kohelet 5") {
    # Start typing 'book' on the line below, and see what happens

    title {
      en
    }
    textLines {
      he
    }
  }
}
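
Outside the GraphiQL page, a query like the one above is normally sent as a JSON POST body with a `query` key. Here is a minimal client-side sketch using only the Python standard library. It builds the request without sending it, and it assumes the demo accepts POSTs at its root URL, which I have not confirmed.

```python
import json
import urllib.request

# Demo URL from this thread; that it accepts POSTs at the root is an assumption.
ENDPOINT = "https://enez2.sse.codesandbox.io/"

QUERY = """
query {
  text(ref: "Kohelet 5") {
    title { en }
    textLines { he }
  }
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a JSON POST request carrying a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request(QUERY)
# To actually send it: urllib.request.urlopen(request), then json.load() the response.
```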

@M-Zuber Use this to get Mikra, Rashi and Targum in Hebrew and nothing else (as requested in #601)

query {
  text(ref: "Vayikra 1") {
    title {
      en
    }

    textLines {
      he
    }

    commentaries(filterByTitles: ["Rashi", "אונקלוס"]) {
      title {
        en
      }
      textLines {
        he
      }
    }
  }
}

EliezerIsrael commented 3 years ago

@bandleader thanks for the suggestion. I have to admit that I don't yet have any experience with GraphQL, outside of trivial integration with Facebook. I'll aim to study up. Who else in the community has experience here? Feel free to sound off on this thread.

bandleader commented 3 years ago

@EliezerIsrael GraphQL has great docs, and I'm happy to help if I can. You're also welcome to use the GraphQL schema I designed (click the 'Schema' tab here), so some of the work is already done.

Also, I added the Sefaria Calendar API to my demo Sefaria GraphQL API. I mention this because it's another example of where a graph API shines: you can not only get the level of detail you want for each calendar item, you can even ask for the actual text in the same request. (And optionally other fields, versions, commentaries, translations, filtering, etc... all the regular options I already provide for texts)

# Try this at https://enez2.sse.codesandbox.io
query {
  calendarSections {
    items {
      type { en }
      value { en }
      text {
        textLinesJoined { he }
      }
    }
  }
}

For Shnayim Mikra, just apply filtering -- both on the calendar sections and on the commentaries:

# Try this at https://enez2.sse.codesandbox.io
query {
  calendarSections(filterByTypes: ["Parashat Hashavua"]) {
    items {
      text {
        textLinesJoined { he }
        commentaries(
            filterByTitles: ["Rashi", "אונקלוס"],
            # We don't need Rashi on Judges, for instance,
            # which has the category "Quoting Commentary"
            filterByCategories: ["Commentary", "Targum"] 
        ) {
          title { en }
          textLinesJoined { he }
        }
      }
    }
  }
}
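
For anyone wondering what the two filters amount to on the server side, here is a minimal Python sketch. It assumes the filters combine with AND (which is what the Judges comment in the query implies), and the record layout and `filter_commentaries` name are hypothetical, not taken from the demo's code.

```python
def filter_commentaries(commentaries, titles=None, categories=None):
    """Keep commentaries matching every filter that was supplied.

    A filter left as None is not applied.
    """
    kept = []
    for item in commentaries:
        if titles is not None and item["title"] not in titles:
            continue
        if categories is not None and item["category"] not in categories:
            continue
        kept.append(item)
    return kept


# Hypothetical records; the category names follow the query above.
commentaries = [
    {"title": "Rashi", "category": "Commentary", "on": "Genesis"},
    {"title": "Rashi", "category": "Quoting Commentary", "on": "Judges"},
    {"title": "Ramban", "category": "Commentary", "on": "Genesis"},
    {"title": "אונקלוס", "category": "Targum", "on": "Genesis"},
]

shnayim_mikra = filter_commentaries(
    commentaries,
    titles=["Rashi", "אונקלוס"],
    categories=["Commentary", "Targum"],
)
```

With both filters applied, only Rashi on Genesis and the Targum remain; the title filter alone would also have let Rashi on Judges through.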

Bonus: since GraphQL is composable, it makes it easy to add parameters like stripTrop, stripNikud, and stripHtmlTags, and they'll work wherever { en } and { he } do. I've already implemented them in my demo; try them out:

# Try this at https://enez2.sse.codesandbox.io
query {
  text(ref: "Kohelet 5") {
    textLinesJoined {
      en(stripHtmlTags: true)
      he(stripTrop: true, stripNikud: true)
    }
  }
}
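
For the backend side, the three options could be implemented roughly as below. This is a sketch, not the demo's code: the Unicode ranges are my reading of the Hebrew block of the Unicode chart, the tag regex is deliberately naive, and `clean` is a made-up name.

```python
import re

# Cantillation marks (trop) occupy U+0591 to U+05AF.
TROP = re.compile("[\u0591-\u05AF]")
# Vowel points and related marks: U+05B0 to U+05BD, rafe, shin/sin dots, qamats qatan.
NIKUD = re.compile("[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C7]")
# Naive tag matcher; a real implementation would likely use an HTML parser.
HTML_TAG = re.compile(r"<[^>]+>")


def clean(text, strip_trop=False, strip_nikud=False, strip_html_tags=False):
    """Apply whichever stripping options were requested, leaving the rest intact."""
    if strip_html_tags:
        text = HTML_TAG.sub("", text)
    if strip_trop:
        text = TROP.sub("", text)
    if strip_nikud:
        text = NIKUD.sub("", text)
    return text


# The first word of Genesis with vowels and one cantillation mark, wrapped in a tag.
pointed = "<b>\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0\u05e9\u05c1\u05b4\u0596\u05d9\u05ea</b>"
bare = clean(pointed, strip_trop=True, strip_nikud=True, strip_html_tags=True)
```

Punctuation such as maqaf (U+05BE) and sof pasuq (U+05C3) is left alone here; whether the options should remove it too is a design decision I have not tried to settle.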

bandleader commented 2 years ago

@monove Works for me, try again?

monove commented 2 years ago

Working now. I was getting a 503 before. This is so fast and amazing! @EliezerIsrael: this would seem like a win-win for both those using the API and Sefaria, as this will lower everyone's bandwidth costs significantly and reduce response times and load by an incredible amount, no?

danyeric123 commented 2 years ago

@EliezerIsrael I have experience with GraphQL and I was actually going to suggest it to Sefaria, but didn't know whether my suggestion would be appreciated. I think this is a great idea and I second everything @bandleader said. I would also add that it can especially help for mobile since this was part of the reason why Facebook created GraphQL. I highly recommend this talk to understand the costs and benefits of GraphQL: https://youtu.be/djKPtyXhaNE (Also shameless plug for a blog post I wrote on the topic: https://medium.com/geekculture/graphql-the-good-the-bad-and-the-bottomline-623de7dbcffb )

EliezerIsrael commented 2 years ago

@JonMosenkis has prepped a proof of concept on PR #741

EliezerIsrael commented 2 years ago

One of the concerns that I have is how to avoid pathological queries. For example - if one can query sources linked to a source, what's to prevent a user from querying a search space that would cause a killing load on the webserver?

My inclination is to keep this as a separate branch and deploy it against its own DB instance, read only, until we can get a good picture of the load it causes, and how to put guardrails on it.

danyeric123 commented 2 years ago

@EliezerIsrael That is actually a concern with GraphQL I think. In one of the videos I linked to in my blog post, the person mentions it. There are ways of configuring the GraphQL server to only accept certain queries, but it is not easy and can become complicated quickly. (I have not personally done this, but this is what I found out in my research.)
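
For reference, the "only accept certain queries" approach usually boils down to an allowlist keyed by a hash of the query text. Here is a minimal sketch using only the Python standard library; the class and function names are made up, and this is not any particular GraphQL server's API.

```python
import hashlib


def fingerprint(query):
    """Hash a query after collapsing whitespace, so formatting does not matter.

    Caveat: this also collapses whitespace inside string arguments.
    """
    normalized = " ".join(query.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class QueryAllowlist:
    """Accept only queries whose fingerprint was registered ahead of time."""

    def __init__(self, approved_queries):
        self._approved = {fingerprint(q) for q in approved_queries}

    def is_allowed(self, query):
        return fingerprint(query) in self._approved


approved = QueryAllowlist(
    ['query { text(ref: "Kohelet 5") { textLines { he } } }']
)

same_query_reformatted = 'query {\n  text(ref: "Kohelet 5") {\n    textLines { he }\n  }\n}'
different_query = 'query { text(ref: "Kohelet 5") { textLines { en } } }'
```

Here `approved.is_allowed(same_query_reformatted)` is true and `approved.is_allowed(different_query)` is false. In practice the approved queries would take the ref as a variable rather than hard-coding it, which is part of why this gets complicated quickly.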

bandleader commented 2 years ago

This is actually mentioned in the GraphQL docs. In a nutshell: there isn't anything a GQL request can make your server do that it can't already do through REST. It's just that if you were previously throttling based on the raw number of HTTP requests, with GQL you have to take into account that a single request can contain multiple queries, and that queries can be nested within other queries (like my query above that gets today's parsha from the Calendar API and then gets Chumash and Targum for it).

Your Python GQL lib of choice may have built-in support for at least timeouts and query depth.
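
If the library does not offer a depth limit, a crude one is not hard to approximate. The sketch below measures brace nesting in the raw query string and rejects anything deeper than a cap. It is deliberately naive (it does not parse the query, expand fragments or treat block strings specially), and both `MAX_DEPTH` and the function names are assumptions of mine.

```python
MAX_DEPTH = 6  # arbitrary cap chosen for illustration


def query_depth(query):
    """Return the deepest level of { ... } nesting in a GraphQL query string."""
    depth = deepest = 0
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == '"':
            # Skip to the closing quote, honouring backslash escapes.
            i += 1
            while i < len(query) and query[i] != '"':
                i += 2 if query[i] == "\\" else 1
        elif ch == "#":
            # Skip the rest of a comment line.
            while i < len(query) and query[i] != "\n":
                i += 1
        elif ch == "{":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == "}":
            depth -= 1
        i += 1
    return deepest


def check_depth(query, max_depth=MAX_DEPTH):
    """Raise ValueError before execution if the query nests too deeply."""
    depth = query_depth(query)
    if depth > max_depth:
        raise ValueError(f"query depth {depth} exceeds limit {max_depth}")
    return depth


shallow = 'query { text(ref: "Kohelet 5") { title { en } } }'
shallow_depth = check_depth(shallow)  # 3
```

A depth cap alone does not bound cost (a shallow query can still ask for a huge list), so it would sit alongside timeouts rather than replace them.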

All of this is detailed here, which in turn links here.