Sefaria's Text API - Githubissues

Sefaria / Sefaria-Project

New Interfaces for Jewish Texts

https://www.sefaria.org

Sefaria's Text API #1343

Open edamboritz opened 1 year ago

edamboritz commented 1 year ago

Hello Sefaria Developer Community!

We are planning on refactoring one of our main API endpoints - the Text API. While doing so, we are interested in making its use more straightforward and also more flexible.

Currently the API always returns two versions of a requested text reference: a "Hebrew" and an "English".

In recent years, Sefaria's data has branched out to include texts that have non-Hebrew source versions (Judeo-Arabic, Aramaic and even English) and also translations of texts into multiple non-English languages (German, Spanish, etc.). We have used the current API to try to still provide that data, but this is no longer sufficient.

So we are looking to improve the way the API allows developers to interact with various languages more directly and give them more control of just what they need the text API to return.

For starters, we are looking to get rid of the forced duality of text versions in the response. The user will be able to request a single version, or two specific languages, or more.

Beyond that we are interested in hearing what users of this API would find useful. Is it the ability to get all versions of a given language? All translations of a certain text? Asking for a language with a fallback to a default language? Any other suggestions or things you'd want to make use of?

Let us know in the comments!

The Sefaria Team

ronshapiro commented 1 year ago

First of all, thank you! This will simplify a lot of interactions with the API.

I assume that there is a "default" translation/version per language per text, so if there are any APIs that return multiple versions for a particular language, I think it would be useful to label this default. Users may have their own logic for what version they want, but standard use cases probably just want the same default that sefaria.org would show.
Any object that has a ref should have that ref specified explicitly. Today, the "root" ref for the request is returned together with some additional information scattered around for how you may want to reconstruct subrefs, but it's non-trivial logic. It would save much headache and awkward heuristic code to have something like:
```
{
  ref: "Genesis 1",
  segments: [
    {
      variants: [...],
      ref: "Genesis 1:1",
    },
    // Or for Talmud texts
    {
      variants: [...],
      ref: "Shabbat 2a.1",
    },
  ],
}
```
Decouple text segments from the JaggedArray structure of the text. Understanding that a text has a jagged array structure is unnecessarily complicated in a lot of cases. Tree traversal is always less comfortable than iteration, even if it's not complicated. I'd have the segments presented in a flat list, and then a separate attribute that represents the tree of refs for the rarer case when this is necessary.
As for APIs that accept languages, I could imagine the default would give just the source text with the default version, with a parameter that can dictate what versions are desirable for what languages. For example: /api/texts/Genesis_1?langs=hebrew,english:all,german:Some%20Specific%20Version,french:[version1,version2] would give the default Hebrew, every English version, a single German version, two French versions, and no Spanish or other language versions. This solves your question about fallback languages for translations: just request the backup language and use it if your priority language isn't present.
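
To make the proposed `langs` syntax concrete, here is a minimal parsing sketch in Python. It is not Sefaria's implementation. It assumes a grammar of `lang`, `lang:all`, `lang:Version%20Title`, and `lang:[v1,v2]`, with a colon as the only separator, and the names `split_top_level` and `parse_langs` are hypothetical.

```python
from urllib.parse import unquote


def split_top_level(spec: str) -> list[str]:
    """Split on commas that are not inside [...] brackets."""
    parts, current, depth = [], [], 0
    for ch in spec:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


def parse_langs(raw: str) -> dict[str, object]:
    """Map each language to "default", "all", or a list of version titles.

    Assumed grammar (hypothetical, taken from the example in this thread).
    `raw` is the still percent-encoded value of the `langs` parameter.
    """
    result: dict[str, object] = {}
    for item in split_top_level(raw):
        lang, sep, selector = item.partition(":")
        lang, selector = lang.strip(), selector.strip()
        if not sep or not selector:
            result[lang] = "default"
        elif selector == "all":
            result[lang] = "all"
        elif selector.startswith("[") and selector.endswith("]"):
            # Split first, decode afterwards, so an encoded comma (%2C)
            # inside a version title does not split the title.
            titles = selector[1:-1].split(",")
            result[lang] = [unquote(t.strip()) for t in titles if t.strip()]
        else:
            result[lang] = [unquote(selector)]
    return result
```

A caller wanting a fallback language would request both languages and pick whichever key comes back non-empty, as suggested above.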

bandleader commented 1 year ago

Glad to hear! Whatever happened to the idea of the GraphQL API I started designing, in #602? It solves many of the issues with the current API elegantly, extensibly, flexibly, discoverably, and self-documentingly. I believe that by nature it will be much easier to implement as well.

See also POC in #741.

EliezerIsrael commented 1 year ago

@bandleader Regarding GraphQL - it's really interesting, and thank you for the thought and work toward a POC.
After some research, we realized that tools like GraphQL have some vulnerabilities. It's been a while since I've had the idea at the front of my mind, but I recall one major weakness: it's easy to craft queries that put significant load on the server. Protecting against those kinds of queries is often non-trivial. It seems like in industry, GraphQL is used by internal Front-End teams to prototype against a Back-End without needing to request API development, but that for production systems, bespoke APIs are created. Said differently, I don't know if GraphQL can be easily and safely exposed to the wide world.

mayerpasternak commented 1 year ago

I don't know if this is the kind of feature you are interested in implementing, but it would be very helpful to us and potentially to others. We use the Text API extensively to link to content that we do not have natively in our app. We make a Text API call and present the content in a local popup within our app. The problem is that the amount of text presented is often very long and not precise enough to help a person zero in on the correct part of the reference.

The way we deal with this for internal links to our own content is that, besides the Page (the main entry in our database), we also include the IDs of a range of phrases, a range of words, or a list of multiple phrases or words, and our internal engine returns the "Page" with those words or phrases highlighted. That way you can see what the author was referring to in the context of the whole source.

In our internal texts we either have ids on every word or on phrases. In the Sefaria texts you do not have ids at that level.

It would be nice if we could specify words 20-25 within a source and have those words wrapped in a tag as the selection that we could render as we please.
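
As an illustration of what "wrap words 20-25 in a tag" could look like on the client side, here is a small Python sketch. It assumes the segment is plain text without embedded markup and that words are whitespace-separated runs; the function name `wrap_words` and the default `mark` tag are hypothetical choices, not part of any Sefaria API.

```python
import re


def wrap_words(text: str, start: int, end: int, tag: str = "mark") -> str:
    """Wrap words start..end (1-based, inclusive) in <tag>...</tag>.

    Words are runs of non-whitespace characters, and the original spacing
    is preserved. Out-of-range requests return the text unchanged.
    """
    words = list(re.finditer(r"\S+", text))
    if not words or start < 1 or end < start or start > len(words):
        return text
    end = min(end, len(words))
    a = words[start - 1].start()  # offset of the first selected word
    b = words[end - 1].end()      # offset just past the last selected word
    return f"{text[:a]}<{tag}>{text[a:b]}</{tag}>{text[b:]}"


print(wrap_words("In the beginning God created the heaven", 4, 5))
# In the beginning <mark>God created</mark> the heaven
```

Text that already contains HTML (footnotes, bold, italics) would need the tags handled before counting words, which this sketch does not attempt.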

mayerpasternak commented 1 year ago

In order for that to be useful, we would need some kind of view of the Sefaria texts on the website where we could see a number next to each word, so that we could specify the words in the call to the Text API. On our own internal site that serves up our own XML files, we have a way to toggle on IDs so that we can make very precise citations.

Especially on a mobile device, a popup with screens and screens of text makes it hard for people to figure out what specifically is being referred to. Here is a crude citation example.

bandleader commented 1 year ago

@EliezerIsrael Regarding GraphQL, I had responded in the original GH issue that there is no inherent security issue with GraphQL. GraphQL APIs are used by large and security-conscious companies all over the world, including by the very GitHub app we are conversing on :)

As described, the one thing about GraphQL that is relevant security-wise is that queries are very flexible, and you can write queries inside other queries, therefore a user could write a query that takes a huge amount of work to execute, and it still looks like a single request, so if you're rate-limiting by number of requests, then you have a problem.

However:

Sefaria's REST API has this problem too. Right now, I can run a query for Genesis 1 with commentaries and it will tie up your CPU for ~20.5 seconds.
The solution (in both cases) is very simple. You simply rate limit by something less arbitrary than number of HTTP requests.

Either you can count the number of sub-queries, i.e. if you ask for commentaries then every commentary counts as a 'hit' towards the user's API quota. The same goes for GraphQL, where you can request multiple texts in a single query: every text requested counts as a 'hit'.

Or you can simply measure the amount of CPU time taken up by a query and rate-limit based on that, e.g. you can only send 5 seconds of queries every minute, and no more than 5 minutes per hour, etc. This is quite easy.
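
A rough Python sketch of the cost-based idea follows: a sliding-window budget that charges each query by its measured cost (CPU seconds, or a count of texts and commentaries) rather than by request count. The class name `CostBudget` and its methods are hypothetical, and a real deployment would also need per-user storage and locking, which are left out here.

```python
import time
from collections import deque


class CostBudget:
    """Sliding-window limiter that charges by cost, not by request count."""

    def __init__(self, max_cost: float, window_seconds: float, clock=time.monotonic):
        self.max_cost = max_cost      # e.g. 5.0 CPU seconds
        self.window = window_seconds  # e.g. 60.0 seconds
        self.clock = clock            # injectable so the limiter can be tested
        self.events = deque()         # (timestamp, cost) pairs inside the window
        self.spent = 0.0

    def _expire(self, now: float) -> None:
        # Forget charges that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, cost = self.events.popleft()
            self.spent -= cost

    def allow(self) -> bool:
        """True if the caller still has budget left in the current window."""
        self._expire(self.clock())
        return self.spent < self.max_cost

    def charge(self, cost: float) -> None:
        """Record the measured cost of a query that has just finished."""
        self.events.append((self.clock(), cost))
        self.spent += cost
```

Because the cost is only known after a query runs, this design lets one expensive query overshoot the budget and then blocks further queries until the window clears. Stacking two instances (per minute and per hour) would give the "5 seconds per minute, 5 minutes per hour" behavior described above.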

The GraphQL docs detail these things here, which in turn links here.

Let me know if you need any clarification!

nissamai commented 1 year ago

@bandleader I do think it's important to note that the Sefaria.org application itself runs on these APIs, so switching to GraphQL would require the dev team to rework the web application architecture to use the GraphQL runtime instead of the REST API. Besides being a ton of work, that could come with its own set of issues (caching, etc.), so it makes sense for there to be some resistance to using it even apart from the security question.

Re improvements to the Text API, assuming we're keeping the same architecture: I've also noted the issue that it's possible to make fairly large queries against our current API with the commentary flag. My suggestion was going to be that, if we want to continue to allow pulling connections outside of the links API, the texts API should have a more granular system for requesting commentaries and connections and/or just their indices (i.e. make it possible to query the text of, or metadata about, particular commentaries along with the base text, but also limit the default scope of what gets pulled into the response in some thoughtful way).

I also think that the flags that feel very "coupled" with particular app behavior and defaults, along with some of the default values, should be re-evaluated. A number of people have asked why the default behavior is context=1, for example, and my assumption was that this default is tied to what is most useful for the web application. Outside of that, I would think it generally makes sense for the default to be 0, both in terms of default resource consumption and what the user would expect.

I like @ronshapiro's idea about requesting languages.

bandleader commented 1 year ago

@nissamai

GQL requires no "runtime" client-side (which is what I assume you mean). fetch and a template string is all that's required. (The optional tooling is for advanced use cases but even without it you'd get the same experience or better than with REST.)
I'm not aware of any 'caching' issues that differ from those of a REST API.
The old API would likely be preserved for quite a while, and the frontend updated gradually. No difference between choice of REST or GQL for the APIv2.

nissamai commented 1 year ago

@bandleader ah, I think I misunderstood the changes you were suggesting on some level (just looked at the linked issues/POC and see that it's just an HTTP endpoint that handles the client requests written in GQL). Thanks for clarifying!

EliezerIsrael commented 1 year ago

@bandleader GraphQL does merit consideration, but I think the scale and complexity of the implementation is too much for our team to swallow at the moment.

@mayerpasternak We made a design decision way back in our early days to divide texts into segments and not words. It let us move quickly, but it definitely has downsides. We run up against it ourselves.
It seems to me, given the implementation constraints, that word level highlighting belongs a level above the bare texts API. I could imagine something on an SDK level that takes a Ref and a string of text (or text boundaries of some sort), queries the Sefaria API, then wraps the needed text. It seems like you've implemented it in-house, but I could imagine that provided at the Sefaria SDK level.

@ronshapiro You're right about iteration and the weirdness of JaggedArrays. Good time to bring that up. And your thoughts about request format are interesting. I think we do want to allow the user to specify a list of languages (3-letter language codes, likely). The highest-priority original text will probably have a reserved word like "base". We need to define how to specify, language by language:

One highest-priority version of a lang
A specific version of a lang
Multiple specific versions of a lang
All versions of a lang

Your suggestion ticks most of those boxes. I'm wondering - have you seen anything similar in the wild? It seems like we can't easily avoid a busy syntax for this.

mayerpasternak commented 1 year ago

Word-level highlighting could be done externally, but your website does not show word numbers, so our scholars are working blind unless we create a new interface to your texts that exposes the number of each word. We would also need to take your response from the Text API and assign numbers to each word to create the highlighting. We could possibly build all of this outside of your system, but I suspect that other users could benefit from it, so it might make sense to add this as an option instead of everyone building their own system.
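
For reference, the word-numbering view described here is cheap to prototype. The Python sketch below numbers whitespace-separated words in a segment string; the names `number_words` and `render_numbered` are hypothetical. It assumes plain text and ignores questions such as whether a maqaf-joined Hebrew pair counts as one word or two.

```python
def number_words(text: str) -> list[tuple[int, str]]:
    """Pair each whitespace-separated word with its 1-based position."""
    return list(enumerate(text.split(), start=1))


def render_numbered(text: str) -> str:
    """Render a segment with its index in brackets after every word."""
    return " ".join(f"{word}[{n}]" for n, word in number_words(text))


print(render_numbered("When God began to create heaven and earth"))
# When[1] God[2] began[3] to[4] create[5] heaven[6] and[7] earth[8]
```

Whatever tokenization rule is chosen, the viewer and the highlighter must share it, or the numbers a scholar reads off will point at the wrong words.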

ronshapiro commented 1 year ago

I care little about how to specify the versions as long as there are those options. I imagine that the common bugs are going to be around URL-encoding version titles. You could try to do something clever with HTTP headers... but those are always a little less obvious.

Another idea about jagged arrays: make the format/structure configurable. Perhaps there are people who want jagged array structures. But make them ask for it.
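
A sketch of what "flat by default, jagged on request" might mean, in Python. It assumes a jagged array is a nested list of strings and that refs are built as `Title 1:2` from 1-based indices, so Talmud-style addresses such as `2a.1` are not handled. The names `flatten_segments` and `shape_text` and the `structure` option are hypothetical.

```python
from typing import Iterator, Union

Jagged = Union[str, list]


def flatten_segments(title: str, node: Jagged, address: tuple = ()) -> Iterator[dict]:
    """Yield {"ref", "text"} dicts for every non-empty leaf, in reading order.

    `address` holds section numbers already fixed by the request,
    e.g. (1,) when the jagged array is the content of "Genesis 1".
    """
    if isinstance(node, str):
        if node:  # skip empty placeholder segments
            suffix = ":".join(str(n) for n in address)
            yield {"ref": f"{title} {suffix}" if suffix else title, "text": node}
        return
    for i, child in enumerate(node, start=1):
        yield from flatten_segments(title, child, address + (i,))


def shape_text(title: str, jagged: Jagged, structure: str = "flat"):
    """Return a flat list by default; hand back the jagged array on request."""
    if structure == "jagged":
        return jagged
    return list(flatten_segments(title, jagged))
```

With this shape, a client iterates over `shape_text("Genesis", data)` and reads each `ref` directly instead of reconstructing it from positions in the tree.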

EliezerIsrael commented 1 year ago

@mayerpasternak I hear what you're looking for. I just opened up a new issue for it, so we can workshop highlighting subsections of a segment on its own terms.

bdjnk commented 1 year ago

Please provide an option to ensure requests don't receive any markup in the returned text, but rather plain text alone. Currently HTML is returned in some versions of some texts.

For example, when I simply request https://www.sefaria.org/api/texts/genesis-1:1 I receive the first two verses of the text as follows.

"text": [
  "When God began to create<sup class=\"footnote-marker\">*</sup><i class=\"footnote\"><b>When God began to create </b>Others “In the beginning God created.”</i> heaven and earth—",
  "the earth being unformed and void, with darkness over the surface of the deep and a wind from<sup class=\"footnote-marker\">*</sup><i class=\"footnote\"><b>a wind from </b>Others “the spirit of.”</i> God sweeping over the water—",

This is not easily digestible by any system not intending to directly display the returned text in an HTML context as it is riddled with markup. Further, stripping the tags is non-trivial as in these cases their contents should be deleted, whereas in other cases perhaps it should be retained. And, on a more abstract level, it interleaves secondary text with the primary text in a particular manner, which constrains its usefulness even in an HTML context.
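
To show why a plain regex tag-strip is not enough, here is a Python sketch using the standard library's `html.parser`. It drops the contents of elements whose class is `footnote` or `footnote-marker` (the two classes visible in the sample above) and keeps the text of every other element. It assumes well-formed markup, and the names `FootnoteStripper` and `plain_text` are hypothetical.

```python
from html.parser import HTMLParser


class FootnoteStripper(HTMLParser):
    """Collect text, skipping anything inside footnote elements."""

    DROP_CLASSES = {"footnote", "footnote-marker"}
    VOID_TAGS = {"br", "hr", "img"}  # tags that never get a closing tag

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []      # one bool per open tag: True if it opened a dropped region
        self.drop_depth = 0  # > 0 while inside a footnote element

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        dropping = bool(classes & self.DROP_CLASSES)
        self.stack.append(dropping)
        if dropping:
            self.drop_depth += 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS:
            return
        if self.stack and self.stack.pop():
            self.drop_depth -= 1

    def handle_data(self, data):
        if self.drop_depth == 0:
            self.out.append(data)


def plain_text(html: str) -> str:
    """Return the primary text with footnotes removed and other tags unwrapped."""
    parser = FootnoteStripper()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)
```

Other versions may use different classes or tags for secondary text, so the `DROP_CLASSES` set would have to be checked against the actual data.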

rneiss commented 1 year ago

@bdjnk -- You can get part of the way there by tacking on a stripItags=1 parameter --

e.g.: https://www.sefaria.org/api/texts/genesis-1:1?stripItags=1

["When God began to create heaven and earth—", "the earth being unformed and void, with darkness over the surface of the deep and a wind from God sweeping over the water—"],

It still leaves the markup for bold, italics, etc, but removes the footnoted content.

shelfgot commented 1 year ago

I would very much appreciate it if y'all would be willing to make the Texts API support Hebrew refs, e.g. searching בראשית כג:א or בראשית 23.1. I know that there is a list of Hebrew titles somewhere in the codebase, because I searched for it a few months ago, but I discovered that the Python method (or lines of code, I don't remember) that would have exposed this book list was commented out, so that's an easy fix. That list is a critical part of the queries I work with, which use strong autocomplete to validate ref titles before making said queries. It could very well be that this is implemented and (as with many updates to the API over the past few years) I happen to be out of the loop, but I wasn't able to get anywhere with some cursory queries of this type.

rneiss commented 1 year ago

Hey @shelfgot -- the texts API actually does support Hebrew refs, e.g. בראשית כג:א. It does, however, require the Hebrew text to be percent-encoded (the browser does this itself, but in code you may need to do so explicitly).
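
For anyone hitting this from code, a minimal Python sketch of the explicit percent-encoding step using the standard library. The base URL is the one quoted earlier in this thread; the helper name `texts_url` is hypothetical.

```python
from urllib.parse import quote

BASE = "https://www.sefaria.org/api/texts/"


def texts_url(ref: str) -> str:
    """Build a texts API URL with the ref percent-encoded as UTF-8."""
    # safe="" also encodes ':' and '/', so the whole ref stays one path segment.
    return BASE + quote(ref, safe="")


print(texts_url("Genesis 1:1"))
# https://www.sefaria.org/api/texts/Genesis%201%3A1
```

The same call handles a Hebrew ref such as `בראשית כג:א`, producing an ASCII-only URL.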