Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org
657 stars 271 forks source link

API: text.length != he.length #543

Open ronshapiro opened 4 years ago

ronshapiro commented 4 years ago

https://www.sefaria.org/api/texts/Shabbat.56b returns an API response where the text attribute is a list of 19 items, but the he attribute has 20 items.

It's unfortunate that parallels arrays are part of the API, but if they are, shouldn't they at least be of equivalent lengths?

I presume this is because the Hadran doesn't have an English translation (though many/most do, AFAICT). But I feel like this could probably a bigger problem.

At a minimum, I think these should return equal-length lists. A nice to have would be an API v2 where one list is returned, with nested objects that each have text/he attributes (and if a particular segment doesn't have one or the other, these attributes can be omitted).

ronshapiro commented 4 years ago

I'm running a script now to find all of the instances here (in case the quick fix is to just update the texts. So far I've seen the same in Bava Batra 27b, 60b, and 91b, all with Hadran as the last segment. But interestingly Shabbat 34b, which also has Hadran last, does have an English translation.

ronshapiro commented 4 years ago

Here's the full list:

ronshapiro commented 4 years ago

Kiddushin 22b is an exception, and it's an interesting one:

https://www.sefaria.org/Kiddushin.22b.20?lang=bi&with=all&lang2=en

Kiddushin.22b.20 is just English, no Hebrew/Aramaic, and the English is a translation of the final words of 22b.19.

Actually, the end of 22b.19 has an English translation. It seems that what is happening is that 22b.19 is selecting the William Davidson English translation, and then the website proceeds to 22b.20 which is only present in the Daf Shevui translation (https://www.sefaria.org/Kiddushin.22b.19?ven=Daf_Shevui&vhe=William_Davidson_Edition_-_Aramaic&lang=bi&with=all&lang2=en). That's why the website (and the API) appears to have some "duplicate" text - there are two translations presented!

I mentioned this to corrections@sefaria.org also: I think the easiest fix for this is to just merge the last sentence of the Daf Shevui into 22b.19.

JonMosenkis commented 4 years ago

This is not a bug, it's a feature. In every text in Sefaria, we supply the English and Hebrew in their own prospective arrays. This is a special case of English and Hebrew being mismatched. There are entire books which only exist in one language.

JonMosenkis commented 4 years ago

Also, it's important not to separate the concerns of the backend from the frontend apps that consume the data. Someone asking for just one language doesn't necessarily want his data padded with blank entries. Our backend should represent the data as it is. Frontend apps can then manipulate that data as needed.

ronshapiro commented 4 years ago

So how do you explain a response like https://www.sefaria.org/api/texts/Shulchan_Arukh%2C_Orach_Chayim.40?

Only the last segment has an English translation, so the text array is padded with blank strings.

This is a problem with parallel arrays - you need to have a way to correlate values. Am I wrong to assume that these arrays are at least semi-parallel?

Having an empty array for either of these values is understandable if there is no translation at all. But if there are some correlating values, there must be a sensible way to correlate them.


What I suggested about a single array of objects would address all of these concerns. If there is no translation, then the translation attribute is just missing from each element. If it's present on only some, than only those elements have the attribute. I'm not suggesting that you implement this now - API migrations take time and planning. Just a thought experiment.

nsantacruz commented 4 years ago

@ronshapiro Thanks for the suggestion. Like @JonMosenkis said, there is no bug here since the two arrays can be combined and padded on the frontend (by matching each corresponding element and then padding the shorter array to match the length of the longer one). This is however a somewhat confusing API since it's not immediately clear what the proper way to combine the two arrays is.

We will likely need to refactor this API once we start adding more languages to our database (bh in the near future) in which case having more than 2 arrays at the top level will not scale well. When that happens, your suggestion is a good one for how to refactor.

ronshapiro commented 4 years ago

APIs should be designed to be easily understandable, without any gotchas. This isn't a question about massaging so frontends can be stupid, it's about making the data you expose clear. Consistency helps remove developer confusion.

If there are two arrays that are seemingly parallel, but not always of equal length, it's unclear how elements should line up. If some translations don't exist, are only present translations in the list, and if so, how do I correlate? Oh wait, sometimes I see blanks are returned - so it seems like there will always be a matching element if a translation is present for some elements. Oh, now I see that blanks are not always returned, why is that? Are the blanks that I've seen in the past programmatically inserted (that seems reasonable, but why are they sometimes missing?) or are they manually inserted (seems like a lot of busywork, but maybe that's why they're sometimes missing, and if so, maybe I can't rely on my assumption before of correlating values in the array)? \

If on day N, text Y has 5 Hebrew segments, and only the first and last with translations in the Sefaria Community Translation. On Day N + 1, the last translation is deemed wildly inaccurate by corrections@sefaria.org, and removed from the database. MyCoolApp that was previously expecting matching elements is now broken because the shape of the array is totally different.

Is one the array in the text's source language always as long, or longer than the translation array? (A bug in the Daf Shevui translation for Kiddushin 22b mentioned above made this not the case until I reported it to corrections@ and they fixed it.) What does it mean that the translation array is longer than the source text array? Which of the elements is a duplicate, and how do I write an algorithm to figure that out?

To answer these questions, users of the API need to search real world examples to verify their assumptions. They may not find good examples, and they may guess incorrectly.

Some of these are hypothetical, but most of them are real questions that I had, and some that I still have because the documentation makes no guarantees or specifications of the many assumptions that I have made.

I mentioned above, but worth rementioning - I think the current behavior of returning an empty list if there is no text in a particular language is totally reasonable. It's meaning is clear (and documented!).

It's also worth mentioning that gzipped, repeated blank strings are effectively free. If the concern of omitting them in the corner case that a translation is missing at the end of a text is one of performance, that doesn't seem to be well founded, and it also isn't worth a confusing API IMO.

EliezerIsrael commented 4 years ago

This is a good suggestion to keep in mind as we architect the API to cover more languages.

ghost commented 2 years ago

If this mismatching lengths is supposed to be a "feature" (it's really not, it's just saving on space, actually), then it needs to be documented. Currently, Sefaria's API documentation is not so great, and it's probably because the API is overly complicated and badly architected.

I'm trying to create a Gemini proxy/interface using Sefaria's API, and one look at the /api/index results makes me want to completely give up on trying to do this. I don't know how any frontend developers continue to use this API - It'd make me want to pull my hair out everytime I have to do something with it.

edamboritz commented 2 years ago

We are actively working to improve documentation, thanks for the feedback.