Open ronshapiro opened 4 years ago
Good catch! Thanks! We'll try and get this fixed.
I dug into this a little bit. Here's what I learned.
This bug doesn't impact our production server, because we're not using the links api to get text anymore. We get the list of links (we happen to use the related
api), and then we load the text of the links closer to when they're viewed. Loading all of the link text for a section at once was too heavy.
That's how we managed to overlook this, I believe.
When we load the text with the texts API, it gets returned well. The error is a result of some optimization techniques in the bulk loading of texts-from-links that misses some edge cases.
The fix for the error isn't immediately obvious to me.
Details
It boils down to the behavior of Ref.split_spanning_ref
.
According to its documentation it has this expected behavior:
>>> Ref("Shabbat 13b:3 - 14b:3").split_spanning_ref()
[Ref('Shabbat 13b:3-50'), Ref('Shabbat 14a'), Ref('Shabbat 14b:1-3')]
When getting text for the ranged reference, get_links()
splits the reference using split_spanning_ref()
, gets text for each subref, then joins them together.
In [1]: Ref("Rosh on Shabbat 6:21-8:1").split_spanning_ref()
Out[1]:
[Ref('Rosh on Shabbat 6:21'),
Ref('Rosh on Shabbat 7'),
Ref('Rosh on Shabbat 8:1')]
The first and last return a list of strings, with one element. The middle ref returns a list of lists of strings.
All three of those get put together into a list three elements long.
What about looping subrefs()
for each of split_spanning_ref()
?
Or just flattening the array at the top level?
In the data returned from (for example) https://www.sefaria.org/api/texts/Genesis.32.4-36.43?commentary=1
(which is also a spanning ref)
Not only do .he
and .text
return an array of arrays (as detailed above),
but so does .commentary
.
(This is not the case with the Shabbat example response linked by the OP)
"commentary": [
[
{"index_title": "Ibn Ezra on Genesis" ... },
... more commentaries
],
... more arrays of commentaries
]
(from https://groups.google.com/d/msg/sefaria-dev/b42MEa6Yo9Y/A79Qu3e2BQAJ)
https://www.sefaria.org/api/texts/Shabbat.74b?commentary=1 returns a bizarre structure for the comment with id=5b0d0d62f746f30055475147 (
Rosh on Shabbat 6:21-8:1
).The
he
attribute, which is normally just a string or a list of strings, returns a list whose elements are sometimes strings, and sometimes nested lists. Some of the nested lists are one string, and some are two strings. It seems like the first element is a string for the Rosh on Shabbat 6:21 reference, then each of the nested lists that follow are the values for Rosh on Shabbat 7:*, and the lists have multiple items if there are multiple segments in those refs. And then all of the individual values of Rosh on Shabbat 8:1 are individual list items in the top level list.Is this expected? I haven't seen anything else like this. I thought the
he
element was only ever a string or a list of strings. I haven't seen anything else like this so far.The
text
attribute at least seems like a consistent list of lists. Most are empty, but my guess is that they each of the nested lists maps to a single ref.@EliezerIsrael: You're right, this looks like a bug.
Looks like the error is in the function
get_links
insefaria.client.wrapper
, and impacts references that range over many sections (many super sections?)