Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org
657 stars 271 forks source link

Nested lists in /api/texts result #547

Open ronshapiro opened 4 years ago

ronshapiro commented 4 years ago

(from https://groups.google.com/d/msg/sefaria-dev/b42MEa6Yo9Y/A79Qu3e2BQAJ)

https://www.sefaria.org/api/texts/Shabbat.74b?commentary=1 returns a bizarre structure for the comment with id=5b0d0d62f746f30055475147 (Rosh on Shabbat 6:21-8:1).

The he attribute, which is normally just a string or a list of strings, returns a list whose elements are sometimes strings, and sometimes nested lists. Some of the nested lists are one string, and some are two strings. It seems like the first element is a string for the Rosh on Shabbat 6:21 reference, then each of the nested lists that follow are the values for Rosh on Shabbat 7:*, and the lists have multiple items if there are multiple segments in those refs. And then all of the individual values of Rosh on Shabbat 8:1 are individual list items in the top level list.

Is this expected? I haven't seen anything else like this. I thought the he element was only ever a string or a list of strings. I haven't seen anything else like this so far.

The text attribute at least seems like a consistent list of lists. Most are empty, but my guess is that they each of the nested lists maps to a single ref.

@EliezerIsrael: You're right, this looks like a bug.
Looks like the error is in the function get_links in sefaria.client.wrapper, and impacts references that range over many sections (many super sections?)

JonMosenkis commented 4 years ago

Good catch! Thanks! We'll try and get this fixed.

EliezerIsrael commented 4 years ago

I dug into this a little bit. Here's what I learned.

This bug doesn't impact our production server, because we're not using the links api to get text anymore. We get the list of links (we happen to use the related api), and then we load the text of the links closer to when they're viewed. Loading all of the link text for a section at once was too heavy.

That's how we managed to overlook this, I believe.

When we load the text with the texts API, it gets returned well. The error is a result of some optimization techniques in the bulk loading of texts-from-links that misses some edge cases.

The fix for the error isn't immediately obvious to me.

Details

It boils down to the behavior of Ref.split_spanning_ref.
According to its documentation it has this expected behavior:

            >>> Ref("Shabbat 13b:3 - 14b:3").split_spanning_ref()
            [Ref('Shabbat 13b:3-50'), Ref('Shabbat 14a'), Ref('Shabbat 14b:1-3')]

When getting text for the ranged reference, get_links() splits the reference using split_spanning_ref(), gets text for each subref, then joins them together.

In [1]: Ref("Rosh on Shabbat 6:21-8:1").split_spanning_ref()                                                                                                   

Out[1]: 
[Ref('Rosh on Shabbat 6:21'),
 Ref('Rosh on Shabbat 7'),
 Ref('Rosh on Shabbat 8:1')]

The first and last return a list of strings, with one element. The middle ref returns a list of lists of strings.

All three of those get put together into a list three elements long.

ronshapiro commented 4 years ago

What about looping subrefs() for each of split_spanning_ref()?

Or just flattening the array at the top level?

bandleader commented 3 years ago

In the data returned from (for example) https://www.sefaria.org/api/texts/Genesis.32.4-36.43?commentary=1 (which is also a spanning ref) Not only do .he and .text return an array of arrays (as detailed above), but so does .commentary.

(This is not the case with the Shabbat example response linked by the OP)

"commentary": [
  [
    {"index_title": "Ibn Ezra on Genesis" ... },
    ... more commentaries
  ],
  ... more arrays of commentaries
]