Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

Identifying / Highlight subsections of a Segment #1359

Open EliezerIsrael opened 1 year ago

EliezerIsrael commented 1 year ago

In the context of a thread on refining our text API, this request came up -

We use the text API extensively to link to content that we do not have natively in our app. We make a text API call and present the content in a local popup within our app. The problem is that the text returned is often very long and not precise enough to help a person zero in on the relevant part of the reference.

The way we deal with this for internal links to our own content is that, besides the Page (the main entry in our database), we also include the IDs of a range of phrases, a range of words, or a list of multiple phrases or words, and our internal engine returns the "Page" with those words or phrases highlighted. That way you can see what the author was referring to in the context of the whole source.

In our internal texts we have IDs either on every word or on phrases. The Sefaria texts do not have IDs at that level.

It would be nice if we could specify words 20-25 within a source and have those words wrapped in a tag as the selection that we could render as we please.
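A minimal sketch of what "wrap words 20-25 in a tag" could look like on the client side. The function name `wrap_word_range` and the `<mark>` tag are illustrative choices, not part of the Sefaria API, and the sketch assumes the segment is plain text with words separated by whitespace (segments that already contain HTML would need more care).

```python
import re


def wrap_word_range(text: str, start: int, end: int, tag: str = "mark") -> str:
    """Wrap words start..end (1-indexed, inclusive) of `text` in <tag>...</tag>.

    A "word" here is any run of non-whitespace characters (an assumption);
    the original whitespace between words is preserved.
    """
    words = list(re.finditer(r"\S+", text))
    if not (1 <= start <= end <= len(words)):
        raise ValueError(f"word range {start}-{end} is outside 1-{len(words)}")
    first_char = words[start - 1].start()  # offset where the first selected word begins
    last_char = words[end - 1].end()       # offset just past the last selected word
    return f"{text[:first_char]}<{tag}>{text[first_char:last_char]}</{tag}>{text[last_char:]}"
```

For example, `wrap_word_range("one two three four five", 2, 3)` wraps "two three" and leaves the rest of the segment untouched.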

Originally posted by @mayerpasternak in https://github.com/Sefaria/Sefaria-Project/issues/1343#issuecomment-1521953311

EliezerIsrael commented 1 year ago

Some follow up from @mayerpasternak -

Word-level highlighting could be done externally, but your website does not show word numbers, so our scholars are working blind unless we create a new interface to your texts that exposes the number of each word. We would also need to take your response from the text API and assign numbers to each word to create the highlighting. We could possibly build all of this outside of your system, but I suspect there are other users who could benefit from this, so it might make sense to add it as an option instead of everyone building their own system.

EliezerIsrael commented 1 year ago

Thinking about the best way to highlight subphrases of a segment in Sefaria. Our architecture has an inherent limitation: individual words do not have their own IDs. Given that, I see two paths (are there more?), each with some brittle elements.

  1. Specify range of word numbers
  2. Specify start and end phrases (or even the entire subphrase)

Option 1 is brittle in that the segments on Sefaria may change. Even a minor edit (adding a dash or a missing word) will throw off the word counts. Some texts in our system are well defined and locked, but others do receive edits as we get corrections in, find better source material, etc.

Option 2 is brittle for the same reason, but less so. If the text of the opening or closing phrase changes, the match can be thrown off. But in this case, it will be obvious that it is not correct, and the error can be handled. In the case of Option 1, the error in word count would pass silently.

Option 2 has another downside: there may be multiple matches to an opening and closing phrase. Presumably, in all realistic cases, one can choose a long enough string to uniquely identify the passage.

Following this line of thinking, I would imagine that a request for a segment with opening and closing phrases of the subsection to be highlighted would be the best approach.
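To make the trade-off concrete, here is a rough sketch of Option 2 under stated assumptions: the segment is plain text, matching is exact, and the function name `highlight_between` and the `<mark>` tag are hypothetical rather than anything in the current API. The point it illustrates is that a missing or ambiguous phrase raises an error instead of passing silently.

```python
def highlight_between(text: str, opening: str, closing: str, tag: str = "mark") -> str:
    """Wrap the span from `opening` through `closing` in <tag>...</tag>.

    Raises LookupError when either phrase is missing or ambiguous, so an
    edited segment fails loudly instead of highlighting the wrong words.
    Passing the same string for both phrases highlights that whole subphrase.
    """
    if not opening or not closing:
        raise ValueError("opening and closing phrases must be non-empty")
    n_open = text.count(opening)
    if n_open != 1:
        raise LookupError(f"opening phrase matched {n_open} times, expected 1")
    start = text.index(opening)
    # Only occurrences at or after the opening phrase count as candidates.
    n_close = text.count(closing, start)
    if n_close != 1:
        raise LookupError(f"closing phrase matched {n_close} times after the opening, expected 1")
    end = text.index(closing, start) + len(closing)
    if end < start + len(opening):
        raise LookupError("closing phrase ends inside the opening phrase")
    return f"{text[:start]}<{tag}>{text[start:end]}</{tag}>{text[end:]}"
```

With the segment "In the beginning God created the heaven and the earth", an opening phrase of "God created" and a closing phrase of "the heaven" select a unique span, while an opening phrase of "the" is rejected as ambiguous.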

Interested to hear feedback.

ronshapiro commented 1 year ago

Option 3: pass the text that you want highlighted, together with start and end {character counts, word counts, or percentage counts}.

I have implemented this for talmud.page. For the same reasons highlighted above, I save as much spatial context as I can and hope that I can later find the right placement even if the text changes.

You can make this more robust by having a normalization step (remove punctuation, vocalization/trope). You could even use an edit-distance algorithm with a cutoff to find the best-effort match.
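A rough sketch of that idea, not the talmud.page implementation. It assumes whitespace-delimited words, strips Unicode combining marks (which covers Hebrew vowel points and trope) and punctuation, and uses the similarity ratio from Python's `difflib.SequenceMatcher` as a stand-in for a true edit-distance metric. The names `normalize` and `find_best_match` and the 0.8 cutoff are illustrative.

```python
import difflib
import unicodedata


def normalize(word: str) -> str:
    """Lowercase and drop combining marks (category Mn) and punctuation (P*)."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = [
        ch for ch in decomposed
        if unicodedata.category(ch) != "Mn"
        and not unicodedata.category(ch).startswith("P")
    ]
    return "".join(kept).lower()


def find_best_match(text: str, phrase: str, cutoff: float = 0.8):
    """Find the run of words in `text` most similar to `phrase`.

    Returns (start, end, score) with 0-based word indices and `end`
    exclusive, or None when no window scores at least `cutoff`.
    """
    words = [normalize(w) for w in text.split()]
    target = " ".join(normalize(w) for w in phrase.split())
    n = len(phrase.split())
    best = None
    # Try windows of the same length first, then one word shorter or longer,
    # to tolerate a word having been added to or dropped from the text.
    for size in (n, n - 1, n + 1):
        if size < 1:
            continue
        for i in range(len(words) - size + 1):
            window = " ".join(words[i:i + size])
            score = difflib.SequenceMatcher(None, window, target).ratio()
            if best is None or score > best[2]:
                best = (i, i + size, score)
    if best is None or best[2] < cutoff:
        return None
    return best
```

The returned word indices refer to `text.split()`, so they can be combined with stored start/end counts to prefer the candidate closest to the original position.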

But word-level IDs would be the best. Probably the best way to do this is to emit IDs as span tags without any styling (similar to the <i data-commentator="..."> approach for Shulchan Arukh). And then you could pass those IDs back. This would probably require some good tooling for the Sefaria team that manages text updates, though.
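As a sketch of what emitting unstyled span tags might look like, assuming plain-text segments and a hypothetical `data-word` attribute (nothing here reflects an existing Sefaria feature). The IDs below are just positional counters, so they would shift whenever a segment is edited; truly stable IDs would have to be stored with the text and maintained by the tooling mentioned above.

```python
import html
import itertools
import re


def add_word_spans(text: str, attr: str = "data-word") -> str:
    """Wrap each whitespace-delimited word in an unstyled <span> with a 1-based ID.

    The attribute name is an illustrative assumption. Word content is
    HTML-escaped because the input is treated as plain text.
    """
    counter = itertools.count(1)

    def wrap(match: re.Match) -> str:
        word_id = next(counter)
        return f'<span {attr}="{word_id}">{html.escape(match.group(0))}</span>'

    return re.sub(r"\S+", wrap, text)
```

A client could then send back the IDs of the first and last selected spans instead of word counts.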

mayerpasternak commented 1 year ago

Yes, word-level IDs are the best option, IF and only IF the text is stable.

We have word tags on the Vilna Talmud page and could only put them in place once the text had stabilized.

But you definitely need to figure out how to handle changes, and what is considered a word and what is not, especially with citations and other non-commentary words.

We decided never to put word IDs on our Talmud notes, because they are constantly being adjusted for proper sources, so it's a moving target.

So if the Sefaria texts are still evolving, straight word IDs are going to be a problem.

And change management is an issue. We just discovered hundreds of links to Ritva that were broken or inaccurate because there was a "special project" to redo them and all the breakdowns changed.

That is a huge problem. It definitely gives us pause about the stability of linking to Sefaria.

We definitely wouldn't invest in highlighting words unless there was a process that was stable.
