Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

Auto-Linker Citation Formats #577

Open MDjavaheri opened 4 years ago

MDjavaheri commented 4 years ago

Basically, the linker could be optimized with aliases and more accurate targeting to more correctly link sources to Sefaria. Many times, Sefarim on Shas and Shulchan Aruch are not linked correctly, because the linker does not target anything more than the Masechet and Daf Number/Chelek of Shulchan Aruch and Siman:Seif. Plus, there are some sefarim that just don't get linked.

When citing a sefer that follows the order of Shulchan Aruch, such as "Kitzur Shulchan Aruch Orach Chaim 7:1" the linker always assumes the sefer is Shulchan Aruch, and only includes "Orach Chaim 7:1" in the link. Similarly, Kaf HaChaim Orach Chaim 47:34, only links from Orach Chaim and on, leading to an incorrect link.

Could someone rewrite the citation regexes to account for situations like this? That is, include some negative lookbehinds to ignore matches that have certain sefer names before them. For example: (?<!Kitzur|Aruch HaShulchan|Kaf HaChaim|Taz|Be'er Heitev)\s?(Shulchan Aruch )?(Orach Chaim|Yoreh Deah|Even HaEzer|Choshen Mishpat) \d+(:\d+)+
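To make the lookbehind idea concrete, here is a rough Python sketch. It is not Sefaria's linker code. One caveat: Python's `re` module only accepts fixed-width lookbehinds, so a single lookbehind over alternatives of different lengths is rejected, and the sketch chains one lookbehind per excluded name instead. `EXCLUDED_PREFIXES`, `SECTIONS`, and `build_pattern` are hypothetical names, and the exclusion list is just the one from the example above.

```python
import re

# Hypothetical exclusion list, copied from the example regex above.
EXCLUDED_PREFIXES = ["Kitzur", "Aruch HaShulchan", "Kaf HaChaim", "Taz", "Be'er Heitev"]
SECTIONS = "Orach Chaim|Yoreh Deah|Even HaEzer|Choshen Mishpat"


def build_pattern(prefixes):
    # Python's re needs fixed-width lookbehinds, so emit one per prefix.
    # The second variant covers "<prefix> Shulchan Aruch <section>", which
    # would otherwise still match starting at the section name.
    lookbehinds = "".join(
        f"(?<!{re.escape(p)} )(?<!{re.escape(p)} Shulchan Aruch )" for p in prefixes
    )
    return re.compile(
        rf"{lookbehinds}(?:Shulchan Aruch )?(?:{SECTIONS}) \d+(?::\d+)+"
    )


PATTERN = build_pattern(EXCLUDED_PREFIXES)


def find_shulchan_aruch_refs(text):
    """Return the matched citation strings that are not preceded by an excluded name."""
    return [m.group(0) for m in PATTERN.finditer(text)]
```

With this, `find_shulchan_aruch_refs("Kitzur Shulchan Aruch Orach Chaim 7:1")` comes back empty, while a plain "Shulchan Aruch Orach Chaim 47:1" is still matched in full.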

Cases to consider would be:

The same is true for Mefarshei HaShas, such as Tosafot Berachot 11a.

It would also be great if the linker could be updated to support the following citation formats.

Take a look at https://halachipedia.com/index.php?title=Birchot_HaTorah for plenty of examples.

EliezerIsrael commented 4 years ago

Thanks for the feedback.

We can certainly reduce the number of false matches, and catch more that we don't yet catch at all. I don't agree with that lookbehind strategy, though. We have many (most?) of the books cited this way, and all of the ones mentioned as examples. We should aim to capture as many as we can. Maintaining a "don't match" list seems like giving up on those.

Many of these come down to us being overreliant on commas. That's the low-hanging fruit.
For example, and relating to your example - Kaf HaChaim, Orach Chaim 47:34 does match, but Kaf HaChaim Orach Chaim 47:34 does not.
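A minimal Python sketch of what relaxing the comma requirement could look like. This is an illustration only and not the linker's actual matching logic; `TITLES`, `SECTIONS`, and `parse_citation` are hypothetical names, and the title list is a tiny stand-in for the real index.

```python
import re

# Hypothetical stand-ins for the real title index.
TITLES = ["Kaf HaChaim", "Shulchan Aruch", "Kitzur Shulchan Aruch"]
SECTIONS = ["Orach Chaim", "Yoreh Deah", "Even HaEzer", "Choshen Mishpat"]

# Longest titles first, so "Kitzur Shulchan Aruch" is preferred over "Shulchan Aruch".
title_alt = "|".join(re.escape(t) for t in sorted(TITLES, key=len, reverse=True))
section_alt = "|".join(re.escape(s) for s in SECTIONS)

# The comma between title and section is optional (",?").
CITATION = re.compile(
    rf"\b(?P<title>{title_alt}),?\s+(?P<section>{section_alt})\s+(?P<ref>\d+(?::\d+)*)"
)


def parse_citation(text):
    """Return (title, section, ref) for the first citation found, or None."""
    m = CITATION.search(text)
    if m is None:
        return None
    return m.group("title"), m.group("section"), m.group("ref")
```

Both `parse_citation("Kaf HaChaim, Orach Chaim 47:34")` and `parse_citation("Kaf HaChaim Orach Chaim 47:34")` resolve to the same title, section, and reference.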

I've added these examples to a test document that we use: https://github.com/Sefaria/Sefaria-Project/blob/master/data/linker_test.html

MDjavaheri commented 4 years ago

Thanks for the response.

I hear your point about the lookbehind. Wow, what a difference a comma can make! ישועת ה' כהרף עין! (God's salvation comes in the blink of an eye!) At the same time, two issues arise. First, commas can get numerous, so parentheses, as in Shulchan Aruch (Orach Chaim 47:1), can be preferred. Second, while the comma captures the Kaf HaChaim part, one still has to add a :1 to the end (Kaf HaChaim, Orach Chaim 47:34:1) to get it to show properly.

MDjavaheri commented 4 years ago

Also, @ikesultan notes how Kaf HaChaim 46:49:1 is live on https://halachipedia.com/index.php?title=Birchot_HaShachar#cite_note-48, but I would point out this is ambiguous. Kaf HaChaim covers the first 119 Simanim of Yoreh Deah also.

MDjavaheri commented 4 years ago

And stam, aliases for different spellings of a sefer name will be helpful for things like Mishna Brurah vs. Mishna Berura vs. Mishnah Berurah. Same for HaChayim and HaChaim

nsantacruz commented 4 years ago

Regarding aliases, the linker already takes into account aliases. The only issue is finding all of the aliases for a given book. In your example above, we have Mishna Brurah as an alias but not Mishna Berura.
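As a rough illustration of the alias idea (not how Sefaria actually stores or resolves aliases), a lookup from spelling variants to one canonical title could be sketched in Python as below. The `ALIASES` table, the choice of canonical spellings, and the function names are all assumptions made for the example.

```python
# Hypothetical alias table; the canonical spellings chosen here are assumptions.
ALIASES = {
    "Mishnah Berurah": ["Mishna Brurah", "Mishna Berura"],
    "Kaf HaChaim": ["Kaf HaChayim"],
}


def build_lookup(aliases):
    """Map every known spelling (case-insensitively) to its canonical title."""
    lookup = {}
    for canonical, variants in aliases.items():
        for spelling in [canonical, *variants]:
            lookup[spelling.casefold()] = canonical
    return lookup


LOOKUP = build_lookup(ALIASES)


def canonical_title(name):
    """Return the canonical title for a spelling, or None if it is unknown."""
    normalized = " ".join(name.split()).casefold()  # collapse extra whitespace
    return LOOKUP.get(normalized)
```

Under this sketch, adding a missing spelling such as Mishna Berura is a one-line change to the table; the hard part, as noted above, is collecting the variants in the first place.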

Regarding your original point, as @EliezerIsrael mentioned, the linker is currently a bit rigid. It misses certain obvious deviations from the format we're expecting that humans easily recognize. This is something we hope to fix in the future, although the exact solution isn't obvious.

MDjavaheri commented 4 years ago

Ok, thank you, Noah. תזכו למצוות! (May you merit to do mitzvot!)

MDjavaheri commented 4 years ago

I'm happy to post more suggestions here if you guys are open to it.

nsantacruz commented 4 years ago

We're happy to have more suggestions! Feel free to open other issues as relevant.

MDjavaheri commented 4 years ago

Nekudat HaKesef = Nekudot HaKasef (that's the correct spelling)