Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

tzuras hadaf #72

Open ndavidovics opened 10 years ago

ndavidovics commented 10 years ago

Perhaps this is working backward, but is there anything we can do to place the text we have into the form of the page of Talmud, or into the form of the Mishneh Torah, or into a nice format for Tanach with commentaries in the same page view? There is a certain familiarity with this format, and it does have some advantages over clicking to a new link.

Would we have to identify every new line and width for every page of Talmud? Has this been done before? Is it worthwhile to make a building tool for this to allow others to collaborate and create the format?

blockspeiser commented 10 years ago

This has come up a number of times. It's certainly on the list of features we want, but has not been prioritized yet as something we are actually working on.

It would be a relatively major project. I don't know of any public source that has daf layout information for, e.g., Talmud Bavli. A number of sites clearly have it (http://www.themercava.com/dafyomi), but they haven't shared the data in a way we could reuse.

Originally we thought that Daf line breaks would be the best segments to use generally for Sefaria (http://www.sefaria.org/Berakhot.2a is still broken up according to daf lines, although you can't really see it) - but we are now working with Koren to use their semantic segments instead.

It would be possible to make another copy of the Bavli and use the Sefaria segmenting system as a way to crowdsource the generation of this info - but I wonder if an approach using OCR might not be better. In any case, given the architecture we currently have, it would be a big project to support both segmentations simultaneously.

adardesign commented 9 years ago

It can be tackled via CSS with the new shape-outside property (for modern browsers) See here: http://www.html5rocks.com/en/tutorials/shapes/getting-started/

But it's not so simple, as we'll need data for line breaks, and it's not at all responsive-friendly.

yehosef commented 9 years ago

It's possible to automatically generate this layout based on the known text for the page - but it's not 100% accurate because of kerning, etc., and requires manual tweaking. The other option is to do it manually (which is the only real way to be accurate). You just need to build an interface where you can see the Rashi, Gemara, and Tosfos next to each other and click on the words that represent the beginning/end of lines. You can also set the span width for each line if you can't figure it out automatically based on the other texts. You then crowdsource it - while the interface requires a little effort to build, the actual crowdsourcing effort is easy.
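To make the data side of this concrete, here is a minimal sketch in Python of what such an interface might store. Everything in it is hypothetical (the `LineSpan` type, the function names, the placeholder words); it only shows that a list of clicked line-start word indices is enough to recover every line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LineSpan:
    """One printed line as word indices into the text (end is inclusive)."""
    start: int
    end: int


def lines_from_clicks(word_count: int, line_starts: List[int]) -> List[LineSpan]:
    """Turn the word indices clicked as line beginnings into line spans."""
    if word_count == 0:
        return []
    # Ignore out-of-range clicks; the first line always begins at word 0.
    starts = sorted({0, *(s for s in line_starts if 0 <= s < word_count)})
    spans = []
    for i, start in enumerate(starts):
        # A line ends just before the next line starts, or at the last word.
        end = starts[i + 1] - 1 if i + 1 < len(starts) else word_count - 1
        spans.append(LineSpan(start, end))
    return spans


def render_lines(words: List[str], spans: List[LineSpan]) -> List[str]:
    """Join the words of each span back into one string per line."""
    return [" ".join(words[s.start:s.end + 1]) for s in spans]


# Placeholder words stand in for the real text of a daf.
words = ["w0", "w1", "w2", "w3", "w4", "w5", "w6"]
spans = lines_from_clicks(len(words), [3, 5])
lines = render_lines(words, spans)
```

Storing only start indices keeps the crowdsourced record small, and the same structure could be kept separately for Gemara, Rashi, and Tosfos.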

yehosef commented 9 years ago

You can look at Mercava's html to see how you can approach it - you don't need any fancy css.

blockspeiser commented 9 years ago

@adardesign This is very interesting, I didn't know about this.

Using this would make it much easier (I think) to get a first approximation of the daf layout that contains some discrepancies, but could still be very useful most of the time.

If we had a description of the geometry of each page, we could just dump our text (Bavli, Rashi, Tosafot) into these layout boxes. If sized correctly, they should come out roughly correct, though I'd expect line breaks to not always match exactly (but I've found discrepancies in exact word placement per line among printed editions of the Vilna daf as well).

Mercava's approach is much more precise (it seems) but it requires much more granular data (mapping of individual line text to geometry) and is more complex to align with the segmentation in Sefaria.

adardesign commented 9 years ago

Yeah, makes sense. Let's collaborate on that. Don't forget the browser support (http://caniuse.com/#search=shape-outside), which still looks too colorful.

bachrach44 commented 9 years ago

Some other related projects to look at, and perhaps use: https://code.google.com/p/tzura/ http://mekorot.sourceforge.net/ http://books.613m.org/

Yizchok commented 7 years ago

HebrewBooks and others are using OCR, but their text is not exact. Sefaria has exact text. There should be a way to check the inexact OCR text against the Sefaria text, and then Sefaria could have a Tzuras Hadaf option.

ckoppelman commented 4 years ago

I was thinking about this with a friend the other day. Our first-draft thought experiment had something like this for the process (where a minimum viable product was just the Mishnah/Talmud, not commentary):

  1. OCR the daf layout
  2. Iterate through the Sefaria text and use probability to match two- or three-word tuples with the OCR results.

Instead of being cute and calculating the CSS borders of the daf layout, the daf layout would be an image. Sefaria text sentences would map to pixels of the daf. One advantage of this is that the punctuation in the Sefaria text would not interfere with the layout. Another is that it's relatively easy to apply the same process to the other commentaries on the page (assuming there's a decent Rashi-script OCR out there).
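A rough sketch of step 2, in Python. It swaps in a simple fuzzy string similarity from the standard library's `difflib` in place of a real probability model, and the function names, the `window` parameter, and the sample words are all made up for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Sequence, Tuple


def similarity(a: str, b: str) -> float:
    """Fuzzy similarity between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def locate_tuple(clean_words: List[str], ocr_tuple: Sequence[str],
                 start: int = 0, window: int = 50) -> Tuple[int, float]:
    """Find where a noisy OCR word tuple best matches the clean text.

    Searches clean-text positions from `start` to `start + window` and
    returns (index, score); the index is -1 if nothing matched at all.
    """
    n = len(ocr_tuple)
    target = " ".join(ocr_tuple)
    best_idx, best_score = -1, 0.0
    stop = min(len(clean_words) - n, start + window)
    for i in range(start, stop + 1):
        score = similarity(" ".join(clean_words[i:i + n]), target)
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score


# Toy data: the "OCR" tuple has two misread letters.
clean = "alpha beta gamma delta epsilon zeta".split()
idx, score = locate_tuple(clean, ["gamna", "delta", "epsllon"])
```

Once a tuple is located, the pixel box the OCR engine reported for it could be attached to the matching Sefaria words, and `start` advanced so later tuples are searched in order.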

EliezerIsrael commented 4 years ago

Check out https://talmud.dev/

yehosef commented 4 years ago

@EliezerIsrael - cool, thanks for sharing. Is the code available somewhere? I would like to see the tzuras hadaf engine be open-source so we can all contribute.

@ckoppelman - the problem with pixels is that it's harder to make it really interactive, I think. For example, I want to be able to change the color of a font or its underline or background based on the word (it's a Tanna or Amora or a pasuk, etc.)

About text lineup - One approach I'd thought about was using canvas to line up the HTML with an image so that you can adjust the HTML, capture the superimposition of the image and text and then adjust the CSS until you minimize the differences (the number of dark pixels).
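As a toy illustration of that minimization (in Python rather than browser canvas code, with images as lists of 0/1 pixels where 1 is dark), the idea might look like the sketch below. `render` stands in for "draw the HTML with this CSS setting and capture it"; all names here are hypothetical.

```python
from typing import Callable, Iterable, List

Image = List[List[int]]  # rows of pixels, 1 = dark, 0 = light


def dark_pixel_difference(rendered: Image, scan: Image) -> int:
    """Count pixels that are dark in exactly one of two same-sized images."""
    return sum(r != s
               for row_r, row_s in zip(rendered, scan)
               for r, s in zip(row_r, row_s))


def best_setting(candidates: Iterable[int],
                 render: Callable[[int], Image],
                 scan: Image) -> int:
    """Pick the candidate setting whose rendering differs least from the scan."""
    return min(candidates, key=lambda c: dark_pixel_difference(render(c), scan))


def render_bar(offset: int, width: int = 5, height: int = 3) -> Image:
    """Toy renderer: a single dark column at the given horizontal offset."""
    return [[1 if x == offset else 0 for x in range(width)] for _ in range(height)]


scan = render_bar(2)                         # pretend this is the scanned daf
chosen = best_setting(range(5), render_bar, scan)
```

A real version would search over several settings at once (font size, word spacing, offsets), which is where this gets expensive.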

But I think for the most part, that's not needed. I think it's not critical that it be pixel aligned, but that each line of Gemara/Rashi/Tosfos will have the correct starting and ending words. Know that on line 1 of Rashi, it goes from word A to word B and on the second line it is from B+1 to C, etc. Then the goal of the render is to make it fit on the line it's supposed to (sometimes you need to change the word-spacing/kerning, sometimes make the font on that line 1px less, etc. to make it fit - I had a POC years ago in jQuery - but I'd have to find the code..)
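This is not that POC, just a small Python sketch of the fitting arithmetic under some assumptions: word widths are measured once at a base font size, they scale linearly with font size, and the `min_gap` and `min_font_px` limits are arbitrary values chosen for the example.

```python
from typing import List, Tuple


def fit_line(word_widths: List[float], target_width: float, font_px: int,
             min_gap: float = 2.0, min_font_px: int = 10) -> Tuple[int, float]:
    """Return (font size in px, gap between words in px) so a line fills its width.

    word_widths are measured at font_px. If the words need a gap smaller
    than min_gap to fit, the font is shrunk 1px at a time down to min_font_px.
    """
    gaps = len(word_widths) - 1
    if gaps < 1:
        return font_px, 0.0  # a single word has no gaps to adjust
    size = font_px
    while True:
        text_width = sum(word_widths) * size / font_px
        gap = (target_width - text_width) / gaps
        if gap >= min_gap or size <= min_font_px:
            return size, gap
        size -= 1


# Three 40px-wide words measured at 20px, fitted to two different line widths.
roomy = fit_line([40, 40, 40], target_width=130, font_px=20)
tight = fit_line([40, 40, 40], target_width=118, font_px=20)
```

The returned values would map onto per-line `font-size` and `word-spacing` styles; a negative gap at the minimum font size signals a line that needs manual attention.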

nsantacruz commented 4 years ago

@yehosef The people behind talmud.dev are planning on open sourcing it by the end of the summer. Their names are Shaun Regenbaum (@Shaun-Regenbaum) and Dan Jutan. Feel free to reach out to them about it.

yehosef commented 4 years ago

@nsantacruz Thanks!

Jutanium commented 4 years ago

Hey, feel free to ask any questions here or by email.