Sefaria / Sefaria-Project

New Interfaces for Jewish Texts
https://www.sefaria.org

DB Design: meta-info about texts #34

Closed blockspeiser closed 10 years ago

blockspeiser commented 12 years ago

We need to know more about the texts we have and the texts we need. This involves a few sides:

  1. Knowing information about the size and structure of text in general. E.g., knowing that Bereishit contains 50 chapters and that chapter 50 of Genesis contains 26 verses.
  2. Knowing summary information about the actual texts and translations we have in the DB. E.g., being able to say that we have 100% of the Hebrew of Bereishit across 3 versions, but only 35% of Mishna Peah in English. This information may be summarized across all the texts we have.
  3. Knowing information about a particular version of a text. E.g., knowing verse by verse whether and by whom a text has been reviewed, or storing ratings for the quality of a particular translation on a segment-by-segment basis.

Collecting the information in (1) may be just as difficult as actually getting the text (e.g., counting precisely how many Rashis there are on which dafs of gemara). Handling incomplete information will be a requirement. Being able to provide size estimates will be very helpful for gauging the magnitude of our task.
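To make the "incomplete information" requirement concrete, here is a minimal sketch of how a section size could carry either an exact count or an estimate, and how a coverage percentage could be derived from it. The names `SectionSize` and `percent_available` are hypothetical and do not come from the Sefaria codebase; this is one possible shape, not a proposal for the final schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SectionSize:
    """Hypothetical record of how many segments a section should contain."""
    known: Optional[int] = None     # exact count, once someone has verified it
    estimate: Optional[int] = None  # rough guess to use until the exact count is known

    def best_guess(self) -> Optional[int]:
        # Prefer the verified count; fall back to the estimate; else None.
        return self.known if self.known is not None else self.estimate


def percent_available(available: int, size: SectionSize) -> Optional[float]:
    """Return the percentage of segments we have, or None if the size is unknown."""
    total = size.best_guess()
    if not total:
        return None  # no size information (or zero), so no percentage can be given
    # Cap at 100 in case an estimate turns out to be too low.
    return min(100.0, 100.0 * available / total)


# Chapter 50 of Genesis has 26 verses (from the example in bullet 1).
genesis_50 = SectionSize(known=26)
print(percent_available(26, genesis_50))

# A section where only a rough estimate exists (numbers are made up).
rough = SectionSize(estimate=40)
print(percent_available(10, rough))
```

Returning `None` instead of 0% keeps "we have nothing" distinct from "we do not know how much there is", which matters when summarizing across many texts.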

blockspeiser commented 11 years ago

Some of this work has been accomplished (in particular bullet 2), but I'm leaving this open because someone with more experience in DB design than I have could still be very helpful by taking a look and suggesting design improvements before we move on to storing segment-level review status info.

For bullet 2, look in sefaria/counts.py. A collection called counts is now being created which stores a jagged array that matches the structure of the jagged array of the text itself. In place of strings as terminals, a count document stores an integer for the number of available versions of that segment in Hebrew and English.
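As a rough illustration of the idea (not the actual code in sefaria/counts.py), the sketch below merges the jagged arrays of several versions into one jagged array of integers with the same shape. The function name `count_versions` and the sample data are assumptions made for this example, and it treats an empty or whitespace-only string as "segment not available".

```python
def count_versions(versions):
    """Merge jagged arrays of text into one jagged array of version counts.

    `versions` is a list of jagged arrays (nested lists of strings), one per
    version of the same text. The result mirrors their combined structure,
    with each string terminal replaced by the number of versions that have
    non-empty text at that position.
    """
    if all(not isinstance(v, list) for v in versions):
        # Terminal level: count the versions that actually contain text here.
        return sum(1 for v in versions if isinstance(v, str) and v.strip())

    # Section level: recurse position by position, padding shorter versions
    # with None so that missing segments contribute nothing to the count.
    sections = [v for v in versions if isinstance(v, list)]
    length = max(len(s) for s in sections)
    return [
        count_versions([s[i] if i < len(s) else None for s in sections])
        for i in range(length)
    ]


# Two made-up Hebrew versions of a two-chapter text.
hebrew_a = [["a", "b", "c"], ["d", "e"]]
hebrew_b = [["a", "", "c"]]  # second verse empty, second chapter missing

print(count_versions([hebrew_a, hebrew_b]))  # [[2, 1, 2], [1, 1]]
```

Running the same function once over the Hebrew versions and once over the English versions would give the two sets of counts described above; how those are laid out inside a single count document is left open here.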