alerque / stack-verse-mapper

Index Bible verse references in Stack Exchange data dumps.
https://alerque.github.io/stack-verse-mapper
GNU Lesser General Public License v3.0

Support Hebrew book names #13

Open curiousdannii opened 8 years ago

curiousdannii commented 8 years ago

The Judaism site frequently uses Hebrew book names, so in the future it would be good to support them.

It looks like they sometimes also refer to passages by the Parashah title rather than the book title. Mapping the parashas-/parashat- tags will be easy, but I don't know whether we'll be able to match references in post bodies (if there actually are any).

A related issue: Do most posters on the Judaism site use the Hebrew versification scheme? Can we deal with that somehow?

alerque commented 8 years ago

The BCV parser we're using has Hebrew book name patterns and there is a Hebrew version of the final module. We happen to be using the English version of the module, but we can mix and match. There is also a multi-lang version and we can compile our own combinations of languages.

The hard part is probably not processing the input and building the index, but making the search UI friendly to searching in different versification schemes and returning results that match no matter what the original scheme was. I suggest we do some sort of detection to guess which scheme each input post uses and index all references normalized to one scheme. The search interface might then offer the desired scheme as an advanced search option and translate the search query into whatever scheme our index is normalized to.
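To make the normalization idea concrete, here is a minimal sketch in Python (the project itself runs on a JavaScript parser, so this only illustrates the shape of the idea). It assumes the index is normalized to the English scheme, which is not something decided in this thread. The `Ref` type, the `HEBREW_TO_ENGLISH` table and the `normalize` function are hypothetical names, and the table covers only two well-known divergences (Joel and Malachi) rather than a full mapping.

```python
from typing import NamedTuple


class Ref(NamedTuple):
    book: str
    chapter: int
    verse: int


# Hypothetical, deliberately incomplete table of places where the Hebrew
# versification differs from the English one. Each row reads:
# (book, hebrew_chapter, first_verse, last_verse, english_chapter, verse_offset)
HEBREW_TO_ENGLISH = [
    ("Joel", 3, 1, 5, 2, 27),    # Joel 3:1-5 (Hebrew) = Joel 2:28-32 (English)
    ("Joel", 4, 1, 21, 3, 0),    # Joel 4 (Hebrew)     = Joel 3 (English)
    ("Mal", 3, 19, 24, 4, -18),  # Mal 3:19-24 (Hebrew) = Mal 4:1-6 (English)
]


def normalize(ref: Ref, scheme: str) -> Ref:
    """Return ref expressed in the English scheme (the assumed index scheme)."""
    if scheme != "hebrew":
        return ref
    for book, chapter, first, last, eng_chapter, offset in HEBREW_TO_ENGLISH:
        if ref.book == book and ref.chapter == chapter and first <= ref.verse <= last:
            return Ref(book, eng_chapter, ref.verse + offset)
    # Most verses are numbered identically in both schemes.
    return ref
```

A search query in the Hebrew scheme would go through the same `normalize` call before hitting the index, so both sides end up in one scheme.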

curiousdannii commented 8 years ago

Ah, I didn't mean book names in the Hebrew script, but their Latin transliterations, like Shemot (and Shmot and Shemoth; they're not at all consistently transliterated).

The BCV parser seems to support different versifications if references are marked with a translation, so for posts from the Judaism site we might be able to detect links to the Tanakh, etc., and use them to switch to a different versification. We'll never be able to get it to work perfectly, but it might improve the results for not too much effort.
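A rough sketch of that link-based detection, in Python purely for illustration. The host list is an assumption about which sites posters might link to for Tanakh texts, not something measured from the data dump, and `guess_scheme` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

# Assumed examples of hosts that suggest Hebrew versification; the real list
# would have to come from looking at what Judaism-site posts actually link to.
HEBREW_HINT_HOSTS = {"mechon-mamre.org", "sefaria.org", "chabad.org"}

URL_PATTERN = re.compile(r"https?://[^\s\"'<>)]+")


def guess_scheme(post_body: str, default: str = "english") -> str:
    """Guess the versification scheme of a post from the links it contains."""
    for url in URL_PATTERN.findall(post_body):
        try:
            host = (urlparse(url).hostname or "").lower()
        except ValueError:
            continue  # malformed URL, ignore it
        if host.startswith("www."):
            host = host[4:]
        if host in HEBREW_HINT_HOSTS:
            return "hebrew"
    return default
```

The result would only be a hint: a post with no such links simply falls back to the default scheme.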

alerque commented 8 years ago

Improvement is all I would hope for anyway (and at this point I'm not even really worried about implementing this, just keeping it in mind as we make architectural decisions).

But you're right, I misunderstood your issue, and it doesn't look like we have a ready-built way to handle transliterated names. But never fear! I've been down the road of building new languages into the BCV parser and it isn't that hard. We can set up a new language, or our own variant of the Hebrew one, that has alternate book names for whatever transliterations are used in the wild.
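As a sketch of what such an alternate-name table could look like, here is a standalone Python illustration. This is not the BCV parser's own language data format, just the alias idea on its own. Only the three spellings of Shemot come from this thread; the other entries and the names `ALIASES` and `anglicize` are illustrative assumptions.

```python
import re

# Hypothetical alias table: canonical book abbreviation -> transliterations.
ALIASES = {
    "Gen": ["Bereshit", "Bereishit"],
    "Exod": ["Shemot", "Shmot", "Shemoth"],
    "Lev": ["Vayikra"],
    "Num": ["Bamidbar"],
    "Deut": ["Devarim"],
}

# Reverse lookup from lowercased alias to canonical abbreviation.
_LOOKUP = {
    alias.lower(): book for book, aliases in ALIASES.items() for alias in aliases
}

# Only match an alias that is followed by a number, so that the same word used
# as a Parashah title in running text is left alone.
_PATTERN = re.compile(
    r"\b("
    + "|".join(sorted((re.escape(a) for a in _LOOKUP), key=len, reverse=True))
    + r")\b(?=\s+\d)",
    re.IGNORECASE,
)


def anglicize(text: str) -> str:
    """Rewrite transliterated book names to abbreviations a parser might know."""
    return _PATTERN.sub(lambda m: _LOOKUP[m.group(1).lower()], text)
```

For example, `anglicize("See Shemot 3:14 and Shmot 20:2")` gives `"See Exod 3:14 and Exod 20:2"`, which could then be handed to the existing English parser.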