alerque / stack-verse-mapper

Index Bible verse references in Stack Exchange data dumps.
https://alerque.github.io/stack-verse-mapper
GNU Lesser General Public License v3.0

Can translation URLs practically be extracted from posts? #27

Open curiousdannii opened 8 years ago

curiousdannii commented 8 years ago

I've tried extracting translation URLs from posts, but links are formatted in too many problematic ways, so references get left out or their verse numbers get corrupted. Here are some examples:

<a href="http://www.mechon-mamre.org/p/pt/pt0233.htm" rel="nofollow">Exodus 33</a>:21-23a.
<a href="http://www.mechon-mamre.org/p/pt/pt0233.htm" rel="nofollow">Exodus 33</a>:21-34:5.
<a href="http://www.mechon-mamre.org/p/pt/pt0433.htm" rel="nofollow">Numbers 33</a>:1–49
<a href="http://jw.org" rel="nofollow">Gen. 21</a>: 6.
<a href="http://jw.org" rel="nofollow">Gen. 22 : </a> 6.
<a href="http://jw.org" rel="nofollow">Gen. 23 </a> :  6- 7.
<a href="http://jw.org" rel="nofollow">Gen. 24 :7- </a> 8.
<a href="http://jw.org" rel="nofollow">Gen. 25 </a>: 7 - 26: 4.
<a href="http://jw.org" rel="nofollow">Gen. 26 </a>.7 - 26.9.
<a href="http://jw.org" rel="nofollow">Gen. 27 </a>. 9 is an example of
<a href="http://www.chabad.org/library/bible_cdo/aid/16478" rel="nofollow">Esther 5</a>:4–8
<a href="http://www.mechon-mamre.org/p/pt/pt0410.htm#3">Num. 10:3</a>ff
<a href="http://www.mechon-mamre.org/p/pt/pt09a04.htm#11">I Kings 4:11,</a><a href="http://www.mechon-mamre.org/p/pt/pt09a04.htm#15">15</a>
Exodus <a href="http://mechon-mamre.org/i/t/t0221.htm">21:1</a> - <a href="http://mechon-mamre.org/i/t/t0224.htm">24:18</a>

Can anyone think of a way in which we can preserve the translations while not getting tripped up by these kinds of references?


One idea I had would be to extract the URLs but, rather than inserting the translation into the body of the post (and therefore potentially breaking up a reference), count the translations linked in each post and store the most common one. We would then apply that translation to all references in the post. This would help with the major reason why I wanted this kind of functionality in the first place, which is correctly reversifying the NJPS. Many posts cite multiple translations, and this strategy could get those wrong, but it would still be better than nothing. Thoughts?
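
To make the idea concrete, here is a rough sketch in Python. It is not the project's actual code: the host-to-translation table, its labels, and the function names `top_translation` and `strip_tags` are all made up for illustration. It counts the translations implied by each link's host, picks the most common one for the post, and strips the tags separately so that a reference like `Exodus 33</a>:21-23a` stays in one piece for the reference parser.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping from link host to a translation label.
# These hosts appear in the examples above; the labels are assumptions.
HOST_TO_TRANSLATION = {
    "mechon-mamre.org": "JPS",
    "chabad.org": "JUDAICA",
    "jw.org": "NWT",
}

HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]+)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")


def top_translation(html):
    """Return the most frequently linked translation in a post, or None."""
    counts = Counter()
    for url in HREF_RE.findall(html):
        host = (urlparse(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]
        translation = HOST_TO_TRANSLATION.get(host)
        if translation:
            counts[translation] += 1
    return counts.most_common(1)[0][0] if counts else None


def strip_tags(html):
    """Drop all tags so partially linked references are rejoined as text."""
    return TAG_RE.sub("", html)


post = (
    '<a href="http://www.mechon-mamre.org/p/pt/pt0233.htm" rel="nofollow">'
    "Exodus 33</a>:21-23a. "
    'Exodus <a href="http://mechon-mamre.org/i/t/t0221.htm">21:1</a> - '
    '<a href="http://mechon-mamre.org/i/t/t0224.htm">24:18</a> '
    '<a href="http://jw.org" rel="nofollow">Gen. 21</a>: 6.'
)

translation = top_translation(post)  # applied to every reference in the post
text = strip_tags(post)              # what the reference parser would see
```

In this sample post the one `jw.org` link is outvoted by the three mechon-mamre links, which is exactly the multiple-translation case this strategy can get wrong.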

curiousdannii commented 8 years ago

I've implemented what I wrote in the last paragraph, but I'll leave this open for a little while in case anyone else has some thoughts.

alerque commented 7 years ago

Is it just me, or do these mostly look like crappy semantics to start with? We should probably just use the chance to fix the data on the SE site where it's coming from. How many instances are we dealing with where only part of a reference is linked?

curiousdannii commented 7 years ago

Several of the ones I posted above are ones I just made up, but the bottom four are real if I remember correctly. I don't know how many there are, probably a low percentage, but that still probably means thousands. d91495ca had 28K deletions, many of which would be due to this indexing change, but some of which were posts which shouldn't have been indexed in the first place, so it's hard to come up with any reasonable estimate.