alerque / stack-verse-mapper

Index Bible verse references in Stack Exchange data dumps.
https://alerque.github.io/stack-verse-mapper
GNU Lesser General Public License v3.0

Site specific bcv parsing options #10

Open curiousdannii opened 8 years ago

curiousdannii commented 8 years ago

On the less Bible-focused sites, phrases like "num 2" and "col 2" (referring to examples and columns) produce a lot of false positives.

Site-specific parsing options could be used, such as turning off whole-chapter references. Or maybe there's a way to match only full book names in such contexts.
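
A minimal sketch of what per-site options might look like, written in Python purely for illustration. The option name `allow_whole_chapter`, the site keys, and the toy regular expression are all hypothetical and not taken from the project or from the bcv parser:

```python
import re

# Hypothetical per-site options; the names and site keys are illustrative only.
DEFAULT_OPTIONS = {"allow_whole_chapter": True}
SITE_OPTIONS = {
    "history": {"allow_whole_chapter": False},
    "skeptics": {"allow_whole_chapter": False},
}

# Toy pattern covering only two abbreviations; a real parser handles far more.
REFERENCE = re.compile(r"\b(num|col)\.?\s+(\d+)(?::(\d+))?", re.IGNORECASE)


def options_for(site):
    """Merge any site-specific overrides over the defaults."""
    return {**DEFAULT_OPTIONS, **SITE_OPTIONS.get(site, {})}


def find_references(text, site):
    """Return matched reference strings, honoring the site's options."""
    opts = options_for(site)
    found = []
    for m in REFERENCE.finditer(text):
        has_verse = m.group(3) is not None
        if not has_verse and not opts["allow_whole_chapter"]:
            continue  # skip bare "col 2" style matches on this site
        found.append(m.group(0))
    return found
```

With these assumed settings, a bare "col 2" is dropped on the sites that disable whole-chapter references but kept elsewhere, while "Col 3:16" is kept everywhere.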

alerque commented 8 years ago

I don't know of a way to force full book name rather than abbreviation matching for some formats (full chapter) but not others. Code-wise it would be hard to do because of the way the regular expressions are formed and optimized and the way alternate book names and spellings are looked for. In the way it actually matches patterns there isn't a clear distinction between "full book name" and "abbreviation". I think per-site options are going to be the way to solve that one.

scottgit commented 8 years ago

While I think per-site options will work well, perhaps even within a site the regular expressions should match capitalization. A "Col. 3" book reference should be capitalized (and if it is not, fixing it is the type of editing the community should be doing to clean up posts), whereas when a column is being referenced (even on BH.SE or C.SE for some reason), it is often a lower case "col 3", and as far as possible we would want those not to show up. Some would get through anyway, but the fewer false positives the better. Those are my thoughts.
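
To make the capitalization idea concrete, here is a rough Python sketch that compares a case-sensitive pattern against a case-insensitive one. The pattern is a toy covering only "Num" and "Col"; it is an assumption for illustration, not the expression the parser actually builds:

```python
import re

# Toy pattern: capitalized abbreviation, optional period, chapter, optional verse.
CAPITALIZED = re.compile(r"\b(Num|Col)\.?\s+\d+(?::\d+)?")
# Same pattern, but ignoring case, to see what strict matching would drop.
ANY_CASE = re.compile(CAPITALIZED.pattern, re.IGNORECASE)


def compare(text):
    """Return (matches kept by strict matching, matches it would drop)."""
    strict = [m.group(0) for m in CAPITALIZED.finditer(text)]
    loose = [m.group(0) for m in ANY_CASE.finditer(text)]
    dropped = [ref for ref in loose if ref not in strict]
    return strict, dropped
```

Running `compare` over a sample of real posts would show both sides of the trade: lower case column references that are rightly dropped, and any genuine but uncapitalized references (such as "col 1:15") that would be lost with them.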

curiousdannii commented 8 years ago

If that's possible then it could work.

Another idea: on the sites like history and skeptics we could look for tags like "christianity" and "judaism" and rank such posts significantly higher than those lacking them.
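
As a rough sketch of the tag idea in Python: the post structure (`tags`, `reference_count`) and the boost factor are made up for illustration, and only the two tag names come from the suggestion above.

```python
RELIGION_TAGS = {"christianity", "judaism"}  # tags named in the suggestion
TAG_BOOST = 3.0  # hypothetical multiplier; would need tuning against real data


def rank_posts(posts):
    """Sort posts so that those carrying a relevant tag rank higher."""

    def score(post):
        base = post["reference_count"]
        has_relevant_tag = bool(RELIGION_TAGS & set(post["tags"]))
        return base * TAG_BOOST if has_relevant_tag else base

    return sorted(posts, key=score, reverse=True)
```

With this weighting, a post tagged "christianity" with one detected reference outranks an untagged post with two.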

On a less related topic, but because the commits are linked here, I don't know if Caleb should be getting all of the referral points...

alerque commented 8 years ago

In my experience case-sensitive matching on user-input data is a minefield. How could it do any harm, you say? Pippin didn't know the Balrog was down there when he dropped a stone in the well either, but some depths are better left un-poked. I've looked into how normalized user-input data ends up being and there are nothing but dark things down that road. I suppose we could run some tests to get empirical evidence, but I suspect you'll find you lose more than you gain by requiring well-formatted input to work.

jdshewey commented 8 years ago

Perhaps for ambiguous items you could check for quotes or > in the markdown to determine whether a quote follows or is in proximity to the reference, and score these entries a bit higher. Then, take a sample of the quote and check it against copies of various versions of the Bible. If a high number of matches are found in a version, score the finding higher, and if there are none, score it lower. You could also score higher or lower based on keyword matches (e.g., KJV or King James makes it more likely that it is a biblical reference and not a false positive). As this is quite a lot of song and dance, this should probably be planned for much later.
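
A very rough Python sketch of the first and last of those signals (nearby quote markers and version keywords). The weights, the two-line window, and the function name are arbitrary assumptions, and the step of checking quoted text against actual Bible versions is left out entirely:

```python
import re

# Keywords taken from the examples above; a real list would be longer.
VERSION_KEYWORDS = re.compile(r"\b(KJV|King James)\b", re.IGNORECASE)


def score_reference(markdown, ref_line, window=2):
    """Score a reference found on line index `ref_line` of a markdown post."""
    lines = markdown.splitlines()
    score = 1.0  # baseline for any detected reference
    lo = max(0, ref_line - window)
    hi = min(len(lines), ref_line + window + 1)
    nearby = lines[lo:hi]
    # Signal 1: a blockquote marker or quotation mark near the reference.
    if any(line.lstrip().startswith(">") or '"' in line for line in nearby):
        score += 1.0
    # Signal 2: a translation keyword anywhere in the post.
    if VERSION_KEYWORDS.search(markdown):
        score += 0.5
    return score
```

A reference followed by a blockquote in a post that mentions the KJV would score 2.5 here, while a bare "col 2" in a spreadsheet question stays at the 1.0 baseline.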

alerque commented 8 years ago

@jdshewey Nice to see you around here and welcome to Github!

On paper that sounds reasonable, but having reviewed the user-generated content from a parser's point of view, I have an idea that even after the song and dance was through (and mark my words, that revelry would descend into feral truculence) there would be more noise and less signal than before the party started. The relationship between user input data and formatting and the relevance of a reference is just too varied.