Closed whicks1 closed 11 months ago
Thanks for pointing out this opportunity to improve the functionality of the pythonbible library. I honestly had not considered this use case (most of my personal use cases are for searching through published texts, sometimes hundreds of years old), but this is certainly a valid use case, and there would be genuine value in being able to clean up that sort of data.
This enhancement will take some serious thought and probably discussion, so I am converting this issue into a discussion where I, and whoever else is interested in helping out, can post potential solution ideas. The solution may need to be implemented in phases as well.
Thanks again!
Machine-generated ASR output from programs like OpenAI's Whisper is on the rise, and it tends to format scripture references messily: inconsistent use of integers, ordinals, and words for book/chapter/verse numbers, inconsistent spans, varying capitalization, etc.
Here are a handful of example lines from WebVTT/SRT outputs from a batch I've run recently:
Second Timothy chapter two verses three and four says endure hardship
If you read Ephesians four 17 through 32 all the ammunition
remember that powerful message of Paul in first Corinthians nine
in Jesus's first sermonic presentation on planet earth in Matthew five through seven,
Jesus said over in Matthew chapter six, verse number 12,
Genesis four, 25.
and forth between Haggai two and Ezra three.
and go and report to John one-fifteen and thirty.
I want to focus on here is Colossians chapter three, 22 through verses through chapter four, verse one.
In 1 Corinthians 9.22, you see Paul saying
says in Mark 16 10 that the disciples were
through that fire, 1 Kings 18.24-38, 1 Chronicles 21.26, 2 Chronicles 7.1-3.
open their Bibles to first Corinthians 14, 34, 35 and say, look
Genesis 1, 26, 2, 7, and 21, 22.
look in Revelations 21, 1 through 7, you can start reading all about
Psalms 103.12 says
for one another Galatians 6 1 & 2 clearly gives us
Nearly anyone using these tools seriously will need a post-processing step to clean this sort of data up. While feeding the inputs into an LLM or NLP toolkit may make sense, it would be swell if a library like this one could do some of the heavy lifting to normalize scripture references in a string. Tall order/deep rabbit hole, I understand, but worth a shot.
Suggest a reformat_fuzzy_references function that attempts to return the input_string with even a subset of the most common speech patterns reformatted into a normalized form. Bonus points if the user can have some configuration control over output styles, e.g. omit "chapter" or use "v./vv."
Assumed gotchas:
I was just in class at 8.30 with my friend Wilson
We're going to talk at 3.30 this afternoon about the discipline of grace and there is
So in Acts chapter 2, 3,000 were saved.
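To make the request concrete, here is a minimal sketch of what such a function could look like. It is not part of pythonbible's API. The function name comes from the suggestion above, while the book list, the number-word table, and the three rewrite rules are illustrative assumptions that cover only a few of the sample lines. It rewrites a pattern only when it is anchored to a known book name, which is one possible way to leave the gotcha lines alone.

```python
import re

# Hypothetical, deliberately tiny tables; a real implementation would need
# every book name, common abbreviations, and a fuller number-word parser.
BOOKS = ("Genesis", "Psalms", "Matthew", "Mark", "John", "Acts",
         "Corinthians", "Galatians", "Ephesians", "Colossians",
         "Timothy", "Kings", "Chronicles")
NUMBERED_BOOKS = ("Corinthians", "Timothy", "Kings", "Chronicles")
ORDINALS = {"first": "1", "second": "2", "third": "3"}
WORDS = "one two three four five six seven eight nine ten eleven twelve".split()
NUMBER_WORDS = {word: str(i) for i, word in enumerate(WORDS, start=1)}

BOOK = r"(?:[123] )?(?:" + "|".join(BOOKS) + r")"
NUM = r"(?:\d+|" + "|".join(WORDS) + r")"


def _digits(token):
    """Return the digit form of a spelled-out number, or the token unchanged."""
    return NUMBER_WORDS.get(token.lower(), token)


def reformat_fuzzy_references(input_string):
    """Normalize a small subset of spoken-style references (sketch only)."""
    text = input_string

    # Rule 1: "first Corinthians" -> "1 Corinthians" (numbered books only).
    text = re.sub(
        r"\b(first|second|third) (" + "|".join(NUMBERED_BOOKS) + r")\b",
        lambda m: f"{ORDINALS[m.group(1).lower()]} {m.group(2).capitalize()}",
        text,
        flags=re.IGNORECASE,
    )

    # Rule 2: "<book> chapter six, verse number 12" -> "<book> 6:12".
    text = re.sub(
        rf"\b({BOOK}) chapter ({NUM}),? verses?(?: number)? ({NUM})\b",
        lambda m: f"{m.group(1)} {_digits(m.group(2))}:{_digits(m.group(3))}",
        text,
    )

    # Rule 3: "<book> 9.22" -> "<book> 9:22". Requiring a book name in front
    # keeps clock times such as "8.30" untouched.
    text = re.sub(rf"\b({BOOK}) (\d+)\.(\d+)\b", r"\1 \2:\3", text)

    return text


print(reformat_fuzzy_references("Psalms 103.12 says"))
# Psalms 103:12 says
```

Spans such as "four 17 through 32", bare comma lists such as "14, 34, 35", and misnamed books such as "Revelations" are not handled here. Those are the cases where the rules start to conflict with the gotchas above, so they probably need the phased discussion mentioned earlier in the thread.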