fuzzy searches, get_references for messy ASR

Machine-generated ASR programs like OpenAI's Whisper are on the rise, and they tend to output messy scripture references: inconsistent use of integers, ordinals, and words for book, chapter, and verse numbers and spans, varying capitalization, and so on.

Here are a handful of example lines from WebVTT/SRT outputs from a batch I ran recently:

Second Timothy chapter two verses three and four says endure hardship
If you read Ephesians four 17 through 32 all the ammunition
remember that powerful message of Paul in first Corinthians nine
in Jesus's first sermonic presentation on planet earth in Matthew five through seven,
Jesus said over in Matthew chapter six, verse number 12,
Genesis four, 25.
and forth between Haggai two and Ezra three.
and go and report to John one-fifteen and thirty.
I want to focus on here is Colossians chapter three, 22 through verses through chapter four, verse one.
In 1 Corinthians 9.22, you see Paul saying
says in Mark 16 10 that the disciples were
through that fire, 1 Kings 18.24-38, 1 Chronicles 21.26, 2 Chronicles 7.1-3.
open their Bibles to first Corinthians 14, 34, 35 and say, look
Genesis 1, 26, 2, 7, and 21, 22.
look in Revelations 21, 1 through 7, you can start reading all about
Psalms 103.12 says
for one another Galatians 6 1 & 2 clearly gives us

Nearly anyone using these tools seriously will need a post-processing step to clean up this sort of data. Feeding the output into an LLM or NLP toolkit may make sense, but it would be swell if a library like this one could do some of the heavy lifting to normalize scripture references in a string. Tall order and a deep rabbit hole, I understand, but worth a shot.

I suggest a reformat_fuzzy_references function that attempts to return the input_string with even a subset of the most common speech patterns rewritten into a normalized form. Bonus points if the user can have some configuration control over output styles, e.g. omitting "chapter" or using "v./vv."
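
To make the request concrete, here is a rough sketch of what I have in mind. It is standard-library Python only and does not use or reflect pythonbible's actual internals. The function name comes from the suggestion above; the `verbose` flag, the abbreviated book list, and the number-word table are all assumptions for illustration. It deliberately handles only a few explicit patterns (ordinal book prefixes, number words, "chapter N verse M", and "N.M") and leaves everything else alone.

```python
import re

# Hypothetical, abbreviated tables; a real implementation would cover every book.
ORDINALS = {"first": "1", "second": "2", "third": "3"}
_WORDS = ("one two three four five six seven eight nine ten eleven twelve thirteen "
          "fourteen fifteen sixteen seventeen eighteen nineteen twenty").split()
NUMBER_WORDS = {word: value for value, word in enumerate(_WORDS, start=1)}
BOOKS = "Genesis|Psalms|Matthew|Mark|John|Acts|Corinthians|Ephesians|Timothy"

# A number is either digits or a spelled-out word (longest words tried first).
NUM = r"(?:\d+|" + "|".join(sorted(NUMBER_WORDS, key=len, reverse=True)) + ")"

REF = re.compile(
    rf"\b(?:(?P<ord>first|second|third|[123])\s+)?(?P<book>{BOOKS})\s+"
    rf"(?:chapter\s+)?(?P<chap>{NUM})\b"
    # A verse is only accepted after "." / ":" or an explicit "verse(s) [number]".
    rf"(?:(?:\s*[.:]\s*|,?\s+verses?\s+(?:number\s+)?)(?P<v1>{NUM})\b"
    rf"(?:\s*(?P<join>-|through|and)\s*(?P<v2>{NUM})\b)?)?",
    re.IGNORECASE,
)


def _to_int(token: str) -> int:
    """Convert '12' or 'twelve' to 12."""
    return int(token) if token.isdigit() else NUMBER_WORDS[token.lower()]


def reformat_fuzzy_references(text: str, verbose: bool = False) -> str:
    """Rewrite the recognized spoken-style references in text; leave the rest as-is."""

    def fix(m: re.Match) -> str:
        book = m["book"].capitalize()
        if m["ord"]:
            book = f"{ORDINALS.get(m['ord'].lower(), m['ord'])} {book}"
        chap = _to_int(m["chap"])
        if m["v1"] is None:
            return f"{book} chapter {chap}" if verbose else f"{book} {chap}"
        verses = str(_to_int(m["v1"]))
        if m["v2"]:
            # "and" lists two verses; "-" and "through" denote a range.
            joiner = "," if m["join"].lower() == "and" else "-"
            verses += joiner + str(_to_int(m["v2"]))
        if verbose:
            label = "vv." if m["v2"] else "v."
            return f"{book} chapter {chap}, {label} {verses}"
        return f"{book} {chap}:{verses}"

    return REF.sub(fix, text)


print(reformat_fuzzy_references("Jesus said over in Matthew chapter six, verse number 12,"))
# Jesus said over in Matthew 6:12,
print(reformat_fuzzy_references("In 1 Corinthians 9.22, you see Paul saying"))
# In 1 Corinthians 9:22, you see Paul saying
```

Requiring an explicit separator or the word "verse" before a verse number is a conservative choice: bare forms like "Ephesians four 17" or "Genesis four, 25" are left only partly normalized, which seems safer than guessing.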

Assumed gotchas:

Strings may contain other semi-formatted numbers that a simple regex search may falsely flag:
- I was just in class at 8.30 with my friend Wilson
- We're going to talk at 3.30 this afternoon about the discipline of grace and there is
- So in Acts chapter 2, 3,000 were saved.
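
The gotchas above can be illustrated with a small sketch, again plain Python and not pythonbible's API. It compares a naive "number, separator, number" regex with one anchored on a book name that also rejects a "verse" that is really the start of a thousands-grouped number. The pattern names, the `find_candidates` helper, and the abbreviated book list are hypothetical.

```python
import re

# Naive pattern: any "number, separator, number" pair looks like chapter/verse.
NAIVE = re.compile(r"\b\d+[.,]\s?\d+\b")

# Anchored pattern: requires a book name (abbreviated, illustrative list) and
# rejects a second number that is followed by ",ddd", as in "3,000".
BOOKS = r"(?:Genesis|Acts|Psalms|Corinthians)"
ANCHORED = re.compile(
    rf"\b{BOOKS}\s+(?:chapter\s+)?(\d+)[.,:]\s?(\d+)\b(?!,\d{{3}}\b)",
    re.IGNORECASE,
)


def find_candidates(text: str) -> list[tuple[int, int]]:
    """Return (chapter, verse) pairs found by the anchored pattern."""
    return [(int(chap), int(verse)) for chap, verse in ANCHORED.findall(text)]


gotchas = [
    "I was just in class at 8.30 with my friend Wilson",
    "We're going to talk at 3.30 this afternoon about the discipline of grace and there is",
    "So in Acts chapter 2, 3,000 were saved.",
]
for line in gotchas:
    # The naive pattern flags every gotcha line; the anchored one flags none.
    print(bool(NAIVE.search(line)), find_candidates(line))

print(find_candidates("In 1 Corinthians 9.22, you see Paul saying"))
```

The book-name anchor alone clears the two clock-time lines. The Acts line needs the extra lookahead, and I would expect more cases like it to turn up.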

avendesora / pythonbible

fuzzy searches, get_references for messy ASR #119