avendesora / pythonbible

A Python library for validating, parsing, and normalizing scripture references and retrieving scripture texts (for open source and public domain versions)
https://docs.python.bible
MIT License

fuzzy searches, get_references for messy ASR #119

Closed (whicks1 closed 11 months ago)

whicks1 commented 11 months ago

ASR programs like OpenAI's Whisper are on the rise, and they tend to output messily formatted scripture references: inconsistent use of integers, ordinals, and spelled-out words for book, chapter, and verse numbers; inconsistent handling of spans; varying capitalization; etc.

Here are a handful of example lines from WebVTT/SRT outputs from a batch I've run recently:

Nearly anyone using these tools seriously will need a post-processing step to clean up this sort of data. Feeding the output into an LLM or NLP toolkit may make sense, but it would be swell if a library like this one could do some of the heavy lifting to normalize scripture references in a string. Tall order/deep rabbit hole, I understand, but worth a shot.

I suggest a reformat_fuzzy_references function that attempts to return the input_string with its scripture references rewritten in a normalized form, even if it only handles a subset of the most common speech patterns. Bonus points if the user can have some configuration control over output styles, e.g. omit "chapter" or use "v./vv."
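To make the idea concrete, here is a rough, standard-library-only sketch of what such a function might look like. Everything in it is an assumption for illustration: reformat_fuzzy_references is not an existing pythonbible function, the verse_style parameter and the tiny book list are made up, and the inputs in the usage note are invented rather than taken from real transcripts. It only understands spoken numbers up to 99 and will happily misread ordinary speech such as "John three times" as a reference.

```python
import re

# Spoken numbers up to 99 only; larger numbers in words are out of scope for this sketch.
_UNITS = ("zero one two three four five six seven eight nine ten eleven twelve "
          "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
_TENS = "twenty thirty forty fifty sixty seventy eighty ninety".split()
_VALUES = {word: i for i, word in enumerate(_UNITS)}
_VALUES.update({word: 20 + 10 * i for i, word in enumerate(_TENS)})

_ORDINALS = {"first": 1, "1st": 1, "1": 1, "second": 2, "2nd": 2, "2": 2,
             "third": 3, "3rd": 3, "3": 3}

# Hypothetical, deliberately tiny book list; a real version would cover every book.
_BOOKS = {"genesis": "Genesis", "psalm": "Psalm", "psalms": "Psalms",
          "john": "John", "romans": "Romans", "corinthians": "Corinthians"}

_TENS_ALT = "|".join(_TENS)
_ONES_ALT = "|".join(_UNITS[1:10])
_UNITS_ALT = "|".join(_UNITS)
_BOOK_ALT = "|".join(sorted(_BOOKS, key=len, reverse=True))

# A number is digits, "twenty three" / "twenty-three", or a single word below twenty.
_NUM = rf"(?:\d+|(?:{_TENS_ALT})(?:[ -](?:{_ONES_ALT}))?|(?:{_UNITS_ALT}))\b"

_REF = re.compile(
    rf"\b(?:(?P<ord>first|second|third|1st|2nd|3rd|[123])\s+)?"
    rf"(?P<book>{_BOOK_ALT})\b"
    rf"\s+(?:chapter\s+)?(?P<chapter>{_NUM})"
    rf"(?:\s*[:,]?\s*(?:verses?\s+)?(?P<v1>{_NUM})"
    rf"(?:\s*(?:-|through|to)\s*(?P<v2>{_NUM}))?)?",
    re.IGNORECASE,
)


def _to_int(token: str) -> int:
    """Convert '16', 'sixteen', or 'twenty eight' to an int."""
    token = token.lower()
    if token.isdigit():
        return int(token)
    return sum(_VALUES[part] for part in re.split(r"[ -]", token))


def reformat_fuzzy_references(input_string: str, verse_style: str = "colon") -> str:
    """Rewrite spoken-style references in place; verse_style is 'colon' or 'abbrev'."""

    def _rewrite(match: re.Match) -> str:
        book = _BOOKS[match["book"].lower()]
        if match["ord"]:
            book = f"{_ORDINALS[match['ord'].lower()]} {book}"
        result = f"{book} {_to_int(match['chapter'])}"
        if match["v1"] is None:
            return result  # chapter-only reference
        verses = str(_to_int(match["v1"]))
        if match["v2"]:
            verses += f"-{_to_int(match['v2'])}"
        if verse_style == "abbrev":
            return f"{result} {'vv.' if match['v2'] else 'v.'} {verses}"
        return f"{result}:{verses}"

    return _REF.sub(_rewrite, input_string)
```

With this sketch, an invented input like "turn with me to john chapter three verse sixteen" comes back as "turn with me to John 3:16", and "First Corinthians 13, verses four through seven." becomes "1 Corinthians 13:4-7." (or "1 Corinthians 13 vv. 4-7." with verse_style="abbrev"). The cleaned string could then presumably be handed to the library's existing get_references for actual parsing.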

Assumed gotchas:

avendesora commented 11 months ago

Thanks for pointing out this opportunity to improve the functionality of the pythonbible library. I honestly had not considered this use case (most of my personal use cases are for searching through published texts, sometimes hundreds of years old), but this is certainly a valid use case, and there would be genuine value in being able to clean up that sort of data.

This enhancement will take some serious thought and probably discussion, so I am converting this issue into a discussion where I, and whoever else is interested in helping out, can post potential solution ideas. The solution may need to be implemented in phases as well.

Thanks again!