Some statements have trailing punctuation and capitalization, so we aren't cleaning them correctly somewhere in the pipeline.
When statements lack trailing punctuation and capitalization, they are displayed without them on the platform.
I think the behavior we want is for statements to be stored in a consistent state (i.e., all with trailing punctuation and capitalization, or all without) and displayed in a consistent way (i.e., always shown with the first letter capitalized and a period at the end).
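A minimal sketch of the stored-vs-displayed split, assuming we store statements without a trailing period and add capitalization and the period only at render time. The function names `canonicalize_for_storage` and `render_statement` are hypothetical, not existing utilities in our repo, and the choice of stored form is just one of the two options above.

```python
import re

TERMINAL_PUNCTUATION = (".", "!", "?")


def canonicalize_for_storage(text: str) -> str:
    """Assumed stored form: single spaces, trimmed, no trailing period."""
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Drop trailing periods only; "?" and "!" carry meaning, so keep them.
    return cleaned.rstrip(".").rstrip()


def render_statement(text: str) -> str:
    """Display form: first letter capitalized, terminal punctuation present."""
    cleaned = canonicalize_for_storage(text)
    if not cleaned:
        return cleaned
    capitalized = cleaned[0].upper() + cleaned[1:]
    if capitalized.endswith(TERMINAL_PUNCTUATION):
        return capitalized
    return capitalized + "."


print(render_statement("  the sky is   blue.  "))
```

Both functions are idempotent, so it should not matter whether a statement was already cleaned upstream. One known limitation of this sketch: a statement ending in an ellipsis loses it in the stored form.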
Additionally, I think we want to clean up any existing statements that have these issues without breaking IDs.
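One way to clean existing rows without breaking IDs is to update only the text column and never delete or re-insert rows. The sketch below uses an in-memory SQLite database; the `statements` table, its `id` and `text` columns, and the `clean_text` rule are all assumptions for illustration, not our real schema or cleaning utility.

```python
import sqlite3


def clean_text(text: str) -> str:
    """Placeholder rule: trim whitespace and drop trailing periods."""
    return text.strip().rstrip(".").rstrip()


def clean_existing_statements(conn: sqlite3.Connection) -> int:
    """Rewrite only the text column; primary keys are never touched."""
    rows = conn.execute("SELECT id, text FROM statements").fetchall()
    changes = [
        (clean_text(text), statement_id)
        for statement_id, text in rows
        if clean_text(text) != text
    ]
    with conn:  # commit all updates together, or roll back on error
        conn.executemany("UPDATE statements SET text = ? WHERE id = ?", changes)
    return len(changes)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (id INTEGER PRIMARY KEY, text TEXT)")
conn.executemany(
    "INSERT INTO statements (id, text) VALUES (?, ?)",
    [(1, "the sky is blue."), (2, "Water is wet")],
)
updated = clean_existing_statements(conn)
print(updated, conn.execute("SELECT id, text FROM statements ORDER BY id").fetchall())
```

Returning the number of changed rows makes it easy to sanity-check a dry run on dev before touching prod.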
To consider this finished, we should have a test in place that checks that statements are correctly formatted both after ingestion and at render time.
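The test could be sketched as two predicates, one per layer, applied to whatever the pipeline produces. This assumes the stored form has no trailing period and the display form is capitalized with terminal punctuation; the predicate names and the sample lists are hypothetical stand-ins for real ingestion and render output.

```python
def is_stored_form(text: str) -> bool:
    """Assumed stored form: non-empty, trimmed, no trailing period."""
    return bool(text) and text == text.strip() and not text.endswith(".")


def is_display_form(text: str) -> bool:
    """Display form: trimmed, not starting lowercase, ending in . ! or ?"""
    if not text or text != text.strip():
        return False
    return not text[0].islower() and text.endswith((".", "!", "?"))


def find_violations(statements, check):
    """Return every statement failing `check`, so a test can report them all."""
    return [s for s in statements if not check(s)]


def test_stored_statements_are_consistent():
    stored = ["the sky is blue", "Water is wet"]  # stand-in for rows read back after ingestion
    assert find_violations(stored, is_stored_form) == []


def test_rendered_statements_are_consistent():
    rendered = ["The sky is blue.", "Is water wet?"]  # stand-in for render output
    assert find_violations(rendered, is_display_form) == []
```

The `test_` prefix follows the pytest naming convention, so these would be collected automatically if dropped into a test module.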
Clean the entire DB once, as a PR to the statements repo. Use raw statements for translation, then clean the English and foreign-language statements after translation. The cleaning rules may depend on the language.
Set this up as part of the statement ingestion pipeline.
blockers:
~understanding how local, dev, prod dbs relate and how we can move stuff to prod?~
@dankim444 how should things look in other languages? Some languages don't use capitalization or punctuation in the same way.
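To make the language question concrete, here is a rough sketch of a language-aware renderer. The terminator table is a small illustrative assumption (Japanese and Chinese use "。", Hindi uses the danda "।"), and `render_for_language` is a hypothetical name. Unknown languages are left without added punctuation rather than guessing.

```python
# Illustrative subset only; the real list depends on which languages we support.
TERMINATORS = {"en": ".", "ja": "。", "zh": "。", "hi": "।"}
KNOWN_ENDINGS = (".", "!", "?", "。", "！", "？", "।")


def render_for_language(text: str, lang: str) -> str:
    """Capitalize where the script has case; append a terminator if we know one."""
    cleaned = text.strip()
    if not cleaned:
        return cleaned
    # str.upper() leaves caseless scripts (CJK, Devanagari, ...) unchanged.
    cleaned = cleaned[0].upper() + cleaned[1:]
    if cleaned.endswith(KNOWN_ENDINGS):
        return cleaned
    return cleaned + TERMINATORS.get(lang, "")


print(render_for_language("空は青い", "ja"))
```

Capitalizing the first character is harmless for scripts without case, but Python's `str.upper()` is not locale-aware, so languages with special casing rules (e.g., Turkish dotted and dotless i) would need separate handling.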
Decide where to do this. Should we specify this in the cleaning utility function? Probably not render, but maybe some object that describes the mutations?
The goal is to make the presentation consistent, more than to enforce any one specific format.
Currently dealing with edge cases caused by "escaping" differences between the services that touch the statements (e.g., a string with a quote in it, maybe some apostrophes, and a cursed backtick).
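A possible starting point for the escaping edge cases, assuming the differences come from HTML entities, backslash-escaped quotes, and typographic quote characters. `normalize_quotes` is a hypothetical helper, and mapping the backtick to an apostrophe is a guess at the intended behavior.

```python
import html

# Map typographic quotes (and, by assumption, backticks) to plain ASCII quotes.
QUOTE_MAP = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "`": "'",
})


def normalize_quotes(text: str) -> str:
    """Undo common escaping layers so every service sees the same characters."""
    unescaped = html.unescape(text)  # e.g. &quot; -> " and &#39; -> '
    unescaped = unescaped.replace('\\"', '"').replace("\\'", "'")
    return unescaped.translate(QUOTE_MAP)


print(normalize_quotes('She said \\"it&#39;s fine\\" and `left`'))
```

Running this before comparing or deduplicating statements would make the comparison independent of which service last wrote the string.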
starting with NLP libraries to check
will complete by Friday (hopefully)
One idea is to run the statements through a language model, which should handle most situations.
Think about types of errors that we have, and figure out ways to detect and fix them specifically.
Count the number of times we see each type of issue and decide whether to make a rule or just manually fix them depending on how many there are.
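The counting step could look something like the sketch below. The detector names and rules are assumptions about which error types we care about (and whether a trailing period counts as an issue depends on the stored form we pick); the idea is one small predicate per issue type and a tally over all statements.

```python
from collections import Counter

# One predicate per suspected issue type; extend as new error types turn up.
DETECTORS = {
    "lowercase_start": lambda s: s[:1].islower(),
    "trailing_period": lambda s: s.rstrip().endswith("."),
    "surrounding_whitespace": lambda s: s != s.strip(),
    "escaped_quote": lambda s: '\\"' in s or "\\'" in s,
    "double_space": lambda s: "  " in s,
}


def count_issues(statements):
    """Tally how many statements trigger each detector."""
    counts = Counter()
    for statement in statements:
        for name, detect in DETECTORS.items():
            if detect(statement):
                counts[name] += 1
    return counts


sample = ["the sky is blue.", "Water is wet", " Fire is hot. "]
for name, count in count_issues(sample).most_common():
    print(f"{name}: {count}")
```

Issue types with large counts are candidates for an automated rule; ones that show up only a handful of times can be fixed by hand.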
This likely has 2 layers: statements should be stored in a consistent state (i.e., all with trailing punctuation and capitalization, or all without) and displayed in a consistent way (i.e., always shown with the first letter capitalized and a period at the end).