Open Shiiiijie opened 2 years ago
Yes, you are right. Because our data comes from OCR output and is not stored as one continuous paragraph, it cannot simply be split on spaces. The text can be restored to something close to its original form by removing the end-of-line hyphens and newlines and joining the broken pieces back together. This will solve the problem described above.
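A minimal sketch of that restoration step, assuming the OCR text is available as a plain Python string. The function name `rejoin_ocr_lines` and the sample text are hypothetical, chosen for illustration, and are not taken from the project's code:

```python
import re


def rejoin_ocr_lines(text: str) -> str:
    """Undo OCR line wrapping so the text can be split on whitespace."""
    # Join a word hyphenated across a line break: "bro-\nken" -> "broken".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Turn every remaining line break (and surrounding whitespace) into one space.
    text = re.sub(r"\s*\n\s*", " ", text)
    return text.strip()


sample = "The text was bro-\nken by the OCR\nstep."
tokens = rejoin_ocr_lines(sample).split()
```

One caveat with this approach: a genuinely hyphenated compound that happens to wrap at the hyphen (for example "well-" at the end of a line and "known" on the next) would also be merged into one word, so a dictionary check may be needed if that matters for the data.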
I see that the previous text cleaning process split words on spaces. Is this not accurate enough? I ask because the original txt file contains a number of words that are broken across lines, which can cause each of them to be incorrectly split into two words.