BigCokee / G14_Encyclopedia

UoE/DS4D/G14_Encyclopedia

On the accuracy of text cleaning. #12

Open Shiiiijie opened 2 years ago

Shiiiijie commented 2 years ago

I see that the previous text-cleaning step split words on spaces. Is that accurate enough? The original txt file contains a number of words that are broken across lines, so splitting on spaces can incorrectly turn one word into two.

Kkkeren7 commented 2 years ago

Yes, you are right. Our data comes from OCR output, where the text is broken into separate lines rather than kept as continuous paragraphs, so it cannot simply be split on spaces. We can restore the text to its original form as far as possible by removing the end-of-line hyphens and newlines and joining the fragments back together. This should solve the problem described above.
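
A minimal sketch of that idea in Python, for illustration only. The function name `rejoin_ocr_lines` and the sample text are hypothetical and not taken from the repository, and the sketch assumes that a hyphen at the end of a line always marks a broken word. Under that assumption a genuine compound such as "well-known" that happens to wrap at its hyphen would also be merged, so a dictionary check may be needed for better accuracy.

```python
import re


def rejoin_ocr_lines(text: str) -> str:
    """Undo OCR line wrapping so words can be split on whitespace afterwards."""
    # Normalise Windows line endings so the patterns below only deal with "\n".
    text = text.replace("\r\n", "\n")
    # Join a word broken by an end-of-line hyphen: "encyclo-\npaedia" -> "encyclopaedia".
    text = re.sub(r"(\w)-[ \t]*\n[ \t]*(\w)", r"\1\2", text)
    # Turn a single newline into a space; blank lines (paragraph breaks) are kept.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces or tabs left over from the joins.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


# Hypothetical sample that imitates wrapped OCR output.
sample = "The encyclo-\npaedia was pub-\nlished in Edin-\nburgh.\n\nSecond para."
cleaned = rejoin_ocr_lines(sample)
words = cleaned.split()
```

After this step, splitting `cleaned` on whitespace gives whole words such as `encyclopaedia` instead of the fragments `encyclo-` and `paedia`.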