boun-tabi-LMG / turkish-academic-text-harvest

MIT License

Clean and standardize punctuation #10

Closed zeynepyirmibes closed 1 year ago

zeynepyirmibes commented 1 year ago

We should standardize non-standard punctuation and remove the end-of-line hyphens that split a word across lines (e.g. "bilgi veril- mektedir." should become "bilgi verilmektedir.").
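
A minimal sketch of what standardizing punctuation could look like, assuming it means mapping typographic variants to plain ASCII equivalents. The `PUNCT_MAP` table and the `standardize_punctuation` name are hypothetical illustrations, not taken from the repository:

```python
# Hypothetical mapping of typographic punctuation to ASCII equivalents.
# The set of characters actually handled by the project may differ.
PUNCT_MAP = {
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u201e": '"',   # double low-9 quotation mark
    "\u00ab": '"',   # left-pointing guillemet
    "\u00bb": '"',   # right-pointing guillemet
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2026": "...", # horizontal ellipsis
}

# str.maketrans accepts a dict of single characters mapped to strings.
_TABLE = str.maketrans(PUNCT_MAP)


def standardize_punctuation(text: str) -> str:
    """Replace each mapped character with its ASCII equivalent."""
    return text.translate(_TABLE)
```

A translation table keeps the rule set in one place, so new characters can be added without touching the function.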

zeynepyirmibes commented 1 year ago

We will delete every hyphen (-) at the end of a line, add a flag to that line, and concatenate it with the next line without a space.
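
The rule above can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name `join_hyphenated_lines` is hypothetical, and the flag is represented here as a boolean paired with each output line, since the thread does not say how the flag is stored.

```python
from typing import List, Tuple


def join_hyphenated_lines(lines: List[str]) -> List[Tuple[str, bool]]:
    """Join lines ending in '-' with the following line, without a space.

    Returns (text, flagged) pairs; flagged is True when at least one
    end-of-line hyphen was removed to produce that text.
    """
    out: List[Tuple[str, bool]] = []
    pending = ""      # text accumulated from lines that ended in a hyphen
    flagged = False   # True while a hyphen-joined line is being built
    for line in lines:
        # A continuation line loses its leading whitespace so the two
        # word halves meet directly.
        text = line.strip() if flagged else line.rstrip()
        if text.endswith("-"):
            pending += text[:-1]  # drop the trailing hyphen
            flagged = True
        else:
            out.append((pending + text, flagged))
            pending, flagged = "", False
    if flagged:
        # The input ended on a hyphenated line with nothing to join to.
        out.append((pending, True))
    return out
```

For example, `join_hyphenated_lines(["bilgi veril-", "mektedir."])` yields a single flagged entry containing "bilgi verilmektedir.". Because every end-of-line hyphen is removed, a legitimate hyphenated compound that happens to break at a line end is also joined; the flag makes such lines easy to review later.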

furkanakkurt1335 commented 1 year ago

This may be helpful: https://huggingface.co/learn/nlp-course/chapter6/4, recommended by @onurgu.

zeynepyirmibes commented 1 year ago

Cleaning and standardizing punctuation is now implemented in normalize.py.