Preprocessing custom dataset without removing punctuation

MIND-Lab / OCTIS

OCTIS: Comparing Topic Models is Simple! A python package to optimize and evaluate topic models (accepted at EACL2021 demo track)

MIT License

705 stars, 98 forks

Preprocessing custom dataset without removing punctuation #115

Open: ninavdPipple opened this issue 6 months ago

ninavdPipple commented 6 months ago

Hi, I'm trying to load a custom dataset without removing the punctuation. However, even with remove_punctuation = False, all punctuation is still removed and, even worse, any word attached to punctuation disappears as well. For example, 'Good evening!' simply becomes 'Good' in the corpus. How can I fix this? Ideally I want to remove all punctuation except '<' and '>', but I cannot find any configuration that leaves any punctuation in place. Thanks in advance! Nina

ninavdPipple commented 6 months ago

I figured out that this happens because the preprocessing builds a vocabulary from which all punctuation is automatically removed, and tokens that are not in that vocabulary get dropped. Ignoring the vocabulary would avoid this.
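
For anyone hitting the same thing, here is a minimal, standard-library-only sketch of the mechanism described above. It is not OCTIS's actual code: the function names (build_vocabulary, filter_by_vocabulary, strip_punctuation_except) and the regex are assumptions made for illustration. It shows how a vocabulary built from word characters only can drop a token like 'evening!' entirely, and one possible workaround, which is to strip the unwanted punctuation yourself (keeping '<' and '>') before the text reaches any preprocessing.

```python
import re
import string


def build_vocabulary(docs):
    """Hypothetical stand-in for a vocabulary step: collect tokens made of
    word characters only, so punctuation never enters the vocabulary."""
    vocab = set()
    for doc in docs:
        vocab.update(re.findall(r"\b\w+\b", doc))
    return vocab


def filter_by_vocabulary(doc, vocab):
    """Keep only whitespace-separated tokens that appear verbatim in vocab.
    A token such as 'evening!' does not match 'evening' and is dropped."""
    return [tok for tok in doc.split() if tok in vocab]


def strip_punctuation_except(doc, keep="<>"):
    """Remove all ASCII punctuation except the characters listed in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    return doc.translate(str.maketrans("", "", to_remove))


docs = ["Good evening!", "<greeting> Hello, world."]

# Reproduce the reported symptom: words attached to punctuation vanish.
vocab = build_vocabulary(docs)
filtered = [filter_by_vocabulary(d, vocab) for d in docs]

# Possible workaround: clean the text up front and tokenize on whitespace.
cleaned = [strip_punctuation_except(d).split() for d in docs]

print(filtered)
print(cleaned)
```

The cleaned token lists could then be written out as the corpus directly, skipping the vocabulary-based filtering. Whether that fits depends on how the rest of the pipeline expects the dataset to be built.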