I have found that punctuation and hyphens attached to a word are not stripped before the dictionary lookup in get_arpabet() in text/__init__.py.
For example, tokens like "recommendations.", "fbi," and "policy-making" are not found in the cmu_dict, so the lookup fails for them.
I think this will reduce model performance.
I suggest the attached code as a fix.
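To illustrate the idea, here is a minimal sketch of one possible approach, not the attached code itself. It assumes the dictionary can be treated as a plain word-to-ARPAbet mapping (TOY_DICT is a hypothetical stand-in for cmu_dict), and that converted words are wrapped in curly braces; the function name get_arpabet_clean is also made up for this example. It peels punctuation off both ends of the token, splits the rest on hyphens, looks up each part, and then puts the punctuation back.

```python
import re

# Hypothetical stand-in for cmu_dict: lowercase word -> ARPAbet string.
TOY_DICT = {
    "recommendations": "R EH2 K AH0 M AH0 N D EY1 SH AH0 N Z",
    "fbi": "EH1 F B IY1 AY1",
    "policy": "P AA1 L AH0 S IY0",
    "making": "M EY1 K IH0 NG",
}

# Groups: leading punctuation, core word, trailing punctuation.
_WORD_RE = re.compile(r"^(\W*)(.*?)(\W*)$", re.DOTALL)


def get_arpabet_clean(word, dictionary):
    """Look up a token after stripping surrounding punctuation.

    Hyphenated tokens are split and each part is looked up separately.
    Parts that are not in the dictionary are left as plain text.
    The curly-brace wrapping is an assumption of this sketch.
    """
    prefix, core, suffix = _WORD_RE.match(word).groups()
    if not core:
        # Token is empty or punctuation only: nothing to look up.
        return word

    converted = []
    for part in core.split("-"):
        arpabet = dictionary.get(part.lower())
        if arpabet is not None:
            converted.append("{" + arpabet + "}")
        else:
            converted.append(part)

    # Reattach the punctuation that was stripped before the lookup.
    return prefix + "-".join(converted) + suffix


if __name__ == "__main__":
    for token in ["recommendations.", "fbi,", "policy-making"]:
        print(token, "->", get_arpabet_clean(token, TOY_DICT))
```

With this, "fbi," is looked up as "fbi" and the comma is kept after the converted word, while "policy-making" is converted as two separate words joined by the original hyphen.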