NirantK / Hinglish

Hinglish Text Classification
MIT License

Remove URL artifacts #7

Closed NirantK closed 4 years ago

NirantK commented 4 years ago

The people who preprocessed this data don't know how to use regex to remove URLs completely.

They stripped the special characters but left the rest of the URL behind as stray tokens: https t co tsrsbu
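To illustrate how an artifact like this can arise, here is a minimal sketch. It assumes the original preprocessing replaced every non-alphanumeric character with a space; the function name `strip_special_chars` and the sample URL are hypothetical and are not taken from the repo.

```python
import re


def strip_special_chars(text: str) -> str:
    """Hypothetical stand-in for the original cleaning step (an assumption)."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    spaced = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse the runs of whitespace left behind.
    return " ".join(spaced.split())


# The URL loses its punctuation but its pieces survive as separate tokens.
print(strip_special_chars("check this https://t.co/tsrsbu"))
```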

NirantK commented 4 years ago

Oh, those people are us. Please use the regex from the fastai library for cleaning and preprocessing text. It is better tested than ours.

NirantK commented 4 years ago
  1. Keep the mentions
  2. Keep the hashtags
  3. Remove the URLs
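
A minimal sketch of those three rules, for illustration only. This is not the fastai regex and not the code from the PR; the pattern `URL_RE` and the function `remove_urls` are hypothetical names, and the pattern only covers URLs that start with `http://`, `https://`, or `www.`. It has to run on the raw text, before any step that strips special characters.

```python
import re

# Assumed pattern: a scheme or "www." followed by everything up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def remove_urls(text: str) -> str:
    """Drop whole URLs; leave @mentions and #hashtags untouched."""
    without_urls = URL_RE.sub(" ", text)
    # Collapse the extra whitespace left where a URL used to be.
    return " ".join(without_urls.split())


print(remove_urls("@user yeh dekho https://t.co/tsrsbu #mast"))
```

Because the pattern is anchored on the URL prefix, tokens starting with `@` or `#` are never matched, so mentions and hashtags pass through unchanged.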

See the code in PR #10.