Closed — oxinabox closed this issue 5 years ago
I am interested in working on this issue. I searched for a while and mainly came across the following two tweet tokenizers.
Which of the two would be better?
The NLTK one.
This package is Apache 2, so the Tweet NLP license is not compatible. We have already taken tokenizers from NLTK, and so we can do so again.
I was going through the codebase. I noticed that sed was used rather than regex matching in Julia. Is that for performance? Should I also stick with a sed-based tokenizer?
`sed` isn't actually being used.
We generate Julia code based on the sed script; sed is basically being used as a DSL.
This is done here https://github.com/JuliaText/WordTokenizers.jl/blob/5fad6ffb3678bda8e46bc87d9aeafa65bc69d439/src/words/sedbased.jl#L9
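To illustrate the idea (this is a hedged sketch, not the package's actual generated code; the rule and function names here are hypothetical): a sed substitution such as `s/&amp;/\&/g` translates, at code-generation time, into a Julia pattern-to-replacement pair applied with `replace`.

```julia
# Hypothetical sketch: the sed rule  s/&amp;/\&/g  becomes a
# pattern => replacement pair in the generated Julia code.
const AMP_RULE = "&amp;" => "&"

# The generated tokenizer step is then just a call to `replace`:
fix_amp(s::AbstractString) = replace(s, AMP_RULE)

fix_amp("fish &amp; chips")  # "fish & chips"
```

So sed's substitution syntax serves only as a concise specification language; the code that runs is plain Julia.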
I am nearing completion, but I seem to be stuck on one thing: I need to decode the Windows-1252 encoding (cp1252) into UTF-8 (Unicode). Any leads that I could get on this?
You do? Where are you encountering this?
Anyway, StringEncodings.jl should be what you are after
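A minimal sketch of what that could look like with StringEncodings.jl's `decode` (the byte values here are just an illustration):

```julia
using StringEncodings

# 0x93 and 0x94 are the curly double quotation marks in Windows-1252.
bytes = UInt8[0x93, 0x68, 0x69, 0x94]   # “hi” encoded in cp1252
s = decode(bytes, "WINDOWS-1252")       # decode to a UTF-8 Julia String

s  # "\u201chi\u201d"
```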
A certain limited range of numbers is interpreted by web browsers as representing characters in the Windows-1252 encoding.
Part of the tokenizer involves replacing HTML entities in the text by converting them to their corresponding Unicode characters. This is where I need it.
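Concretely, browsers treat numeric character references in the 0x80–0x9F range as Windows-1252 rather than as C1 control characters, so the entity replacement needs a small remapping table. A hedged sketch (partial table, hypothetical names, not the package's implementation):

```julia
# Partial, hypothetical remapping: numeric HTML entities in 0x80–0x9F
# are shown by browsers as Windows-1252 characters, not Unicode controls.
const CP1252_REMAP = Dict{Int,Char}(
    0x80 => '\u20ac',  # euro sign
    0x91 => '\u2018',  # left single quotation mark
    0x92 => '\u2019',  # right single quotation mark
    0x93 => '\u201c',  # left double quotation mark
    0x94 => '\u201d',  # right double quotation mark
    0x96 => '\u2013',  # en dash
)

# Map an entity's numeric code to its intended character; codes outside
# the remapped range fall through to the plain Unicode code point.
entity_char(code::Integer) = get(CP1252_REMAP, code, Char(code))

entity_char(0x93)  # '\u201c', what a browser shows for &#147;
```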
Also, thanks. I will look into StringEncodings.jl.
Oh neat! An emoticon lexer.
Twitter language tends to not like normal tokenizers much.
There are some Twitter tokenizers around, so we could port one of those.