JuliaText / WordTokenizers.jl

High performance tokenizers for natural language processing and other related tasks

Add a Twitter tokenizer #3

Closed by oxinabox 5 years ago

oxinabox commented 6 years ago

Normal tokenizers tend not to handle Twitter language well.

There are some Twitter tokenizers around, so we could port one of those.
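To make the problem concrete, here is a minimal Python sketch contrasting a generic word/punctuation splitter with a pattern that knows about mentions, hashtags, and a few emoticons. Both function names and both regular expressions are illustrative assumptions; neither is taken from NLTK or from this package.

```python
import re

# Generic splitter: runs of word characters, or single punctuation marks.
NAIVE = re.compile(r"\w+|[^\w\s]")

# Hypothetical tweet-aware pattern: tries mentions/hashtags and a small set
# of emoticons before falling back to the generic rules.
TWEET_AWARE = re.compile(r"[@#]\w+|[:;=][-o^]?[()DPp]|\w+|[^\w\s]")


def naive_tokenize(text: str) -> list:
    """Split text into word runs and individual punctuation characters."""
    return NAIVE.findall(text)


def tweet_aware_tokenize(text: str) -> list:
    """Like naive_tokenize, but keeps @mentions, #hashtags and emoticons whole."""
    return TWEET_AWARE.findall(text)


tweet = "@user loving #julia :-)"
print(naive_tokenize(tweet))
# ['@', 'user', 'loving', '#', 'julia', ':', '-', ')']
print(tweet_aware_tokenize(tweet))
# ['@user', 'loving', '#julia', ':-)']
```

The generic splitter breaks the mention, the hashtag, and the emoticon into fragments, which is the kind of damage a dedicated tweet tokenizer is meant to avoid.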

Ayushk4 commented 5 years ago

I am interested in working on this issue. I searched for a while and mainly came across the following two tweet tokenizers.

Which of the two would be better?

oxinabox commented 5 years ago

The NLTK one.

This package is Apache 2, so the Tweet NLP license is not compatible. We have already taken tokenizers from NLTK, so we can do so again.

Ayushk4 commented 5 years ago

I was going through the codebase and noticed that sed is used rather than regex matching in Julia. Is that for performance reasons? Should I also stick with a sed-based tokenizer?

oxinabox commented 5 years ago

sed isn't actually being used.

We actually generate Julia code based on the sed script. sed is basically being used as a DSL.

This is done here https://github.com/JuliaText/WordTokenizers.jl/blob/5fad6ffb3678bda8e46bc87d9aeafa65bc69d439/src/words/sedbased.jl#L9
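As a rough illustration of the "sed as a DSL" idea, the Python sketch below reads a tiny script of `s/pattern/replacement/` rules and turns it into a tokenizer function. This is an assumption-laden stand-in, not the package's implementation: the linked file generates Julia code, whereas this sketch builds a closure at runtime, handles only the `s` command with an optional `g` flag, does not support escaped slashes, and treats each pattern as a Python regular expression. The name `compile_sed_script` and the two example rules are hypothetical.

```python
import re
from typing import Callable, List, Pattern, Tuple


def compile_sed_script(script: str) -> Callable[[str], str]:
    """Turn a small subset of sed substitution rules into a Python function."""
    rules: List[Tuple[Pattern, str, int]] = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split("/")
        # Expect exactly: s / pattern / replacement / flags
        if len(parts) != 4 or parts[0] != "s" or parts[3] not in ("", "g"):
            raise ValueError(f"unsupported rule: {line!r}")
        pattern, replacement, flags = parts[1], parts[2], parts[3]
        count = 0 if flags == "g" else 1  # count=0 means "replace all" in re.sub
        rules.append((re.compile(pattern), replacement, count))

    def apply(text: str) -> str:
        # Apply every rule in script order, as sed would for a single line.
        for regex, replacement, count in rules:
            text = regex.sub(replacement, text, count=count)
        return text

    return apply


# Two made-up rules: pad commas with spaces, split off the "n't" contraction.
script = """
# pad commas
s/,/ , /g
s/n't/ n't/g
"""
rewrite = compile_sed_script(script)
print(rewrite("I can't, sorry").split())
# ['I', 'ca', "n't", ',', 'sorry']
```

The point of the design is that the rules stay in a compact, well-known notation, while the program that actually runs is ordinary code in the host language and does not shell out to sed.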

Ayushk4 commented 5 years ago

I am nearing completion, but I have been stuck on one thing for a while: I need to decode Windows-1252 (cp1252) encoded text into UTF-8 (Unicode). Any leads on this?

oxinabox commented 5 years ago

You do? Where are you encountering this?

Anyway, StringEncodings.jl should be what you are after.

Ayushk4 commented 5 years ago

Web browsers interpret a certain limited range of numeric entity values as characters in the Windows-1252 encoding.

One step of the tokenizer is "replacing HTML entities in the text with their corresponding Unicode characters". This is where I need it.

Also, thanks. I will look into StringEncodings.jl.

oxinabox commented 5 years ago

Oh neat! An emoticon lexer. (Not reviewed)