Closed Ayushk4 closed 5 years ago
Merging #13 into master will decrease coverage by 4.98%. The diff coverage is 72%.

```
@@            Coverage Diff             @@
##           master      #13      +/-  ##
==========================================
- Coverage   80.86%   75.88%    -4.99%
==========================================
  Files           9       10        +1
  Lines         277      651      +374
==========================================
+ Hits          224      494      +270
- Misses         53      157      +104
```

| Impacted Files | Coverage Δ |
|---|---|
| src/split_api.jl | 0% <ø> (ø) |
| src/words/fast.jl | 81.81% <66.66%> (+0.56%) |
| src/words/tweet_tokenizer.jl | 72.04% <72.04%> (ø) |

Continue to review the full report at Codecov. (Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.)

Powered by Codecov. Last update 4d877dc...c7bd296.
For record-keeping purposes: this addresses #3.
Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers will be added, so maybe we can have examples on each.
My feeling is that the tokenizers will remain simple enough that they can all go in the README. At the point at which this changes, we can look at a solution like Documenter.jl and a documentation page. But I find that, in general, dealing with that is a surprisingly constant source of complexity for a project.
You'll need to rebase; it should be a simple rebase. Also, how do you feel about restructuring this to use the fast new TokenBuffer that was just merged? https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl
I need more time to think about improving the performance of this tokenizer. I have been thinking that there may be a way to do it. After looking at fast.jl and the new version of nltk_tokenizer, I believe a similar approach may be possible. This tokenizer has more features than the nltk tokenizer, and I am having difficulty visualizing a way to improve the performance without compromising on any one feature. I also believe this tokenizer could incorporate some features of nltk_tokenize (with a little compromise on speed for more robust tokenizing).
I'm adding unit tests for the current code, and making some minor changes to improve readability along with minor performance tweaks. I will get back to working on this as soon as something strikes me.
I'm interested in working on adding dependency/constituency parsing. I will be working on that in the meantime.
I have added the tests and I have made the specified changes.
I've been thinking about this for a while. What we could do is get a couple of twitter dumps (ideally from different sources) then add some extra stuff to count how often each feature of this tokenizer is actually used when we tokenize those twitter corpora. Then we could remove features that are not used. E.g. I suspect that a lot of the stuff to handle weird encoding things is not common anymore.
But I think that does not help with translating a lot of the regex "atoms".
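A rough sketch of the feature-usage counting idea (the patterns, names, and sample tweets here are illustrative toys, not the package's actual regexes):

```julia
# Count how often each regex "feature" fires over a corpus of tweets.
patterns = Dict(
    "TWITTER_HASHTAGS" => r"#\w+",
    "TWITTER_USERNAME" => r"@\w+",
    "URLS"             => r"https?://\S+",
)

tweets = [
    "Check out #JuliaLang at https://julialang.org",
    "@someone this is #fun",
]

# For each named pattern, count total matches across the whole corpus.
counts = Dict(
    name => sum(t -> length(collect(eachmatch(pat, t))), tweets)
    for (name, pat) in patterns
)
```

Features whose count stays at zero over a large, diverse corpus would then be candidates for removal.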
> Then we could remove features that are not used. E.g. I suspect that a lot of the stuff to handle weird encoding things is not common anymore.
This might very well be the case. While adding the tokenizer, I intended to implement it as-is from nltk's tweet_tokenizer, but faster. I do think that some features may not be used much.
> What we could do is get a couple of twitter dumps (ideally from different sources) then add some extra stuff to count how often each feature of this tokenizer is actually used when we tokenize those twitter corpora.
This seems promising. I will start working on it and update you with the progress.
I found some datasets on Kaggle made of recent tweets. I also looked up the Twitter API endpoints: through the Twitter Search API, I can get recent tweets containing a searched keyword. I will be evaluating the regex on the Kaggle tweets, but how should I go about the latter, given that the Search API only gives me back tweets containing the keywords I searched for?
Let's do a search for Emoji and Unicode and "Zango Text" and "Twitter Error"
I feel like that would be good.
I have obtained some recent tweets via the Twitter Search API. No tweets were found by the search query "Zango text" in the past 7-day window. I also have added a couple of search queries on my own. I have my work over here - https://github.com/Ayushk4/Tweet_tok_analyse. I will get back with the details on how often each feature of this tokenizer is used.
`Tweets_Extractor.ipynb` contains the code used to search tweets. The keys will have to be regenerated for accessing the Twitter Search API. `Twitter_API_tweets.json` has the result stored in a JSON file.

Please give suggestions, if any.
This looks sensible to me. (though that JSON file seems like it has been over-escaped?)
Also, you committed your API credentials to GitHub. They are now compromised; you need to void them on the Twitter website. (I suggest, in future, putting such things into environment variables and reading them out with `ENV["TWITTER_API_KEY"]`.)
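A minimal sketch of that pattern (the helper name is illustrative):

```julia
# Read the API key from an environment variable instead of hard-coding it
# into a notebook that gets committed.
function twitter_api_key()
    key = get(ENV, "TWITTER_API_KEY", nothing)
    key === nothing && error("TWITTER_API_KEY is not set")
    return key
end
```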
I regenerated my API credentials on Twitter before committing, so the current ones on GitHub have been nullified. I will use environment variables from next time. Thank you for the suggestion.
Sounds like you are already on top of it. :-D
I have analysed the frequency of the features. You can find them here - https://github.com/Ayushk4/Tweet_tok_analyse/blob/master/Analyse_Features.ipynb
The features I have considered so far are:

- Parameters inside `replace_html_entities`:
  1. Entity_Not_Numeric (a named entity, e.g. `&amp;` => `&`)
  2. Entity_Base_10 (base-10 numeric encoding, e.g. `&#916;` => `Δ`)
  3. Entity_Base_16 (base-16 numeric encoding, e.g. `&#x394;` => `Δ`)
  4. Entity_WINDOWS_1252 (the entity is interpreted by the browser as Windows-1252 encoding)
- Parameters outside `replace_html_entities`:
  5. Strip_handles
  6. Reduce_len
  7. Safe_Text
Here is a short summary based on the tweets so far -

No tweet uses the following feature, which converts HTML entities in decimal format to Unicode characters, e.g. `&#916;` => `Δ`:
https://github.com/Ayushk4/WordTokenizers.jl/blob/4ec3f0acedc2df2150be7de25252b6d8be0c9be8/src/words/tweet_tokenizer.jl#L149

No tweet uses the following feature either, which is obvious from the observation above. It converts HTML entities in decimal format and in the range 128 to 159 according to the WINDOWS-1252 encoding, e.g. `&#134;` => `†`:
https://github.com/Ayushk4/WordTokenizers.jl/blob/4ec3f0acedc2df2150be7de25252b6d8be0c9be8/src/words/tweet_tokenizer.jl#L169

No tweet uses the following feature as well. It converts HTML entities in hexadecimal format to Unicode characters, e.g. `&#x394;` => `Δ`:
https://github.com/Ayushk4/WordTokenizers.jl/blob/4ec3f0acedc2df2150be7de25252b6d8be0c9be8/src/words/tweet_tokenizer.jl#L157
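For illustration, the decimal/hex/Windows-1252 handling above can be sketched like this (a toy reimplementation, not the package's actual code; the function name and the partial Windows-1252 table are made up for the example):

```julia
# A few of the Windows-1252 code points (128-159) that differ from Unicode.
const CP1252 = Dict{Int,Char}(
    134 => '†', 145 => '‘', 146 => '’', 147 => '“', 148 => '”', 150 => '–',
)

# Decode the body of a numeric HTML entity such as "#916" or "#x394".
function decode_numeric_entity(s::AbstractString)
    body = s[2:end]                                  # strip the leading '#'
    n = startswith(body, "x") || startswith(body, "X") ?
        parse(Int, body[2:end]; base=16) :           # hexadecimal form
        parse(Int, body; base=10)                    # decimal form
    # Browsers historically interpret 128-159 as Windows-1252, not Unicode.
    return 128 <= n <= 159 ? get(CP1252, n, Char(n)) : Char(n)
end

decode_numeric_entity("#916")    # 'Δ'  (decimal)
decode_numeric_entity("#x394")   # 'Δ'  (hexadecimal)
decode_numeric_entity("#134")    # '†'  (Windows-1252 range)
```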
I have yet to check the frequencies of the individual regexes in the main tokenizing regex. I will update you when that is done. Please give suggestions. Also, pardon the overkill with the plots if they were not needed.
This is great work. It aligns with my theory that in the early days of the Twitter API, Unicode/encoding issues were common, but they all seem to have been resolved now. That notebook might make the basis of an interesting blog post, particularly if you can find some more history.
I am done analysing the counts for the regex patterns that the main tokenizing regex uses. This main tokenizing pattern is made by joining multiple smaller patterns like `EMAIL_ADDRESSES` or `URLS`; the usage of each has been counted.

All the regex patterns were used at least once. Some, like `ASCII_ARROWS` (e.g. `-->`) and `HTML_TAGS`, were used less. The results show that the `WORDS_WITHOUT_APOSTROPHE_DASHES` pattern was used less than `WORDS_WITH_APOSTROPHE_DASHES`; however, this could be because some of the matches of the latter overlap with the former.
| Pattern | Count |
|---|---|
| PHONE_NUMBERS | 16 |
| EMOTICONS_REGEX | 119 |
| HTML_TAGS | 11 |
| ASCII_ARROWS | 16 |
| TWITTER_USERNAME | 3 |
| TWITTER_HASHTAGS | 1783 |
| EMAIL_ADDRESSES | 5 |
| NUMBERS_FRACTIONS_DECIMALS | 128 |
| ELLIPSIS_DOTS | 502 |
Here is the notebook, which contains the results.
I was working on improving the speed of the tokenizer. I took a dataset, ran the tokenizer on it, and analysed the time taken. I broke the tokenizer and its features down broadly into these parts -

- Average time for the full tokenizer: 271 μs
- Average time for the HTML entities feature: 116 μs
- Average time for the key tokenizing regex: 77 μs
- The optional strip_handle and reduce_len parameters together: 30 μs
I think it may be best to start with improving the performance of the HTML entities feature. I have a rough idea of what the implementation may look like: I will use the TokenBuffer by MikeInnes with some additional functions, and then run the key tokenizer on top of it. Please share your opinions on how I should proceed.
I'll have a think and get back to you.
For context, can you compare tokenization with `nltk_word_tokenize`? Obviously it's not exactly the same tokenizer, but it gives an order of magnitude for how fast we can go using the TokenBuffer.
I have looked at the new `nltk_word_tokenize` code and understand how it has been implemented with the `TokenBuffer`. I will start working on minimizing the time taken by the `Replace_HTML_Entities` function, and will reach out to you with updates on it.
Sounds good. I was suggesting benchmarking the timing.
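For reference, a bare-bones way to get such per-tweet timings in Base Julia (BenchmarkTools.jl's `@benchmark` gives more robust statistics; the stand-in tokenizer and dataset here are illustrative):

```julia
toy_tokenize(s) = split(s)                   # stand-in for the real tokenizer
tweets = fill("just a sample tweet here", 10_000)

toy_tokenize.(tweets)                        # warm up the JIT first
elapsed = @elapsed toy_tokenize.(tweets)
avg_us = elapsed / length(tweets) * 1e6      # average microseconds per tweet
```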
I have successfully added the fast `TokenBuffer` for the `replace_HTML_Entities` feature.

I used `@benchmark` on a function that tokenizes Kaggle's Tweet-Emoji competition dataset containing 70,000 tweets. I also separately benchmarked the `replace_HTML_Entities` function for the entire dataset. The following were the results.

`Tweet_tokenize` function (with both `strip_handle` and `reduce_len` optional features set to true):
```
minimum time: 1.059 s (37.04% GC)
median time:  1.150 s (41.29% GC)
mean time:    1.141 s (40.68% GC)
maximum time: 1.199 s (44.56% GC)
```

`Replace_HTML_Entities` function:

```
minimum time: 217.694 ms (17.89% GC)
median time:  245.647 ms (21.04% GC)
mean time:    279.942 ms (26.02% GC)
maximum time: 458.263 ms (29.89% GC)
```

That is an average of 4 μs per tweet for `Replace_HTML_Entities` and 16 μs per tweet for `Tweet_tokenize`.
I have uploaded the code as a gist, over here
Also, with this we can drop the dependency on StringEncodings
Please comment on this. Should I push the new faster tokenizer?
Yes, faster and with fewer dependencies is good. That is 4 μs vs 116 μs, so it is about 30x faster. Nice.
Now all that remains is the Main Tokenizing Regex to be taken care of :)
@oxinabox I just saw that `nltk_word.jl` also has 2-space indentation instead of 4 spaces. Should I send a separate patch or include it in this one? Which of these two do you recommend?

Separate, thanks. We need a style linter.
I am working on making the main tokenizing regex faster. To do this, I have gone through its regex pattern. I am trying to adopt a strategy similar to the one used for the HTML entities and pre-processing. However, this regex is a lot larger and more complex - see here. It is highly probable that I might accidentally miss some functionality once I start writing the code. Can you please give suggestions on this? What would be the best approach in this scenario? One possible approach I can think of would be increasing the coverage of the tests as I go through each pattern that makes up the main regex.
Yes, many tests. One thing to do might be to take the current datasets of real tweets you worked with before and use them as test cases. I think it would be good to use ReferenceTests.jl for this, which can tell you if your result has changed. It might take a while to run such a test, I'm not sure, but it is worth doing something like this while developing at least. We can work out putting it into the CI suite later.
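The idea behind reference testing, spelled out by hand (ReferenceTests.jl automates this; the stand-in tokenizer and file path here are illustrative):

```julia
toy_tokenize(s) = split(s)     # stand-in for the tweet tokenizer

# Tokenize a corpus and compare against a stored snapshot; on the first
# run the snapshot is recorded, and afterwards any behaviour change
# in the tokenizer shows up as a mismatch.
function check_reference(tweets, ref_path)
    got = join(join.(toy_tokenize.(tweets), " "), "\n")
    if !isfile(ref_path)
        write(ref_path, got)   # first run: record the reference output
        return true
    end
    return read(ref_path, String) == got
end
```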
> However this regex is a lot larger and more complex
An advantage of rewriting to the other TokenBuffer API is that hopefully it can be made clearer and easier to understand.
The DataType has not been written for some functions in TokenBuffer. Is there any reason for it? IIRC, adding it might (further?) improve the speed.
> The DataType has not been written for some functions in TokenBuffer. Is there any reason for it? IIRC, adding it might (further?) improve the speed.
Nope, type annotations do not improve speed in Julia. Everything is specialised by the JIT for the types it encounters when it encounters them, regardless of whether it is annotated or not.
For each feature, I will first test unit-wise, then integrate it with the previously added features and test again locally. Once these two rounds of testing are done, I will push that feature.
Is there any better approach you can suggest?
seems good to me
Can I make some changes in fast.jl, more specifically this function? https://github.com/JuliaText/WordTokenizers.jl/blob/4d877dc7b00ad0456fba152c1eeb559a2208d0d3/src/words/fast.jl#L217 It should support numbers with a sign (`+` or `-`) before them, e.g. `-21.45`, possibly behind an optional parameter (?).
Also, good news: only 5 features remain, and so far 60% of the test cases pass; the ones failing are due to the missing features.
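As an illustration of what the sign support might look like, here is a standalone character-scanning version (independent of the actual TokenBuffer API in fast.jl; the function name is made up for the example):

```julia
# Scan a signed number starting at index i of s; return the matched
# substring, or nothing if no number starts there.
function signed_number(s::AbstractString, i::Int)
    j = i
    # Accept an optional leading sign.
    if j <= lastindex(s) && (s[j] == '+' || s[j] == '-')
        j = nextind(s, j)
    end
    seen_digit = false
    # Consume digits, allowing '.' only once a digit has been seen.
    while j <= lastindex(s) && (isdigit(s[j]) || (s[j] == '.' && seen_digit))
        seen_digit |= isdigit(s[j])
        j = nextind(s, j)
    end
    seen_digit || return nothing       # a bare sign is not a number
    return s[i:prevind(s, j)]
end

signed_number("-21.45", 1)   # "-21.45"
```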
A really nice thing about this PR is that you are building all these new functions that work with the `TokenBuffer`, so later someone will be able to easily make their own tokenizer by mixing and matching rules. It would be super cool if you could write some contributor docs with tips on how to create your own tokenizer.
Sure! I will be happy to do that. :smile: Maybe we could get that to work in some way with the `tokenize` API being planned.
I have added support for words and shifted to TokenBuffer with this PR. The tests only fail for 1 case, which will be fixed once URLs and phone numbers are done. After that, I will test more extensively for bugs and performance, update the unit tests for nearly full line-wise code coverage, and also write contrib docs about creating your own tokenizer, as oxinabox suggested.
I have added the main URL function and all the tests seem to pass. What's left to be done in the coding part are the secondary URL function, `extra_phone_numbers`, and conversion to lowercase.

I have a couple of doubts here -

1. Should an option to convert to lowercase while tokenizing be implemented for all tokenizers/functions (i.e. including the ones in fast.jl, nltk_word.jl and toktok.jl)?
2. Can I include the regex in the docstrings for the TokenBuffer functions I have implemented? This could be useful when there is more than one function for the same thing - for example, a separate set of URL functions in toktok.jl.

Once these are done, I will move on to testing and documentation.
> Should an option to convert to lowercase while tokenizing be implemented for all tokenizers/functions (i.e. including the ones in fast.jl, nltk_word.jl and toktok.jl)?
I would not implement that option at all. It is not part of tokenization, and NLTK is being weird by allowing it.
> Can I include the regex in the docstrings for the TokenBuffer functions I have implemented? This could be useful when there is more than one function for the same thing - for example, a separate set of URL functions in toktok.jl.
If you want. Make sure to test them. You might even want to do it as a trait, e.g.

```julia
regex_equiv(::typeof(extra_phone_numbers)) = r"\d+"

"""
    extra_phone_numbers(tb)

Recognizes some more phone numbers.
Equivalent to the regex:
`$(regex_equiv(extra_phone_numbers))`
"""
function extra_phone_numbers(tb)
    # ...
end
```
Now only the tests and documentation part remain :)
I have made the suggested changes, you may review this PR.
Last tiny things then we can merge this.
The changes have been made and pushed. The CI tests also pass.
🎉
The tweet tokenizer has been added. The following is how the tokenizer works:

1. `replace_html_entities` is used to replace the HTML entities (e.g. `"Price: &pound;100"` becomes `"Price: £100"`).
2. Pre-processing: `"......"` becomes `"..."` and `"waaaaay"` becomes `"waaay"`; the Twitter handles are also optionally removed.
3. `preserve_case` by default is set to `true`. If it is set to `false`, then the tokenizer will downcase everything except for emoticons.

I have 2 questions related to this -

1. I am currently matching the regex in the `replace_html_entities` function twice for the ones that match the outer capturing group. Is there a way that I can use `replace` and pass all the matched subgroups into the `convert_entity` function inside it? Refer to https://github.com/Ayushk4/WordTokenizers.jl/blob/853331c43a1122690fdd32a78fd040034433c8d0/src/words/tweet_tokenizer.jl#L188 and lines 132-134 of the same file.
2. Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers are added, so maybe we can have examples of each.
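The length-reduction behaviour described above can be sketched with a single regex replacement (a toy version for illustration, not necessarily how the package implements it):

```julia
# Squash any run of 4 or more repeated characters down to 3,
# so "waaaaay" becomes "waaay" and "......" becomes "...".
reduce_lengthening(s::AbstractString) = replace(s, r"(.)\1{3,}" => s"\1\1\1")

reduce_lengthening("waaaaay")   # "waaay"
```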