Closed Ayushk4 closed 5 years ago
Merging #13 into master will decrease coverage by 4.98%. The diff coverage is 72%.

```
@@            Coverage Diff             @@
##           master      #13      +/-  ##
==========================================
- Coverage   80.86%   75.88%    -4.99%
==========================================
  Files           9       10        +1
  Lines         277      651      +374
==========================================
+ Hits          224      494      +270
- Misses         53      157      +104
```

| Impacted Files | Coverage Δ |
|---|---|
| src/split_api.jl | 0% <ø> (ø) |
| src/words/fast.jl | 81.81% <66.66%> (+0.56%) |
| src/words/tweet_tokenizer.jl | 72.04% <72.04%> (ø) |

Continue to review the full report at Codecov. (Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.)

Powered by Codecov. Last update 4d877dc...c7bd296.
For record-keeping purposes: this addresses #3.
Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers will be added, so maybe we can have examples on each.
My feeling is that the tokenizers will remain simple enough that they can all go in the README. At the point at which this changes, we can look at a solution like Documenter.jl and a documentation page. But I find that, in general, dealing with that is a surprisingly constant source of complexity for a project.
You'll need to rebase; it should be a simple rebase. Also, how do you feel about restructuring this to use the fast new TokenBuffer that was just merged? https://github.com/JuliaText/WordTokenizers.jl/blob/master/src/words/fast.jl
I need more time to think about improving the performance of this tokenizer. I have been thinking that there may be a way to do it. After looking at fast.jl and the new version of nltk_tokenizer, I believe a similar approach may be possible. This tokenizer has more features than the nltk tokenizer, and I am having difficulty visualizing a way to improve the performance without compromising on any one feature. I also believe this tokenizer could incorporate some features of nltk_tokenize (with a little compromise on speed for more robust tokenizing).
I'm adding unit tests for the current code, and making some minor changes to improve readability along with minor performance tweaks. I will get back to working on this as soon as something strikes me.
I'm interested in working on adding dependency/constituency parsing. I will be working on that in the meantime.
I have added the tests and I have made the specified changes.
I've been thinking about this for a while. What we could do is get a couple of twitter dumps (ideally from different sources) then add some extra stuff to count how often each feature of this tokenizer is actually used when we tokenize those twitter corpora. Then we could remove features that are not used. E.g. I suspect that a lot of the stuff to handle weird encoding things is not common anymore.
But I think that does not help with translating a lot of the regex "atoms".
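A rough sketch of the feature-usage counting idea (the patterns, names, and sample tweets here are illustrative toys, not the package's actual regexes):

```julia
# Count how often each regex "feature" fires over a corpus of tweets.
patterns = Dict(
    "TWITTER_HASHTAGS" => r"#\w+",
    "TWITTER_USERNAME" => r"@\w+",
    "URLS"             => r"https?://\S+",
)

tweets = [
    "Check out #JuliaLang at https://julialang.org",
    "@someone this is #fun",
]

# For each named pattern, count total matches across the whole corpus.
counts = Dict(
    name => sum(t -> length(collect(eachmatch(pat, t))), tweets)
    for (name, pat) in patterns
)
```

Features whose count stays at zero over a large, diverse corpus would then be candidates for removal.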
> Then we could remove features that are not used. E.g. I suspect that a lot of the stuff to handle weird encoding things is not common anymore.
This might very well be the case. While adding the tokenizer, I intended to implement it as-is from nltk's tweet_tokenizer, but faster. I do think that some features may not be used much.
> What we could do is get a couple of twitter dumps (ideally from different sources) then add some extra stuff to count how often each feature of this tokenizer is actually used when we tokenize those twitter corpora.
This seems promising. I will start working on it and update you with the progress.
I found some datasets on Kaggle made of recent tweets. I also looked up the Twitter API endpoints: through the Twitter Search API, I can get recent tweets containing a searched keyword. I will be evaluating the regex on the Kaggle tweets, but how should I go about the latter, given that the Search API only gives me back tweets containing the keywords I searched for?
Let's do a search for Emoji and Unicode and "Zango Text" and "Twitter Error"
I feel like that would be good.
I have obtained some recent tweets via the Twitter Search API. No tweets were found by the search query "Zango text" in the past 7-day window. I also have added a couple of search queries on my own. I have my work over here - https://github.com/Ayushk4/Tweet_tok_analyse. I will get back with the details on how often each feature of this tokenizer is used.
`Tweets_Extractor.ipynb` contains the code used to search tweets. The keys will have to be regenerated for accessing the Twitter Search API. `Twitter_API_tweets.json` has the result stored in a JSON file.

Please give suggestions, if any.
This looks sensible to me. (though that JSON file seems like it has been over-escaped?)
Also, you committed your API credentials to GitHub. They are now compromised; you need to void them on the Twitter website. (I suggest, in future, putting such things into environment variables and reading them out with `ENV["TWITTER_API_KEY"]`.)
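A minimal sketch of that pattern (the helper name is illustrative):

```julia
# Read the API key from an environment variable instead of hard-coding it
# into a notebook that gets committed.
function twitter_api_key()
    key = get(ENV, "TWITTER_API_KEY", nothing)
    key === nothing && error("TWITTER_API_KEY is not set")
    return key
end
```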
I regenerated my API credentials on Twitter before committing, so the current ones on GitHub have been nullified. I will use environment variables from next time. Thank you for the suggestion.
Sounds like you are already on top of it. :-D
I have analysed the frequency of the features. You can find them here - https://github.com/Ayushk4/Tweet_tok_analyse/blob/master/Analyse_Features.ipynb
The features I have considered so far are:

- Parameters inside `replace_html_entities`:
  1. Entity_Not_Numeric (a named entity, e.g. `&amp;` => `&`)
  2. Entity_Base_10 (base-10 numeric encoding, e.g. `&#916;` => `Δ`)
  3. Entity_Base_16 (base-16 numeric encoding, e.g. `&#x394;` => `Δ`)
  4. Entity_WINDOWS_1252 (the entity is interpreted by the browser as Windows-1252 encoding)
- Parameters outside `replace_html_entities`:
  5. Strip_handles
  6. Reduce_len
  7. Safe_Text
Here is a short summary based on the tweets so far -

No tweet uses the following feature, which converts HTML entities in decimal format to Unicode characters, e.g. `&#916;` => `Δ`:
https://github.com/Ayushk4/WordTokenizers.jl/blob/4ec3f0acedc2df2150be7de25252b6d8be0c9be8/src/words/tweet_tokenizer.jl#L149

No tweet uses the following feature either, which is obvious from the observation above. It converts HTML entities in decimal format and in the range 128 to 159 according to the WINDOWS-1252 encoding, e.g. `&#134;` => `†`:
https://github.com/Ayushk4/WordTokenizers.jl/blob/4ec3f0acedc2df2150be7de25252b6d8be0c9be8/src/words/tweet_tokenizer.jl#L169

No tweet uses the following feature as well. It converts HTML entities in hexadecimal format to Unicode characters, e.g. `&#x394;` => `Δ`:
https://github.com/Ayushk4/WordTokenizers.jl/blob/4ec3f0acedc2df2150be7de25252b6d8be0c9be8/src/words/tweet_tokenizer.jl#L157
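For illustration, the decimal/hex/Windows-1252 handling above can be sketched like this (a toy reimplementation, not the package's actual code; the function name and the partial Windows-1252 table are made up for the example):

```julia
# A few of the Windows-1252 code points (128-159) that differ from Unicode.
const CP1252 = Dict{Int,Char}(
    134 => '†', 145 => '‘', 146 => '’', 147 => '“', 148 => '”', 150 => '–',
)

# Decode the body of a numeric HTML entity such as "#916" or "#x394".
function decode_numeric_entity(s::AbstractString)
    body = s[2:end]                                  # strip the leading '#'
    n = startswith(body, "x") || startswith(body, "X") ?
        parse(Int, body[2:end]; base=16) :           # hexadecimal form
        parse(Int, body; base=10)                    # decimal form
    # Browsers historically interpret 128-159 as Windows-1252, not Unicode.
    return 128 <= n <= 159 ? get(CP1252, n, Char(n)) : Char(n)
end

decode_numeric_entity("#916")    # 'Δ'  (decimal)
decode_numeric_entity("#x394")   # 'Δ'  (hexadecimal)
decode_numeric_entity("#134")    # '†'  (Windows-1252 range)
```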
I have yet to check the frequencies of the individual regexes in the main tokenizing regex. I will update you when that is done. Please give suggestions. Also, pardon the overkill with the plots if they were not needed.
This is great work. It aligns with my theory that in the early days of the Twitter API, Unicode/encoding issues were common, but they all seem to have been resolved now. That notebook might make the basis of an interesting blog post, particularly if you can find some more history.
I am done analysing the counts for the regex patterns that the main tokenizing regex uses. This main tokenizing pattern is made by joining multiple smaller patterns like `EMAIL_ADDRESSES` or `URLS`; the usage of each has been counted.

All the regex patterns were used at least once. Some, like `ASCII_ARROWS` (e.g. `-->`) and `HTML_TAGS`, were used less. The results show that the `WORDS_WITHOUT_APOSTROPHE_DASHES` pattern was used less than `WORDS_WITH_APOSTROPHE_DASHES`; however, this could be because some of the matches of the latter overlap with the former.
| Pattern | Count |
|---|---|
| PHONE_NUMBERS | 16 |
| EMOTICONS_REGEX | 119 |
| HTML_TAGS | 11 |
| ASCII_ARROWS | 16 |
| TWITTER_USERNAME | 3 |
| TWITTER_HASHTAGS | 1783 |
| EMAIL_ADDRESSES | 5 |
| NUMBERS_FRACTIONS_DECIMALS | 128 |
| ELLIPSIS_DOTS | 502 |
Here is the notebook, which contains the results.
I was working on improving the speed of the tokenizer. I took a dataset, ran the tokenizer on it, and analysed the time taken. I broke the tokenizer and its features down broadly into these parts -

- Average time for the full tokenizer: 271 μs
- Average time for the HTML entities feature: 116 μs
- Average time for the key tokenizing regex: 77 μs
- The optional strip_handle and reduce_len parameters together: 30 μs
I think it may be best to start with improving the performance of the HTML entities feature. I have a rough idea of what the implementation may look like: I will use the TokenBuffer by MikeInnes with some additional functions, and then run the key tokenizer on top of it. Please share your opinions on how I should proceed.
I'll have a think and get back to you.
For context, can you compare tokenization with `nltk_word_tokenize`? Obviously it's not exactly the same tokenizer, but it gives an order of magnitude for how fast we can go using the TokenBuffer.
I have looked at the new `nltk_word_tokenize` code and understand how it has been implemented with the `TokenBuffer`. I will start working on minimizing the time taken by the `Replace_HTML_Entities` function, and will reach out to you with updates on it.
Sounds good. I was suggesting benchmarking the timing.
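For reference, a bare-bones way to get such per-tweet timings in Base Julia (BenchmarkTools.jl's `@benchmark` gives more robust statistics; the stand-in tokenizer and dataset here are illustrative):

```julia
toy_tokenize(s) = split(s)                   # stand-in for the real tokenizer
tweets = fill("just a sample tweet here", 10_000)

toy_tokenize.(tweets)                        # warm up the JIT first
elapsed = @elapsed toy_tokenize.(tweets)
avg_us = elapsed / length(tweets) * 1e6      # average microseconds per tweet
```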
I have successfully added the fast `TokenBuffer` for the `replace_HTML_Entities` feature.

I used `@benchmark` on a function that tokenizes Kaggle's Tweet-Emoji competition dataset containing 70,000 tweets. I also separately benchmarked the `replace_HTML_Entities` function for the entire dataset. The following were the results.

`Tweet_tokenize` function (with both `strip_handle` and `reduce_len` optional features set to true):
```
minimum time: 1.059 s (37.04% GC)
median time:  1.150 s (41.29% GC)
mean time:    1.141 s (40.68% GC)
maximum time: 1.199 s (44.56% GC)
```

`Replace_HTML_Entities` function:

```
minimum time: 217.694 ms (17.89% GC)
median time:  245.647 ms (21.04% GC)
mean time:    279.942 ms (26.02% GC)
maximum time: 458.263 ms (29.89% GC)
```

That is an average of 4 μs per tweet for `Replace_HTML_Entities` and 16 μs per tweet for `Tweet_tokenize`.
I have uploaded the code as a gist, over here
Also, with this we can drop the dependency on StringEncodings
Please comment on this. Should I push the new faster tokenizer?
Yes, faster and with fewer dependencies is good. That is 4 μs vs 116 μs, so it is about 30x faster. Nice.
Now all that remains is the Main Tokenizing Regex to be taken care of :)
@oxinabox I just saw that `nltk_word.jl` also has 2-space indentation instead of 4 spaces. Should I send a separate patch or include it in this one? Which of these two do you recommend?

Separate, thanks. We need a style linter.
I am working on making the main tokenizing regex faster. To do this, I have gone through its regex pattern. I am trying to adopt a strategy similar to the one used for the HTML entities and pre-processing. However, this regex is a lot larger and more complex - see here. It is highly probable that I might accidentally miss some functionality once I start writing the code. Can you please give suggestions on this? What would be the best approach in this scenario? One possible approach I can think of would be increasing the coverage of the tests as I go through each pattern that makes up the main regex.
Yes, many tests. One thing to do might be to take the current datasets of real tweets you worked with before and use them as test cases. I think it would be good to use ReferenceTests.jl for this, which can tell you if your result has changed. It might take a while to run such a test, I'm not sure, but it is worth doing something like this while developing at least. We can work out putting it into the CI suite later.
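The idea behind reference testing, spelled out by hand (ReferenceTests.jl automates this; the stand-in tokenizer and file path here are illustrative):

```julia
toy_tokenize(s) = split(s)     # stand-in for the tweet tokenizer

# Tokenize a corpus and compare against a stored snapshot; on the first
# run the snapshot is recorded, and afterwards any behaviour change
# in the tokenizer shows up as a mismatch.
function check_reference(tweets, ref_path)
    got = join(join.(toy_tokenize.(tweets), " "), "\n")
    if !isfile(ref_path)
        write(ref_path, got)   # first run: record the reference output
        return true
    end
    return read(ref_path, String) == got
end
```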
> However this regex is a lot larger and more complex
An advantage of rewriting to the other TokenBuffer API is that hopefully it can be made clearer and easier to understand.
The DataType has not been written for some functions in TokenBuffer. Is there any reason for it? IIRC, adding it might (further?) improve the speed.
> The DataType has not been written for some functions in TokenBuffer. Is there any reason for it? IIRC, adding it might (further?) improve the speed.
Nope, type annotations do not improve speed in Julia. Everything is specialised by the JIT for the types it encounters when it encounters them, regardless of whether it is annotated or not.
For each feature, I will first test unit-wise, then integrate it with the previously added features and test again locally. Once these two rounds of testing are done, I will push that feature.
Is there any better approach you can suggest?
seems good to me
Can I make some changes in fast.jl, more specifically this function? https://github.com/JuliaText/WordTokenizers.jl/blob/4d877dc7b00ad0456fba152c1eeb559a2208d0d3/src/words/fast.jl#L217 It should support numbers with a sign (`+` or `-`) before them, e.g. `-21.45`, possibly behind an optional parameter (?).
Also, good news: only 5 features remain, and so far 60% of the test cases pass; the ones failing are due to the missing features.
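As an illustration of what the sign support might look like, here is a standalone character-scanning version (independent of the actual TokenBuffer API in fast.jl; the function name is made up for the example):

```julia
# Scan a signed number starting at index i of s; return the matched
# substring, or nothing if no number starts there.
function signed_number(s::AbstractString, i::Int)
    j = i
    # Accept an optional leading sign.
    if j <= lastindex(s) && (s[j] == '+' || s[j] == '-')
        j = nextind(s, j)
    end
    seen_digit = false
    # Consume digits, allowing '.' only once a digit has been seen.
    while j <= lastindex(s) && (isdigit(s[j]) || (s[j] == '.' && seen_digit))
        seen_digit |= isdigit(s[j])
        j = nextind(s, j)
    end
    seen_digit || return nothing       # a bare sign is not a number
    return s[i:prevind(s, j)]
end

signed_number("-21.45", 1)   # "-21.45"
```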
A really nice thing about this PR is that you are building all these new functions that work with the `TokenBuffer`, so later someone will be able to easily make their own tokenizer by mixing and matching rules. It would be super cool if you could write some contributor docs with tips on how to create your own tokenizer.
Sure! I will be happy to do that. :smile: Maybe we could get that to work in some way with the `tokenize` API being planned.
I have added support for words and shifted to TokenBuffer with this PR. The tests only fail for 1 case, which will be fixed once URLs and phone numbers are done. After that, I will test more extensively for bugs and performance, update the unit tests for nearly full line-wise code coverage, and also write contrib docs about creating your own tokenizer, as oxinabox suggested.
I have added the main URL function and all the tests seem to pass. What's left to be done in the coding part are the secondary URL function, `extra_phone_numbers`, and conversion to lowercase.

I have a couple of doubts here -

1. Should an option to convert to lowercase while tokenizing be implemented for all tokenizers/functions (i.e. including the ones in fast.jl, nltk_word.jl and toktok.jl)?
2. Can I include the regex in the docstrings for the TokenBuffer functions I have implemented? This could be useful when there is more than one function for the same thing - for example, a separate set of URL functions in toktok.jl.

Once these are done, I will move on to testing and documentation.
> Should an option to convert to lowercase while tokenizing be implemented for all tokenizers/functions (i.e. including the ones in fast.jl, nltk_word.jl and toktok.jl)?
I would not implement that option at all. It is not part of tokenization, and NLTK is being weird by allowing it.
> Can I include the regex in the docstrings for the TokenBuffer functions I have implemented? This could be useful when there is more than one function for the same thing - for example, a separate set of URL functions in toktok.jl.
If you want. Make sure to test them. You might even want to do it as a trait, e.g.

```julia
regex_equiv(::typeof(extra_phone_numbers)) = r"\d+"

"""
    extra_phone_numbers(tb)

Recognizes some more phone numbers.
Equivalent to the regex:
`$(regex_equiv(extra_phone_numbers))`
"""
function extra_phone_numbers(tb)
    # ...
end
```
Now only the tests and documentation part remain :)
I have made the suggested changes, you may review this PR.
Last tiny things then we can merge this.
The changes have been made and pushed. The CI tests also pass.
🎉
The tweet tokenizer has been added. The following is how the tokenizer works:

1. `replace_html_entities` is used to replace the HTML entities (e.g. `"Price: &pound;100"` becomes `"Price: £100"`).
2. Pre-processing: `"......"` becomes `"..."` and `"waaaaay"` becomes `"waaay"`; the Twitter handles are also optionally removed.
3. `preserve_case` by default is set to `true`. If it is set to `false`, then the tokenizer will downcase everything except for emoticons.

I have 2 questions related to this -

1. I am currently matching the regex in the `replace_html_entities` function twice for the ones that match the outer capturing group. Is there a way that I can use `replace` and pass all the matched subgroups into the `convert_entity` function inside it? Refer to https://github.com/Ayushk4/WordTokenizers.jl/blob/853331c43a1122690fdd32a78fd040034433c8d0/src/words/tweet_tokenizer.jl#L188 and lines 132-134 of the same file.
2. Should we host documentation for this repository similar to https://juliatext.github.io/TextAnalysis.jl/? I feel that this will grow as more tokenizers are added, so maybe we can have examples of each.
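The length-reduction behaviour described above can be sketched with a single regex replacement (a toy version for illustration, not necessarily how the package implements it):

```julia
# Squash any run of 4 or more repeated characters down to 3,
# so "waaaaay" becomes "waaay" and "......" becomes "...".
reduce_lengthening(s::AbstractString) = replace(s, r"(.)\1{3,}" => s"\1\1\1")

reduce_lengthening("waaaaay")   # "waaay"
```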