diasks2/pragmatic_tokenizer
A multilingual tokenizer to split a string into tokens
MIT License
90 stars
11 forks
issues
Master dev/7 numbered lists
#48
abrazzini
closed
3 years ago
0
Master dev/2 strip tags
#47
giovannelli
closed
3 years ago
0
& symbol and URL's downcase
#46
giovannelli
closed
3 years ago
1
Adding rules for tokenization of words with apostrophes in french
#45
taha-yassine
closed
4 years ago
0
Replicated in Crystal
#44
watzon
opened
5 years ago
1
Non-breaking spaces should STILL be spaces
#43
wflanagan
closed
3 years ago
2
downcase: false shouldn't mean upcase for contractions
#42
sheerun
opened
5 years ago
0
Contractions don't remove dots
#41
sheerun
opened
5 years ago
0
multiple slashes within a string not properly processed
#40
maia
opened
5 years ago
0
speed improvements by optimisation of regular expressions
#39
maia
closed
6 years ago
1
lower memory usage by reducing object allocations
#38
maia
closed
6 years ago
1
NoMethodError (nil.length)
#37
maia
closed
6 years ago
2
fix deprecated warning for Ruby 2.4
#36
mmacia
closed
7 years ago
3
EMOJI_REGEX exception on JRuby
#35
Arvinje
opened
8 years ago
1
stop words not replaceable
#34
maia
closed
8 years ago
1
urls should not be downcased
#33
maia
opened
8 years ago
1
long_word_split should not split emails, urls, twitter handles
#32
maia
closed
8 years ago
1
stop words and filter languages
#31
maia
closed
8 years ago
2
unifying regex, using constants
#30
maia
closed
8 years ago
1
refactored PostProcessor
#29
maia
closed
8 years ago
5
cleanup pre_processor.rb
#28
maia
closed
8 years ago
1
Speed
#27
diasks2
closed
8 years ago
3
refactoring to style guide
#26
maia
closed
8 years ago
5
Properly detect emoticons
#25
diasks2
opened
8 years ago
2
characters test string
#24
maia
closed
6 years ago
2
mapping of similar characters (e.g. apostrophes)?
#23
maia
opened
8 years ago
1
more specs
#22
maia
closed
8 years ago
2
more specs
#21
maia
closed
8 years ago
2
Identifying emojis by unicode ranges?
#20
maia
closed
6 years ago
4
Should all TLDs be whitelisted?
#19
diasks2
opened
8 years ago
1
Definition of clean
#18
diasks2
closed
8 years ago
2
additional specs
#17
maia
closed
8 years ago
10
splitting of words with # prefix at hyphen
#16
maia
closed
8 years ago
4
classic_filter and non-acronyms
#15
maia
closed
8 years ago
1
single quotes return different result based on language setting
#14
maia
closed
8 years ago
1
remove_numbers should keep tokens that contain letters
#13
maia
closed
8 years ago
1
option :clean removes hashtags
#12
maia
closed
8 years ago
1
split long words
#11
maia
closed
8 years ago
1
three options for each kind of token
#10
maia
closed
8 years ago
5
feature overlap with pragmatic_segmenter?
#9
maia
opened
8 years ago
1
Allow user to specify abbreviations and/or stop words to be used
#8
diasks2
closed
8 years ago
1
slow loading time
#7
maia
closed
8 years ago
3
ActiveSupport::Multibyte::Chars causing NoMethodError
#6
maia
closed
8 years ago
5
option to require only specific languages?
#5
maia
opened
8 years ago
2
german contractions list
#4
maia
closed
8 years ago
2
updated german abbreviations
#3
maia
closed
8 years ago
3
options should (also) allow symbols
#2
maia
closed
8 years ago
1
additional specs
#1
maia
closed
8 years ago
12