OpenPecha / Botok

🏷 བོད་ཏོག [pʰøtɔk̚] Tibetan word tokenizer in Python
https://botok.readthedocs.io/
Apache License 2.0

AttributeError: 'NoneType' object has no attribute 'append' #67

Closed · eroux closed this 4 years ago

eroux commented 4 years ago

here's a bug:

  File "/Users/mm986/actibv2/actibpos.py", line 82, in lexiconsegment
    tokens = t.custom_pipeline('dummy', open_poti_tokenizer, actib_modifier, 'dummy')
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/text/text.py", line 165, in custom_pipeline
    return self.__process(preprocessor, tokenizer, modifier, formatter, tok_params)
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/text/text.py", line 181, in __process
    return pipeline.pipe_str(self.input)
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/text/pipelinebase.py", line 53, in pipe_str
    elts = self.pipes["tok"][self.tok](text)
  File "/Users/mm986/actibv2/actibpos.py", line 64, in open_poti_tokenizer
    return WT.tokenize(in_str)
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/tokenizers/wordtokenizer.py", line 87, in tokenize
    MergeDagdra().merge(tokens)
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/modifytokens/mergedagdra.py", line 27, in merge
    merged = self.merge_with_previous_token(token0, token1)
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/modifytokens/mergedagdra.py", line 54, in merge_with_previous_token
    merged = TokenMerge(token0, token1).merge()
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/modifytokens/tokenmerge.py", line 19, in merge
    self.merge_attrs()
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/modifytokens/tokenmerge.py", line 34, in merge_attrs
    self.__merge_syls_idx()
  File "/Users/mm986/anaconda3/lib/python3.7/site-packages/botok/modifytokens/tokenmerge.py", line 65, in __merge_syls_idx
    self.merged.syls_idx.append(new_syl)
AttributeError: 'NoneType' object has no attribute 'append'

I'll try to find a way to reproduce it easily

eroux commented 4 years ago

The bug can be reproduced with something along the lines of:

from botok import WordTokenizer

WT = WordTokenizer('GMD')
print(WT.tokenize("ༀ་པ་ཊུ་"))

drupchen commented 4 years ago

I don't know if you have updated to the latest botok version. Here is what I get; the Token.syls_idx attribute seems normal. Can you tell me which version of botok you are running? (By the way, the GMD profile has been removed from "vanilla" botok. The trie data is now in botok-data, and it only includes the POS profile.)

>>> import botok
>>> botok.__version__
'0.7.3'
>>> WT = botok.WordTokenizer("POS")
Building Trie:
    /home/drupchen/.local/lib/python3.6/site-packages/botok/resources/words/ancient.tsv
    /home/drupchen/.local/lib/python3.6/site-packages/botok/resources/words/exceptions.tsv
    /home/drupchen/.local/lib/python3.6/site-packages/botok/resources/words/uncompound_lexicon.tsv
    /home/drupchen/.local/lib/python3.6/site-packages/botok/resources/words/tsikchen.tsv
    /home/drupchen/.local/lib/python3.6/site-packages/botok/resources/words/dagdra.tsv
    /home/drupchen/.local/lib/python3.6/site-packages/botok/resources/particles.tsv
(15 s.)
>>> print(WT.tokenize("ༀ་པ་ཊུ་"))
[text: "ༀ་"
char_types: |SYMBOL|TSEK|
chunk_type: SYM
start: 0
len: 2

, text: "པ་ཊུ་"
text_cleaned: "པ་ཊུ་"
text_unaffixed: "པ་ཊུ་"
syls: ["པ", "ཊུ"]
pos: OOV
lemma: པ་ཊུ་
senses: | freq: 3, affixed: False, pos: OOV, lemma: པ་ཊུ་ |
char_types: |CONS|TSEK|SKRT_CONS|VOW|TSEK|
chunk_type: TEXT
freq: 3
skrt: True
syls_idx: [[0], [2, 3]]
syls_start_end: [{'start': 0, 'end': 2}, {'start': 2, 'end': 5}]
start: 2
len: 5

]
>>> 
eroux commented 4 years ago

Sorry, the script that makes my version crash is:

from botok import Text, WordTokenizer

WT = WordTokenizer('GMD')
print(WT.tokenize("ༀ་པ་"))

more in a minute

drupchen commented 4 years ago

I was able to reproduce the bug:

>>> import botok
>>> botok.__version__
'0.7.3'
>>> WT = botok.WordTokenizer("POS")
>>> tokens = WT.tokenize("ༀ་པ་ཊུ་")
>>> botok.TokenMerge(tokens[0], tokens[1]).merge()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/drupchen/.local/lib/python3.6/site-packages/botok/modifytokens/tokenmerge.py", line 19, in merge
    self.merge_attrs()
  File "/home/drupchen/.local/lib/python3.6/site-packages/botok/modifytokens/tokenmerge.py", line 34, in merge_attrs
    self.__merge_syls_idx()
  File "/home/drupchen/.local/lib/python3.6/site-packages/botok/modifytokens/tokenmerge.py", line 86, in __merge_syls_idx
    self.merged.syls_idx.append(new_syl)
AttributeError: 'NoneType' object has no attribute 'append'

Then, looking into the first token:

>>> tokens[0]
text: "ༀ་"
char_types: |SYMBOL|TSEK|
chunk_type: SYM
start: 0
len: 2

We see there is no "textual content" with regular letters, so I guess botok does not consider there to be any content that would end up in .syls_idx, since tseks are left out of the chars that end up in syllables. You can see that from the next token.

>>> tokens[1].syls_idx
[[0], [2, 3]]
>>> tokens[1].syls
[['པ'], ['ཊ', 'ུ']]

In short, it seems that you have "om" as a single character, which ends up in the SYMBOL category of the Unicode characters table.
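
To make the two forms concrete, here is a minimal standard-library sketch (not botok code); the expanded spelling ཨོཾ (U+0F68 U+0F7C U+0F7E) is my assumption of what the deconstructed, multi-character version refers to:

import unicodedata

# Illustration only: single-character "om" vs. the expanded spelling.
om_single = "\u0F00"                # ༀ  TIBETAN SYLLABLE OM
om_expanded = "\u0F68\u0F7C\u0F7E"  # ཨོཾ  letter A + vowel sign O + anusvara

for char in om_single + om_expanded:
    print(f"U+{ord(char):04X}  {unicodedata.name(char)}")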

drupchen commented 4 years ago

This leaves me with an unexpected merge situation: merging a token containing no syllable with a token that contains syllables.

The fix is then simply:

if not self.merged.syls_idx:
    self.merged.syls_idx = []

before the line that triggered the error:

self.merged.syls_idx.append(new_syl)
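
Read together, the patched spot would look something like this (a sketch combining the two snippets above, not the full __merge_syls_idx method):

# Initialize syls_idx lazily so that merging a syllable-less token
# (e.g. the SYM chunk "ༀ་") with a textual token does not crash.
if not self.merged.syls_idx:
    self.merged.syls_idx = []
self.merged.syls_idx.append(new_syl)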

The easy fix for your usecase is to normalize the OM. That will allow you to keep using the same botok version with the same profile, and thus the same segmentation (the POS profile does not contain the Monlam dictionary, so it has far fewer words and will output different results).
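
For instance, a one-liner along these lines (again assuming the expanded spelling ཨོཾ; the in_str name just follows the traceback above) would normalize the input before tokenizing:

in_str = in_str.replace("\u0F00", "\u0F68\u0F7C\u0F7E")  # ༀ -> ཨོཾ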

In the meantime, I'll make a release with the bugfix.

eroux commented 4 years ago

Oh ok, I see, thanks for the investigation! I don't think oM in one character should be treated as a symbol... at least it shouldn't be treated differently from the oM in multiple characters.

drupchen commented 4 years ago

that would amount to modifying this line.

The problem is that it would end up being a non-word token, unless you include it in a custom profile to create a custom trie. So its treatment will differ from the multiple-chars version in any case.

Personally, I would prefer to keep considering it as a symbol, because botok has expected fully deconstructed input since the beginning, and I like to keep this strict policy (if I'm not wrong, that was the conclusion of our discussions at the time).

If we were to change its category from symbol, things would get conceptually complicated, because it falls within none of the implemented categories. Creating a new category just for that one character seems overkill to me... or is it just me?

I would tend to simply allow the merging of non-textual tokens with textual tokens, knowing that only textual content will end up in the .syls and .syls_idx attributes... which makes sense.

Tell me what you think of it.

eroux commented 4 years ago

I think mandating the input to have oM in expanded form is fine, but then it should throw an error with a didactic error message when it encounters a malformed input (in the form of a U+0F00). It should maybe also provide a conversion function (for all relevant characters, not just oM).
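
Such a conversion function might look roughly like this; the function name and the character table are purely hypothetical, and the full list of relevant characters is exactly what gets asked about further down:

# Hypothetical sketch, not botok API: expand precomposed Tibetan
# characters into their deconstructed spellings before tokenizing.
EXPANSIONS = {
    "\u0F00": "\u0F68\u0F7C\u0F7E",  # ༀ -> ཨོཾ
    # ... other relevant precomposed characters would be listed here
}

def expand_precomposed(text: str) -> str:
    for composed, expanded in EXPANSIONS.items():
        text = text.replace(composed, expanded)
    return text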

drupchen commented 4 years ago

Right, but that goes beyond the purpose of botok, which is to tokenize text. Normalization and conversion should either be made into distinct projects or end up in (the new) pybo.

eroux commented 4 years ago

ok, then an error is fine

drupchen commented 4 years ago

On second thought, complementing the Unicode table with all the existing non-expanded characters and then adding a kind, didactic message such as the following might be a good solution. The tokenization would then run without error, but the user would be notified ahead of time of the "danger":

"beware the tokenization might not be what you expect because \<char-with-left-and-right-context> has a non-expanded version of \<char>"

How does that sound? Would you know where I can get the corresponding list of characters, or how to find them?

eroux commented 4 years ago

Sounds good, yes! Here's a starting point:

https://github.com/OpenPecha/openpecha-toolkit/blob/e3dd97831fa8cc5fd7f1cd48ebbe97aff826c1fb/openpecha/formatters/formatter.py#L42

drupchen commented 4 years ago

Fixed in 2793e6d5a59ac0d8bfcf2945facc35a90d762a29 and 1e748b9184bcb45ef4de06354c6e579716cfa376.