epfl-dlab / transformers-CFG

πŸ€— A specialized library for integrating context-free grammars (CFG) in EBNF with the Hugging Face Transformers
http://saibo-creator.xyz:7860/
MIT License

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 #52

Open dhananjaybhandiwad opened 5 months ago

dhananjaybhandiwad commented 5 months ago

Dear Author, I encountered the error below when I tried to run the script file token_grammar_recognizer.py

Traceback (most recent call last):
  File "d:\transformers-CFG\transformers_cfg\token_grammar_recognizer.py", line 288, in <module>
    input_text = file.read()
  File "C:\Users\dhana\miniconda3\envs\decoding\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 163: character maps to <undefined>

This is the main file:

if __name__ == "__main__":
    from transformers import AutoTokenizer

    with open("D:/transformers-CFG/examples/grammars/japanese.ebnf", "r") as file:
        input_text = file.read()
    parsed_grammar = parse_ebnf(input_text)
    parsed_grammar.print()

    tokenizer = AutoTokenizer.from_pretrained("gpt2")

    tokenRecognizer = IncrementalTokenRecognizer(
        grammar_str=input_text,
        start_rule_name="root",
        tokenizer=tokenizer,
        unicode=True,
    )

    japanese = "γƒˆγƒͺγƒΌγƒ "  # "こんにけは"
    token_ids = tokenizer.encode(japanese)
    # 13298, 12675, 12045, 254
    init_state = None
    state = tokenRecognizer._consume_token_ids(token_ids, init_state, as_string=False)

    if state.stacks:
        print("The Japanese input is accepted")
    else:
        print("The Japanese input is not accepted")

Could you please help me with this issue?

Saibo-creator commented 5 months ago

Hello @dhananjaybhandiwad,

Thank you for raising this issue. I did not encounter any problems when running your script. It might be a versioning issue. Could you please check which version of the package you are using?

On my side, installing directly from PyPI with pip install transformers_cfg, I get

(transformers-cfg-pypi) ➜  transformers-CFG-dev git:(main) βœ— pip show transformers_cfg
Name: transformers_cfg
Version: 0.2.1
Summary: Extension of Transformers library for Context-Free Grammar Constrained Decoding with EBNF grammars
Home-page: https://github.com/epfl-dlab/transformers-CFG
Author: EPFL-dlab
Author-email: saibo.geng@epfl.ch
License:
Location: /opt/anaconda3/envs/transformers-cfg-pypi/lib/python3.8/site-packages
Requires: line-profiler, numpy, protobuf, sentencepiece, setuptools, termcolor, tokenizers, torch, transformers
Required-by:

dhananjaybhandiwad commented 5 months ago

Hello @Saibo-creator, I cloned the latest repo to modify some code in the parser.py file to accommodate the SPARQL grammar. I then tried running token_grammar_recognizer.py to see how the system works, and it threw the error mentioned in my previous comment.

To check whether my changes in parser.py had affected token_grammar_recognizer.py, I reverted all my changes and ran the unchanged version of parser.py; the error still persisted.

The error also persists when I run parser.py independently, specifically while parsing the japanese.ebnf file.
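For what it's worth, the failing decode can be reproduced outside the repo: byte 0x81 from the traceback appears in the UTF-8 encoding of hiragana characters, and cp1252 (the Windows locale default the traceback shows) leaves that byte undefined. A minimal, repo-independent sketch:

```python
# The byte from the traceback (0x81) shows up inside UTF-8-encoded
# hiragana; cp1252 has no mapping for it, so decoding fails the same way.
data = "こんにけは".encode("utf-8")
assert 0x81 in data  # 'こ' encodes to b'\xe3\x81\x93'

try:
    data.decode("cp1252")
except UnicodeDecodeError as exc:
    print(exc)  # 'charmap' codec can't decode byte 0x81 ...
```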

Saibo-creator commented 5 months ago

Hello @dhananjaybhandiwad, I just cloned the latest version (commit 86eccd) and was able to run your script without encountering a problem. I think it may be a platform-specific issue (not sure at all). Are you using Windows? If so, could you try WSL?
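As an alternative to switching platforms: the traceback shows Python falling back to cp1252 (the Windows locale default) while the grammar file is UTF-8, so passing encoding="utf-8" to open() should make the read platform-independent. A minimal sketch, with a temp file standing in for japanese.ebnf:

```python
import os
import tempfile

# Stand-in for examples/grammars/japanese.ebnf: a UTF-8 file with Japanese text.
text = 'root ::= "γƒˆγƒͺγƒΌγƒ "\n'
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".ebnf", delete=False
) as f:
    f.write(text)
    path = f.name

# Without encoding="utf-8", open() uses the locale's preferred encoding
# (cp1252 on most Windows setups), which cannot decode these bytes.
with open(path, "r", encoding="utf-8") as file:
    input_text = file.read()

os.unlink(path)
print(input_text == text)  # True
```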