KichangKim / DeepDanbooru

AI based multi-label girl image classification system, implemented by using TensorFlow.
MIT License

Error reading tags with Unicode in them #96

Closed Kayliii closed 3 months ago

Kayliii commented 1 year ago

The dataset I am using to build the tag database and tags.txt contains some characters that DeepDanbooru crashes on. Specifically, in my case, it does not like the letter ō, which produces the following error (abbreviated to show the relevant part):

  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\site-packages\deepdanbooru\data\dataset.py", line 7, in <genexpr>
    tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
  File "C:\Users\Kayli\AppData\Local\Programs\Python\Python310\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 6620: character maps to <undefined>

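ō is a single character encoded in UTF-8 as the two bytes c5 8d. If the decoder gets to 8d without understanding that it's part of the previous character, something has already gone wrong: the traceback shows the file being decoded as cp1252, where c5 is a character of its own and 8d is undefined.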
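
The failure can be reproduced without DeepDanbooru. The following is a minimal sketch using only the Python standard library; the variable names are illustrative and not taken from the project:

```python
# Reproduce the decode failure outside DeepDanbooru (illustrative names).
raw = "ō".encode("utf-8")
print(raw.hex(" "))  # c5 8d

cp1252_error = None
try:
    # cp1252 reads one byte per character and has no mapping for 0x8d.
    raw.decode("cp1252")
except UnicodeDecodeError as exc:
    cp1252_error = exc
    print(exc)

# Decoding the same bytes as UTF-8 recovers the original character.
decoded = raw.decode("utf-8")
print(decoded)
```

The error is raised on the second byte because cp1252 has already consumed c5 as a separate character.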

KichangKim commented 1 year ago

It may be a text file encoding issue.

If you can modify the Python code, try changing this function (https://github.com/KichangKim/DeepDanbooru/blob/05eb3c39b0fae43e3caf39df801615fe79b27c2f/deepdanbooru/data/dataset.py#L6) from

def load_tags(tags_path):
    with open(tags_path, "r") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags

to

def load_tags(tags_path):
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        tags = [tag for tag in (tag.strip() for tag in tags_stream) if tag]
        return tags
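
For context on why the explicit argument matters: when `encoding` is omitted, `open()` in text mode falls back to the locale's preferred encoding, which is cp1252 on many Windows installations, as the traceback shows. The sketch below round-trips a small tags file through a function equivalent to the patched one. The temporary file and the tag names are made up for illustration and do not come from the project or the reporter's dataset.

```python
import os
import tempfile


def load_tags(tags_path):
    # Equivalent to the patched function: decode explicitly as UTF-8
    # instead of relying on the platform default.
    with open(tags_path, "r", encoding="utf-8") as tags_stream:
        return [tag for tag in (tag.strip() for tag in tags_stream) if tag]


# Hypothetical tag names, written as UTF-8 to a temporary file.
sample_tags = ["ōkami", "solo", "smile"]
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
) as f:
    # Trailing blank lines are included to show that they are skipped.
    f.write("\n".join(sample_tags) + "\n\n")
    tmp_path = f.name

try:
    loaded = load_tags(tmp_path)
finally:
    os.remove(tmp_path)

print(loaded == sample_tags)
```

If tags.txt was saved with a byte order mark, which some Windows editors add, `encoding="utf-8-sig"` would strip it. That is a possible variation, not something this thread confirms is needed.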