Holmeyoung / crnn-pytorch

Pytorch implementation of CRNN (CNN + RNN + CTCLoss) for all language OCR.
MIT License

Error when encoding cpu_texts with custom dataset #23

Closed tumbleintoyourheart closed 5 years ago

tumbleintoyourheart commented 5 years ago

Hi Holmeyoung, I am facing this error when running train.py with a custom dataset: (screenshot: Annotation 2019-07-22 062747)

I tried `text = b''.join(text)` and it turned into another problem: (screenshot: Annotation 2019-07-22 063226)

My question is: what is the proper type of `cpu_texts` (tuple of str or tuple of bytes)? I think my custom lmdb dataset might be the problem, because `cpu_images, cpu_texts = data` returns a tuple of bytes.
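For what it's worth, one way to sidestep the type mismatch is to decode any bytes labels to UTF-8 strings before passing them to the label converter. This is only a sketch, not the repo's own code; `normalize_labels` is a hypothetical helper name:

```python
def normalize_labels(cpu_texts):
    # LMDB returns labels as bytes; the CTC label converter expects str,
    # so decode any bytes entries to UTF-8 and pass str entries through.
    return [t.decode('utf-8') if isinstance(t, bytes) else t
            for t in cpu_texts]

print(normalize_labels([b'2-13-2', 'abc']))  # ['2-13-2', 'abc']
```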

Holmeyoung commented 5 years ago

Hi, the code reads the data in a fixed format, so you must convert your data to the same format as mine for the code to recognize it. In the lmdb data, the image is stored in binary mode; see tool/create_dataset.py for details. Can you convert your data to the lmdb format the code expects?

tumbleintoyourheart commented 5 years ago

> Hi, the code reads the data in a fixed format, so you must convert your data to the same format as mine for the code to recognize it. In the lmdb data, the image is stored in binary mode; see tool/create_dataset.py for details. Can you convert your data to the lmdb format the code expects?

Hi, yeah, I used the create_dataset.py you provided directly to create my dataset. I have searched the original crnn.pytorch repo for these issues but haven't found the same one. It seems like it's my own problem, since other people don't encounter this.

Holmeyoung commented 5 years ago

Hi, how did you use create_dataset.py? Did you use file mode or folder mode? Can you tell me the data format and the steps you used to create the lmdb?

tumbleintoyourheart commented 5 years ago

> Hi, how did you use create_dataset.py? Did you use file mode or folder mode? Can you tell me the data format and the steps you used to create the lmdb?

I used the default option in create_dataset.py (not the 2 modes you provide); it takes 2 lists, image paths and labels, as input.

I store them in a json file like this:

```json
{
  "0": [{ "path": "/data_japan/images/005129_0.png", "class": "合同会社MMN東京都新宿区西早稲田3丁目19番5号" }],
  "1": [{ "path": "/data_japan/images/005129_1.png", "class": "有限会社東真東京都台東区橋場1丁目24番20号" }],
  "2": [{ "path": "/data_japan/images/005129_2.png", "class": "東伸製本有限会社東京都文京区目白台2丁目13番4号" }],
  "3": [{ "path": "/data_japan/images/005129_3.png", "class": "2-13-2" }],
```

Here is my code to create the dataset, using create_dataset.py:

```python
with open('./data_japan/labels.json', encoding='utf8') as label_json:
    label_data = json.load(label_json)

image_path_list = []
label_list = []
number_of_samples = len(label_data.keys())

for i in range(number_of_samples):
    image_path_list.append(label_data[str(i)][0]['path'][1:])
    label_list.append(label_data[str(i)][0]['class'])

train_path = 'dataset/train'
valid_path = 'dataset/valid'
number_of_valid = int(number_of_samples * 0.2)

createDataset(train_path, image_path_list[number_of_valid:], label_list[number_of_valid:], map_size=4800000000)
createDataset(valid_path, image_path_list[:number_of_valid], label_list[:number_of_valid], map_size=1200000000)
```

tumbleintoyourheart commented 5 years ago

And I saw in another issue that the loss would be inf if the text length is greater than 26. How did you determine this number?

Holmeyoung commented 5 years ago

Hi,

  1. You should use my create_dataset.py to create the lmdb. To make Chinese or Japanese work, I store the image and label in binary mode. If you create a normal lmdb but my code treats it as binary, there will be an error.
  2. In model/crnn.py the RNN input sequence length T is 26, which means the max label length is 26. If you want it to be 36 or larger, you should change the image resize width.
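For reference, T can be traced through the CNN in model/crnn.py. Assuming the standard CRNN layer configuration (two 2x2 max-pools, two pools with (2,2) kernel, (2,1) stride, (0,1) padding, and a final 2x2 conv with no padding), the feature-map width shrinks from imgW=100 to T=26. A sketch of that arithmetic:

```python
def seq_length(imgW):
    """Trace the feature-map width through the CRNN CNN (standard config)."""
    w = imgW
    w //= 2                  # 2x2 max-pool after conv1: width halved
    w //= 2                  # 2x2 max-pool after conv2: width halved again
    w = (w + 2 * 1 - 2) + 1  # (2,2) pool, stride (2,1), padding (0,1): width +1
    w = (w + 2 * 1 - 2) + 1  # same pooling once more: width +1
    w = w - 2 + 1            # final 2x2 conv, stride 1, no padding: width -1
    return w

print(seq_length(100))  # 26
print(seq_length(140))  # 36 -> resizing to width 140 would give T=36
```

So increasing the resize width to 140 is one way to reach T=36, under the layer configuration assumed above.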
tumbleintoyourheart commented 5 years ago

Hi Holmeyoung,

Thank you for your time and support.

I'm sorry, but I'm still confused about how to calculate the number T=26 given the default image resize width imgW=100.

Holmeyoung commented 5 years ago

Hi, you can refer to #17 for detail.

tumbleintoyourheart commented 5 years ago

Thank you.