process_boxes() unknown chars and misidentifies chars

victoic commented 5 years ago

I started training again and noticed that many characters are reported as missing from codec_rev. The data is from icdar2015, icdar2017 (MLT) and icdar2019 (MLT), and the provided codec.txt is used.

Stranger still, the same error (unknown char) shows up for data from icdar2015, which consists entirely of English characters.

[image: unknown-chars]

As shown by the image above, the character "थ" is not found in codec_rev, yet it is reported as present in the GT for image 277 from icdar2015. However, this is the actual GT for image 277:

[image: gt-277]

Is there a specific file encoding for codec.txt that I must set? Can you provide some information about why this is happening?
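One way to narrow this down is to read codec.txt with an explicit encoding and list the characters of a transcription that have no entry in the lookup table. The following is only a minimal sketch: `load_codec`, `build_codec_rev` and `unknown_chars` are hypothetical helper names, not functions from the repository, and it assumes that codec.txt stores all characters on its first line and that the file is UTF-8.

```python
def load_codec(path):
    """Read the codec string from the first line of `path` (assumed layout)."""
    # "utf-8-sig" decodes UTF-8 with or without a byte-order mark, so the
    # result does not depend on the platform's default encoding.
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.readline().rstrip("\r\n")


def build_codec_rev(codec_chars, first_index=0):
    """Map each character of the codec string to an integer index.

    `first_index` is a hypothetical parameter for reserving low indices.
    """
    return {ch: i + first_index for i, ch in enumerate(codec_chars)}


def unknown_chars(text, codec_rev):
    """Return characters of `text` missing from `codec_rev`, in first-seen order."""
    missing = []
    for ch in text:
        if ch not in codec_rev and ch not in missing:
            missing.append(ch)
    return missing
```

Running `unknown_chars` on the transcriptions of a single GT file shows whether the unexpected characters really come from that file or from a different one.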

MichalBusta commented 5 years ago

On Wed, 15 May 2019 10:12 Victor Lundgren, notifications@github.com wrote:

> I started training again and noticed many characters not being identified as existing in the codec_rev. The data is from icdar2015, icdar2017 (MLT) and icdar2019 (MLT) and the provided codec.txt is used.

The code is for the MLT 2017 version, so Hindi chars are missing.

> Stranger still is that the same error (unknown char) is showing up for data from icdar2015, which is completely composed of english characters.
>
> [image: unknown-chars] https://user-images.githubusercontent.com/9040771/57793581-0b0f8e00-7718-11e9-85ad-48666bb3c656.png
>
> As shown by the image above, the character "थ" is not found in the codec_rev, but is found in GT for image 277 from icdar2015. However that's the GT from image 277:
>
> [image: gt-277] https://user-images.githubusercontent.com/9040771/57793882-cb957180-7718-11e9-9201-23bc174f427c.png
>
> Is there some file formatting for the codec.txt that I must set? Can you provide some information about why is this happening?

There are some naming conventions (relative GT path ...); please read data_gen.py.


victoic commented 5 years ago

> code is for MLT 2017 version. So Hindi chars are missing.

Right, but as can be seen from the first "Unknown char" message in the first image, the error is also raised for an image from icdar2017.

> There are some naming conventions (relative gt path ...), please read data_gen.py

OK, I've read it. From what I understand, the directory layout matters for how the GT is loaded. My dataset path looks like this:

images/
- trainMLT.txt
- icdar-2015-Ch4/
  - Train/
    - (images and gt here)
- done/
  - icdar-2017-mlt/
    - (images and gt here)
  - icdar-2019-mlt/
    - (images and gt here)

This seems to be consistent with the generator class and the example directory in the repository. Am I misunderstanding something?

MichalBusta commented 5 years ago

On Wed, 15 May 2019 21:10 Victor Lundgren, notifications@github.com wrote:

> code is for MLT 2017 version. So Hindi chars are missing.
>
> Right, but as can be seem by the first "Unknown char" message in the first image, the error is given by a image from icdar2017 as well.

I have no other explanation than that for icdar 2017 it reads the GT from MLT, since the image names are the same. Sorry, I can't help more; I'm without access to a computer.
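The name-collision explanation above can be checked by grouping the listed image paths by file name and reporting names that occur in more than one directory. This is only a rough sketch: `find_name_collisions` and `read_image_list` are hypothetical helpers, and it assumes that a list file such as trainMLT.txt holds one image path per line, which should be verified against data_gen.py.

```python
import os
from collections import defaultdict


def find_name_collisions(paths):
    """Return {file name: sorted directories} for names seen in 2+ directories."""
    dirs_by_name = defaultdict(set)
    for path in paths:
        dirs_by_name[os.path.basename(path)].add(os.path.dirname(path))
    return {
        name: sorted(dirs)
        for name, dirs in dirs_by_name.items()
        if len(dirs) > 1
    }


def read_image_list(list_file):
    """Read non-empty lines from a list file (assumed: one image path per line)."""
    with open(list_file, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]
```

If `find_name_collisions(read_image_list("trainMLT.txt"))` returns a non-empty dictionary, the same image name exists in several datasets, so a GT lookup based on the name alone could pick up the annotation of a different dataset.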



MichalBusta / E2E-MLT

process_boxes() unknown chars and misidentifies chars #33