JaidedAI / EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Chinese, Arabic, Devanagari, Cyrillic, etc.
https://www.jaided.ai
Apache License 2.0

Problem with the labels.csv file #1056

Open iAlexMG opened 1 year ago

iAlexMG commented 1 year ago

The dataset in question is en_sample.zip, linked from https://www.jaided.ai/easyocr/modelhub/

If you download the dataset and open its .csv file, you will find several issues.

Normally, all the data should be in the 1st column, but if you scroll down you will see that some of it ends up in the 2nd column.

  1. At cell B71, we have eleven pictures in the same cell : 95-50-93-52-96-91-98-89-94-63-102.jpg. I don't know why they are together.
  2. Apart from line 71 in point 1 above, all other cells in column B that contain data appear to have been split off as if ";" were the separator (169-393-561-677-685-768-910-946-97.jpg).
  3. In addition, the ";" character is not among the separators accepted by the pd.read_csv( sep='^([^,]+),' ) call in dataset.py.
  4. Finally, the CSV separator character is the comma, but commas also appear inside the text of certain pictures.
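To make points 3 and 4 concrete, here is a minimal sketch of what the separator pattern quoted above does to a single line. It only exercises the regular expression with Python's re module, not pandas itself, and the two sample rows and the helper name split_label_line are made up for illustration.

```python
import re

# Same pattern that is passed as `sep` to pd.read_csv in dataset.py (see point 3).
FIRST_COMMA = re.compile(r'^([^,]+),')

def split_label_line(line):
    """Split one 'filename,words' line at its first comma only (hypothetical helper)."""
    parts = FIRST_COMMA.split(line.rstrip('\n'))
    # re.split keeps the captured group, so a matching line yields
    # ['', filename, words]; a line without any comma yields [line].
    if len(parts) != 3:
        raise ValueError(f"no filename/words separator in: {line!r}")
    return parts[1], parts[2]

# A comma inside the text is kept, because only the first comma is consumed.
print(split_label_line("95.jpg,Hello, world"))

# A ";" is not treated as a separator: it stays glued to the filename,
# and the split happens at the first comma inside the text instead.
print(split_label_line("169.jpg;some text, more"))
```

So a row that uses ";" between filename and text would, under this pattern, probably end up with a wrong filename rather than raise an obvious error.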

So I think it's impossible to use the dataset directly without modifying it. Personally, I appended a ";" character and the contents of each B cell to the corresponding A cell. For the 11 photos on line 71, I simply transcribed them onto separate lines.

Is it just me who doesn't understand how to use EasyOCR, or is there really a problem with the .csv? I think the ideal solution would be to create a CSV that uses a TAB as the separator. What do you think?
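As a rough sketch of the TAB idea, the conversion could look like the following. It assumes every record has already been repaired by hand into a single "filename,words" line (so the ";" rows and line 71 are fixed first), and the function name to_tab_separated is hypothetical.

```python
def to_tab_separated(lines):
    """Turn 'filename,words' lines into 'filename<TAB>words' lines.

    Assumes one record per line, with the first comma being the
    real separator; any later commas belong to the text.
    """
    converted = []
    for line in lines:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip blank lines
        filename, sep, words = line.partition(',')
        if not sep:
            raise ValueError(f"no comma found in: {line!r}")
        if '\t' in filename or '\t' in words:
            raise ValueError(f"line already contains a TAB: {line!r}")
        converted.append(f"{filename}\t{words}")
    return converted

# Made-up sample rows, including the header line.
sample = ["filename,words\n", "95.jpg,Hello, world\n", "50.jpg,plain text\n"]
for row in to_tab_separated(sample):
    print(row)
```

With a TAB separator, commas in the text no longer need any special handling, as long as the text itself never contains a TAB.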

THX and sorry for asking this question!

Alex

iAlexMG commented 1 year ago

OK, I modified dataset.py to use a TAB as the separator in the CSV file, and it works well.

dataset.py at line 153 : self.df = pd.read_csv(os.path.join(root,'labels.txt'), sep='\t', engine='python', usecols=['filename', 'words'], keep_default_na=False)

Also, since the lines can't be sorted directly by the numerical value of the photo name, I simply chose all photo names starting with "1" for EN_VAL and the rest for _TRAIN; the former represents 12.5% of the total dataset. Here are my new .csv and .txt files: labels_val.csv labels_train.csv labels_val.txt labels_train.txt
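For anyone who wants to reproduce the split, a minimal sketch of the rule described above follows. It assumes the labels are already loaded as (filename, words) pairs without the header row; the function name and the sample rows are made up.

```python
def split_by_leading_one(rows):
    """Send filenames starting with '1' to validation, the rest to training."""
    val = [row for row in rows if row[0].startswith('1')]
    train = [row for row in rows if not row[0].startswith('1')]
    return train, val

# Made-up sample rows: (filename, words).
rows = [
    ("1.jpg", "alpha"),
    ("20.jpg", "beta"),
    ("102.jpg", "gamma, delta"),
    ("95.jpg", "epsilon"),
]
train, val = split_by_leading_one(rows)
print(len(train), "training rows,", len(val), "validation rows")
```

The share that lands in validation depends entirely on how the filenames are numbered, so it is worth checking the counts on your own copy of the dataset.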

BMukhtar commented 7 months ago

I also spent a lot of time figuring out these issues (in my case, my model wasn't able to recognize comma characters). Thanks! I will also use a tab as the separator.