Open iAlexMG opened 1 year ago
OK, I modified dataset.py to use a TAB as the separator in the CSV file, and it works well.
dataset.py, line 153:
self.df = pd.read_csv(os.path.join(root,'labels.txt'), sep='\t', engine='python', usecols=['filename', 'words'], keep_default_na=False)
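To show what that call does with a label that contains a comma, here is a minimal sketch that runs the same read_csv arguments on an in-memory sample. The two sample rows are made up for illustration; the real file is the labels.txt under root.

```python
import io

import pandas as pd

# Hypothetical sample standing in for labels.txt (tab-separated, with a header).
sample = "filename\twords\n1.jpg\tHello, world\n2.jpg\tNA\n"

df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",                       # split on TAB only, so commas stay in the label
    engine="python",
    usecols=["filename", "words"],
    keep_default_na=False,          # keep the literal text "NA" instead of NaN
)

print(df["words"].tolist())  # ['Hello, world', 'NA']
```

With a TAB separator the comma stays inside the words column, and keep_default_na=False keeps labels such as "NA" as plain text.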
Also, since the lines can't be sorted directly by the numerical value of the photo name, I simply chose all photo names starting with "1" for EN_VAL and the rest for _TRAIN. That selection represents 12.5% of the total dataset. Here are my new .csv and .txt files: labels_val.csv labels_train.csv labels_val.txt labels_train.txt
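The split described above could be sketched roughly like this. The function name split_by_prefix and the sample file names are only illustrative, not part of EasyOCR, and the share that ends up in validation depends on how the photos are numbered.

```python
def split_by_prefix(filenames, prefix="1"):
    """Put names starting with `prefix` in validation, the rest in training."""
    val = [name for name in filenames if name.startswith(prefix)]
    train = [name for name in filenames if not name.startswith(prefix)]
    return val, train


# Hypothetical file names 1.jpg .. 20.jpg, only to show the behaviour.
names = [f"{i}.jpg" for i in range(1, 21)]
val, train = split_by_prefix(names)

print(len(val), len(train))  # 11 9
```

Because the test is on the first character of the name, 1.jpg and 10.jpg to 19.jpg all land in validation here.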
I also spent a lot of time figuring out these issues (in my case, my model wasn't able to recognize comma characters). Thanks! I will also use a tab as the separator.
On this site: https://www.jaided.ai/easyocr/modelhub/ (dataset link: en_sample.zip)
If you download the dataset and open its .csv file, you will find some problems in it:
Normally, all the data should be in the first column. If you scroll down, you will see that some of the data ends up in the second column.
So I think it's impossible to use the dataset directly without modifying it. Personally, I appended a ";" character followed by the data from the B cells to the A cells. For the 11 photos on line 71, I simply transcribed them onto separate lines.
Is it just me who doesn't understand how to use EasyOCR, or is there really a problem with the .csv? I think the ideal solution would be to create a CSV that uses a TAB as the separator. What do you think?
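For what it's worth, a conversion to a TAB separator might look something like the sketch below. It assumes each line has the form filename,words, that the first comma is the real separator, and that the fields are not quoted; I have not checked whether the actual file follows that. The helper name comma_to_tab is made up.

```python
def comma_to_tab(lines):
    """Replace only the first comma of each line with a TAB.

    Assumes "filename,words" with unquoted fields (an assumption, not verified
    against the real dataset). Later commas are kept as part of the label.
    """
    converted = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        filename, _, words = line.partition(",")
        converted.append(f"{filename}\t{words}")
    return converted


# Hypothetical rows, only to show the behaviour.
rows = comma_to_tab(["filename,words\n", "1.jpg,Hello, world\n", "2.jpg,OCR\n"])
for row in rows:
    print(row)
```

Only the first comma is treated as the separator, so a label like "Hello, world" stays in one field instead of spilling into a second column.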
Thanks, and sorry for asking this question!
Alex