Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
GNU General Public License v3.0
1.05k stars, 211 forks

Add better support for Brazilian Portuguese #360

Closed (insinfo closed this issue 1 month ago)

insinfo commented 3 months ago

I ran a test to OCR scanned documents in Brazilian Portuguese, and I found that calamari/ocr4all makes a lot of mistakes on them.

[Attached images: 0001, 0002]

Page: 0001 | Line: r_0000_l001 | Prediction: '‪i‬'
Page: 0001 | Line: r_0000_l002 | Prediction: '‪‬'
Page: 0001 | Line: r_0000_l003 | Prediction: '‪˘˘‬'
Page: 0001 | Line: r_0000_l004 | Prediction: '‪‬'
Page: 0001 | Line: r_0001_l001 | Prediction: '‪‬'
Page: 0001 | Line: r_0001_l002 | Prediction: '‪r‬'
Page: 0001 | Line: r_0001_l003 | Prediction: '‪‬'
Page: 0001 | Line: r_0002_l003 | Prediction: '‪‬'
Page: 0001 | Line: r_0003_l001 | Prediction: '‪SS7a  l  JAne‬'
Page: 0001 | Line: r_0003_l002 | Prediction: '‪Prefeitura Municipal de Rio das Ostras‬'
Page: 0001 | Line: r_0003_l003 | Prediction: '‪pr07OCOl0 GeRAl.‬'
Page: 0001 | Line: r_0004_l001 | Prediction: '‪Cet ́)‬'
Page: 0001 | Line: r_0004_l002 | Prediction: '‪etsss īS7  2003 et a082003 r 10:53:56‬'
Page: 0001 | Line: r_0004_l003 | Prediction: '‪8erēæ coscasε cua e[‬'
Page: 0001 | Line: r_0004_l004 | Prediction: '‪(=‬'
Page: 0001 | Line: r_0004_l005 | Prediction: '‪ssc es. secetane mult ds azrs‬'
Page: 0001 | Line: r_0004_l006 | Prediction: '‪ o —‬'
Page: 0001 | Line: r_0004_l007 | Prediction: '‪spesr eχ e s s lscaeχ2o‘ ;‬'
Page: 0001 | Line: r_0004_l008 | Prediction: '‪ δ !0‬'
Page: 0001 | Line: r_0004_l009 | Prediction: '‪Assto: AVa‬'
Page: 0001 | Line: r_0005_l006 | Prediction: '‪‬'

The correct output would be:


ESTADO DO RIO DE JANEIRO
Prefeitura Municipal de Rio das Ostras
PROTOCOLO GERAL

Processo: 18457 / 2003
Data: 03/09/2003
Hora: 10:53:56
Requerente: COSCARELLI E CIA LTDA ME
Sec. Destino: Secretaria Municipal de Fazenda
Dept. Destino: Depto. de Tributos e Fiscalização
Assunto: ALVARÁ

bertsky commented 1 month ago

You did not specify which model you were using, and the image you showed is not segmented yet, so it is hard to tell to what extent the errors are caused by bad layout recognition. AFAIK there is no public model for Portuguese yet. The only modern model is uw3-modern-english, but naturally, this would perform much worse on non-English text.
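
To put a number on "a lot of mistakes", one option is a per-line character error rate (CER) between each prediction and the expected text. The following is only a minimal sketch in plain Python, not part of Calamari or OCR4all: the helper names `levenshtein` and `cer` are made up for illustration, and pairing line r_0003_l003 with "PROTOCOLO GERAL" is an assumption based on position in the report.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(prediction: str, truth: str) -> float:
    """Character error rate: edit distance divided by ground-truth length."""
    if not truth:
        return 0.0 if not prediction else 1.0
    return levenshtein(prediction, truth) / len(truth)


# (prediction, expected) pairs copied from the report; the pairing of the
# second entry is assumed, not confirmed by the thread.
pairs = [
    ("Prefeitura Municipal de Rio das Ostras",
     "Prefeitura Municipal de Rio das Ostras"),
    ("pr07OCOl0 GeRAl.", "PROTOCOLO GERAL"),
]

for pred, truth in pairs:
    print(f"CER {cer(pred, truth):.2f}: {pred!r} vs {truth!r}")
```

Running this on manually cropped line images versus automatically segmented ones would help separate recognition errors from layout errors, since a line that is segmented badly tends to score poorly regardless of the model.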