OCR python library or wrapper

hectorespert commented 6 years ago

Would it be possible to have a library for OCR in the project, such as pytesseract or tesserocr?

I think some parsers would become possible if we could use OCR.

corradio commented 6 years ago

In principle yes. But I think that's overkill to be honest. Also it might be pushing the scraping a bit too far. What do others think?

systemcatch commented 6 years ago

Hey @blackleg I had a go at some OCR in #606. I was using tesseract on the command line to try and get their dashboard data into usable form. Initial results were OK but not good enough to be reliable. I tried tweaking various settings but wasn't able to improve the results much; however, it should be possible to get it to work.

What other parsers did you have in mind?

hectorespert commented 6 years ago

Maharashtra State in India: the web page that shows generation and production data uses a JPG image.

corradio commented 6 years ago

Same with #787 actually, but I still think we're compromising too much on precision of the values if we do OCR. For that reason I'm closing this issue. Feel free to re-open if you disagree.

Olivier

alixunderplatz commented 6 years ago

There are 3 issues now that would add new regions to the map (Namibia, Maharashtra (in India) and Aland) if there was a reliable way to use OCR. For Singapore and Europe, some improvements could be achieved. It would be really cool if we could continue with the idea of using OCR and tweak the OCR settings to get each of these regions going. I am currently collecting images every once in a while, so in a week we will have an album to track reliability for different combinations of numbers and text.
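One possible way to track reliability over such an album is to compare the OCR text for each image against a hand-transcribed value and report the share of exact matches. The sketch below is only an illustration: `parse_value`, `exact_match_rate` and the sample strings are hypothetical, and it assumes a comma in the OCR text is a thousands separator.

```python
import re
from typing import Iterable, Optional, Tuple

# First signed integer or decimal number found in a string
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_value(text: str) -> Optional[float]:
    """Return the first number in OCR text, or None if there is none.

    Assumption: commas are thousands separators and can be dropped.
    """
    match = NUMBER.search(text.replace(",", ""))
    return float(match.group()) if match else None


def exact_match_rate(samples: Iterable[Tuple[str, float]]) -> float:
    """Fraction of (ocr_text, expected_value) pairs that parse to the expected value."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for text, expected in samples if parse_value(text) == expected)
    return hits / len(samples)


# Made-up sample data: OCR output next to the value read off the image by hand
album = [
    ("452.3 MW", 452.3),   # read correctly
    ("4S2 MW", 452.0),     # '5' misread as 'S'
    ("1,204", 1204.0),     # thousands separator
    ("", 10.0),            # nothing recognised
]
rate = exact_match_rate(album)
```

Running the same album through different OCR settings would then give one comparable number per setting.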

I tried to drop some images on the tesseract.js page and could already improve the results by manually doubling image size to 200% beforehand without any other settings. This could be a general first step for large improvements of OCR results and data quality.

See this image for Namibia: text in the lower right corner is with original image size, upper right after changing size to 200%. Also, if possible, selecting certain areas/pixel ranges/colours of an image may lead to the desired data, because mostly, these won't really change over time.
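The idea of selecting fixed areas could look roughly like the sketch below, using Pillow's `Image.crop` and `pytesseract.image_to_string`. The region names and pixel coordinates are made up for illustration and would have to be measured on the real images; `clamp_box` and `ocr_regions` are hypothetical helpers, not existing project code.

```python
from typing import Dict, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels

# Hypothetical regions; real coordinates must be measured on the actual image
REGIONS: Dict[str, Box] = {
    "generation": (400, 20, 560, 60),
    "load": (400, 80, 560, 120),
}


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a box to the image bounds so the crop stays inside the image."""
    left, upper, right, lower = box
    left = max(0, min(left, width))
    right = max(left, min(right, width))
    upper = max(0, min(upper, height))
    lower = max(upper, min(lower, height))
    return (left, upper, right, lower)


def ocr_regions(path: str, regions: Dict[str, Box]) -> Dict[str, str]:
    """Run OCR separately on each named region of the image at `path`."""
    # Imported lazily so clamp_box can be used without the OCR stack installed
    from PIL import Image
    import pytesseract

    results = {}
    with Image.open(path) as img:
        width, height = img.size
        for name, box in regions.items():
            crop = img.crop(clamp_box(box, width, height))
            # --psm 7 asks tesseract to treat the crop as a single line of text
            text = pytesseract.image_to_string(crop, config="--psm 7")
            results[name] = text.strip()
    return results
```

Since the layout of these dashboard images rarely changes, the coordinates would only need to be set once per source, but a layout change would silently break the parser, so the values should still be sanity-checked.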

Summary of the issues with OCR demand:

#606 Namibia - generation, exchanges and system load as separate .png images from this site.
#304 Maharashtra (India) (see comment from 24. Aug 2018) - thermal, hydro and exchanges and PV and wind.
#883 Aland (Finland) - exchange with Finnish mainland, load, and generation from wind and fossil fuels.
#653 Singapore PV - each minute, the PV generation is updated
#1015 Some European TSOs (CH, HU, SK, ...) publish real-time exchanges and generation as images.

systemcatch commented 6 years ago

Interesting, I've had several goes at getting OCR to work for Namibia, using various ImageMagick techniques to improve the quality for tesseract. It's always been close but not good enough for what we need.

Resizing the image hadn't occurred to me. Increasing it by 300% while maintaining the aspect ratio gives really good accuracy. I think it would work in a parser now.
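For reference, a minimal sketch of that resize step in Python with Pillow and pytesseract (not the ImageMagick and command-line setup used above). It assumes "300%" means scaling both sides by a factor of 3; `scaled_size` and `ocr_upscaled` are hypothetical names.

```python
from typing import Tuple


def scaled_size(size: Tuple[int, int], factor: float) -> Tuple[int, int]:
    """Scale width and height by the same factor, which keeps the aspect ratio."""
    width, height = size
    return (max(1, round(width * factor)), max(1, round(height * factor)))


def ocr_upscaled(path: str, factor: float = 3.0) -> str:
    """Enlarge the image at `path` and return the text tesseract reads from it."""
    # Imported lazily so scaled_size can be used without the OCR stack installed
    from PIL import Image
    import pytesseract

    with Image.open(path) as img:
        # LANCZOS is a high-quality resampling filter; other filters may work as well
        big = img.resize(scaled_size(img.size, factor), Image.LANCZOS)
        return pytesseract.image_to_string(big)
```

The best factor probably differs per source image, so it is left as a parameter here.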

alixunderplatz commented 6 years ago

@systemcatch Especially having #606 Namibia :namibia: would be a great thing in general for the African continent (besides the Canary Islands).

Edit: Do you know whether it is possible to select only certain areas of an image or cut out relevant parts to be OCRed?

electricitymaps / electricitymaps-contrib

OCR python library or wrapper #817
