Image recognition: a simple way to recognize Chinese text in an image (Tesseract)

Requirement: recognize the Chinese text in an image.
Implementation: Python, with the help of the open-source library Tesseract.

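Tesseract is an open-source optical character recognition (OCR) engine, available under the Apache 2.0 license. It can be used directly or, for programmers, through an API to extract typed, handwritten, or printed text from images. It supports a wide range of languages. References: https://github.com/tesseract-ocr/tesseract/wiki https://en.wikipedia.org/wiki/Tesseract_(software)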

Development environment:

macOS
Python 3.6
brew

1. Install tesseract

brew install tesseract
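
To confirm that the install worked, one option is to look for the `tesseract` executable on the PATH and ask it for its version. This is a minimal sketch using only the standard library; the helper names `find_tesseract` and `tesseract_version` are made up for illustration and are not part of Tesseract or pytesseract.

```python
import shutil
import subprocess


def find_tesseract(name="tesseract"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    # Some releases print the version to stderr, so merge both streams.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract not found; check the brew install step")
    else:
        print(path, "->", tesseract_version(path))
```

If the executable is not found, re-running the brew command or opening a new terminal session is usually the first thing to try.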

2. Install the Python package

pip3 install pytesseract
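
A quick way to check that the package landed in the interpreter you intend to use is to ask Python whether the modules can be found. The sketch below assumes nothing beyond the standard library; `is_installed` is a hypothetical helper name. pip normally pulls in Pillow (import name `PIL`) as a dependency of pytesseract, so both are checked.

```python
import importlib.util


def is_installed(module_name):
    """Return True if `module_name` can be imported by this interpreter."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # pytesseract is the wrapper; PIL is the import name of Pillow.
    for name in ("pytesseract", "PIL"):
        print(name, "installed" if is_installed(name) else "missing")
```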


3. Download the Chinese trained data

tesseract supports many languages: https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages

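Download the Simplified Chinese data file chi_sim.traineddata from https://github.com/tesseract-ocr/tessdata into the /usr/local/Cellar/tesseract/3.05.01/share/tessdata directory (the version segment of the path depends on the tesseract release that brew installed).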
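
To check that the file ended up in the right place, you can list the `.traineddata` files in the tessdata directory. The sketch below uses only the standard library; `installed_languages` is an illustrative helper name, and the directory is the one from this post, which will differ on other installs.

```python
from pathlib import Path


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return []
    # "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in directory.glob("*.traineddata"))


if __name__ == "__main__":
    # Path taken from this post; the version segment depends on the install.
    tessdata = "/usr/local/Cellar/tesseract/3.05.01/share/tessdata"
    langs = installed_languages(tessdata)
    print("chi_sim available:", "chi_sim" in langs)
```

Running `tesseract --list-langs` in a terminal should report the same set of languages.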


4. Show the code

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

# Open the image and run OCR with the Simplified Chinese model (chi_sim)
image = Image.open('/Users/fatli/Desktop/dufu.png')
text = pytesseract.image_to_string(image, lang='chi_sim')
print(text)
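
The recognized string may come back with spaces between adjacent Chinese characters. If that happens, a small cleanup step can help. The sketch below is not part of pytesseract: `remove_cjk_spaces` is a hypothetical helper, it only covers the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and it leaves line breaks alone.

```python
import re

# Spaces, tabs or ideographic spaces sitting between two CJK ideographs.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t\u3000]+(?=[\u4e00-\u9fff])")


def remove_cjk_spaces(text):
    """Drop horizontal whitespace between adjacent Chinese characters."""
    return _CJK_GAP.sub("", text)


if __name__ == "__main__":
    sample = "床 前 明 月 光\nhello world"
    print(remove_cjk_spaces(sample))
```

Spaces between Latin words are untouched, so mixed Chinese and English output stays readable.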


Appendix: English recognition
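
For English text the only change is the language code: `eng` instead of `chi_sim`. Tesseract also accepts several codes joined with `+` (for example `chi_sim+eng`) when an image mixes languages. The following is a sketch under the assumption that eng.traineddata is present in the tessdata directory; `join_langs` and `ocr_file` are illustrative names, not pytesseract functions.

```python
def join_langs(langs):
    """Build the value for the `lang` argument, e.g. 'chi_sim+eng'."""
    return "+".join(langs)


def ocr_file(path, langs=("eng",)):
    """Run OCR on the image at `path` with the given language codes."""
    # Imported here so that join_langs can be used without Pillow installed.
    from PIL import Image
    import pytesseract

    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=join_langs(langs))


if __name__ == "__main__":
    import sys

    # Usage: python3 ocr.py some_image.png  (the file name is a placeholder)
    if len(sys.argv) > 1:
        print(ocr_file(sys.argv[1], langs=("eng",)))
```

Passing `langs=("chi_sim", "eng")` would ask Tesseract to consider both models for the same image.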

