ftnext / ocroy

https://pypi.org/project/ocroy/
MIT License

Know how to read text written in an image with Unstructured #4

Closed ftnext closed 11 months ago

ftnext commented 11 months ago

2 Tesseract wrapper

https://unstructured-io.github.io/unstructured/core/partition.html#partition-image

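ftnext commented 11 months ago

from unstructured.partition.image import PartitionStrategy, partition_image

# Partition the image with the OCR-only strategy, recognizing Japanese ("jpn")
elements = partition_image(
    filename="kanji.png",
    strategy=PartitionStrategy.OCR_ONLY,
    languages=["jpn"],
)
# Join the text of each extracted element, separated by blank lines
text = "\n\n".join(str(e) for e in elements)

ftnext commented 11 months ago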

To remove the half-width spaces from the OCR output, the implementation in https://nikkie-ftnext.hatenablog.com/entry/remove-whitespace-in-text-with-regex looks usable.
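
A minimal sketch of that kind of space removal, not necessarily the linked post's implementation. It assumes the unwanted half-width spaces sit between Japanese characters, and the character ranges and the function name `remove_spaces_between_japanese` are hypothetical choices for illustration:

```python
import re

# Assumed definition of "Japanese character": CJK punctuation, hiragana,
# katakana, and the basic CJK ideograph block. Adjust the ranges as needed.
_JA = r"[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]"

# One or more half-width spaces with a Japanese character on both sides.
# Lookarounds leave the neighbours unconsumed, so runs like "a b c" of
# Japanese characters are fully collapsed in a single pass.
_SPACE_BETWEEN_JA = re.compile(rf"(?<={_JA}) +(?={_JA})")


def remove_spaces_between_japanese(text: str) -> str:
    """Drop half-width spaces between Japanese characters; keep everything else."""
    return _SPACE_BETWEEN_JA.sub("", text)
```

Only the half-width space character is matched, so the blank lines that separate elements in `text` are preserved, and spaces between Latin words are left alone. It could be applied as `remove_spaces_between_japanese(text)` on the joined OCR output.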