ftnext closed 11 months ago
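from unstructured.partition.image import PartitionStrategy, partition_image

# Run OCR only (no layout model), with the Japanese language code "jpn"
elements = partition_image(
    filename="kanji.png",
    strategy=PartitionStrategy.OCR_ONLY,
    languages=["jpn"],
)
# Join the extracted elements into one string, separated by blank lines
text = "\n\n".join(str(e) for e in elements)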
For removing the half-width spaces, the implementation in https://nikkie-ftnext.hatenablog.com/entry/remove-whitespace-in-text-with-regex looks usable.
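As a rough idea of what such a cleanup could look like, here is a minimal sketch. It is not the implementation from the linked post. The function name `remove_spaces_between_japanese` and the chosen Unicode ranges are assumptions made for illustration. It deletes half-width spaces only when both neighbors are Japanese characters, so spaces between Latin words are left alone.

```python
import re

# Assumed character set: CJK punctuation, hiragana, katakana, CJK ideographs.
# Adjust the ranges if the OCR output contains other scripts.
_JA = r"[\u3001-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]"

# One or more half-width spaces with a Japanese character on both sides.
# Lookarounds do not consume the neighbors, so runs like "漢 字 の" are fully handled.
_SPACE_BETWEEN_JA = re.compile(rf"(?<={_JA}) +(?={_JA})")


def remove_spaces_between_japanese(text: str) -> str:
    """Hypothetical helper: drop half-width spaces between Japanese characters."""
    return _SPACE_BETWEEN_JA.sub("", text)


if __name__ == "__main__":
    print(remove_spaces_between_japanese("漢 字 の テ ス ト"))
```

It could be applied to the joined `text` from the snippet above, for example `text = remove_spaces_between_japanese(text)`. Newlines are not matched, so the blank lines between elements stay in place.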
2 Tesseract wrappers
https://unstructured-io.github.io/unstructured/core/partition.html#partition-image