ryanmccool closed this issue 1 year ago
The files are large because they store not only the text content but also the structure of the text (as hOCR) and a base64-encoded thumbnail. This will change in the near future, however, as Obsidian-OCR is switching to a SQLite database.
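To see where the bytes go, here is a rough sketch that measures how much each part of such a JSON file contributes. The field names (`text`, `hocr`, `thumbnail`) and the sample values are assumptions for illustration, not the add-on's actual schema. The one certain part is that base64 turns every 3 bytes of image data into 4 characters, so an embedded thumbnail alone costs about 33% more than its raw size.

```python
import base64
import json


def size_breakdown(record: dict) -> dict:
    """Return the JSON-serialized size in bytes of each top-level field."""
    return {
        key: len(json.dumps(value).encode("utf-8"))
        for key, value in record.items()
    }


# Hypothetical record; the real file layout may differ.
thumbnail_bytes = bytes(3000)  # stand-in for 3000 bytes of image data
record = {
    "text": "Hello world",
    "hocr": "<span class='ocrx_word' title='bbox 10 5 60 25'>Hello</span>",
    "thumbnail": base64.b64encode(thumbnail_bytes).decode("ascii"),
}

sizes = size_breakdown(record)
for field, size in sizes.items():
    print(f"{field}: {size} bytes")
```

Running something like this against a real file (after loading it with `json.load`) would show whether the thumbnail or the hOCR markup dominates for a given scan.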
@MohrJonas Thanks, and also thank you for the add-on. I'm looking forward to testing the new version. I appreciate the hard work!
As a feature suggestion, it would be great to have a flag to skip saving the associated structure data.
The structure stores where the words are located on the page (as coordinates relative to the page). It is intended to be used to mark the found word occurrences in the document (kinda like a highlighter), though that part is not yet implemented.
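For anyone curious how the coordinates could be used, here is a minimal sketch that pulls word bounding boxes out of hOCR and looks up a search term. It assumes Tesseract-style hOCR, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The names `HocrWordParser` and `find_word_boxes` are made up for this example and are not part of Obsidian-OCR.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" entry inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (word, (x0, y0, x1, y1)) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently being read

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            match = BBOX_RE.search(attr.get("title") or "")
            self._bbox = tuple(int(g) for g in match.groups()) if match else None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


def find_word_boxes(hocr: str, query: str) -> list:
    """Return the bounding boxes of all words equal to query (case-insensitive)."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return [box for word, box in parser.words if word.lower() == query.lower()]


sample = (
    "<span class='ocr_line' title='bbox 0 0 200 30'>"
    "<span class='ocrx_word' title='bbox 10 5 60 25; x_wconf 96'>Hello</span> "
    "<span class='ocrx_word' title='bbox 70 5 130 25; x_wconf 91'>world</span>"
    "</span>"
)
print(find_word_boxes(sample, "world"))
```

A highlighter would then draw a rectangle over each returned box, scaled from image pixels to wherever the page is rendered.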
As the title suggests, after the first time I ran a scan, the resulting JSON files were very large. It wasn't unusual for a JSON file to be more than twice the size of the original image. For something that I assumed was only going to create text files, this was very surprising.
Is this by design, or is there something that I can do to reduce file size and only store identified text from the images?