MohrJonas / obsidian-ocr

Obsidian OCR allows you to search for text in your images and pdfs
GNU General Public License v3.0

Very large json files after running the 1st time scan #29

Closed ryanmccool closed 1 year ago

ryanmccool commented 1 year ago

As the title suggests, after I ran the first scan, the resulting JSON files were very large. It was not unusual for a JSON file to be more than twice the size of the original image. For a plugin that I assumed would only create text files, this was very surprising.

Is this by design, or is there something that I can do to reduce file size and only store identified text from the images?

MohrJonas commented 1 year ago

The files are large because not only the text content is saved, but also the structure of the text (as hOCR) and a base64-encoded thumbnail. This will change in the near future, however, as Obsidian-OCR is switching to a SQLite database.
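To give a rough feel for why an embedded thumbnail inflates the JSON, here is a minimal sketch of base64 overhead. The record layout (`text`, `thumbnail`) and the 30,000-byte placeholder are assumptions made for illustration, not the plugin's actual schema. The only general fact relied on is that base64 turns every 3 bytes into 4 characters.

```python
import base64
import json

# Placeholder bytes standing in for a thumbnail image (assumption: 30,000 bytes).
raw_thumbnail = bytes(30_000)

# base64 maps every 3 input bytes to 4 output characters.
encoded_thumbnail = base64.b64encode(raw_thumbnail).decode("ascii")

# Hypothetical record layout, NOT the plugin's real format.
record = {"text": "recognized text goes here", "thumbnail": encoded_thumbnail}
serialized = json.dumps(record)

print(len(raw_thumbnail))      # 30000
print(len(encoded_thumbnail))  # 40000
print(len(serialized) > len(raw_thumbnail))  # True
```

The thumbnail alone grows by about a third once encoded, and the hOCR markup is stored on top of that.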

ryanmccool commented 1 year ago

@MohrJonas Thanks, and also thank you for the add-on. I'm looking forward to testing the new version. I appreciate the hard work!

As a feature suggestion, it would be great to have a flag to skip saving the associated structure data.

MohrJonas commented 1 year ago

The structure stores where the words are located on the page (as coordinates relative to the page). It is intended to be used to mark the found word occurrences in the document, kind of like a highlighter, though that is not yet implemented.
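As a rough illustration of what that structure data contains, here is a sketch that reads word bounding boxes out of a small hOCR fragment using only the Python standard library. The sample markup, the class `HocrWordParser`, and the helpers `parse_bbox` and `find_boxes` are all made up for this example and are not taken from the plugin. The sketch assumes the common hOCR convention of `ocrx_word` spans with a `bbox x0 y0 x1 y1` entry in the `title` attribute.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "span" and "ocrx_word" in (attr.get("class") or "").split():
            self._in_word = True
            self._bbox = parse_bbox(attr.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._in_word:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._in_word = False


def find_boxes(words, query):
    """Return the boxes of all words equal to the query (case-insensitive)."""
    return [box for text, box in words if text.lower() == query.lower()]


# Hand-written sample fragment, not real plugin output.
SAMPLE = """
<span class='ocr_line' title='bbox 10 10 200 40'>
  <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 100 10 200 40; x_wconf 93'>world</span>
</span>
"""

parser = HocrWordParser()
parser.feed(SAMPLE)

text_only = " ".join(text for text, _ in parser.words)
print(text_only)                          # Hello world
print(find_boxes(parser.words, "world"))  # [(100, 10, 200, 40)]
```

The `text_only` string is roughly what a text-only mode would keep, while the boxes are what a highlighter would need to draw over a match.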