gali8 / Tesseract-OCR-iOS

Tesseract OCR iOS is a Framework for iOS7+, compiled also for armv7s and arm64.
http://www.nexor.it
MIT License
4.22k stars 949 forks source link

Add the possibility to load language files from any directory #158

Closed PatrickNLT closed 9 years ago

PatrickNLT commented 9 years ago

Having language files stored into the cache directory is rather restrictive as the system may clean this directory from time to time. My initializer allows the developer to store the language files somewhere else (typically in the Application Support directory).

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.18%) to 91.67% when pulling 7d7aefbe40a46e3a9e9b4197e65b14dd29becffd on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.18%) to 91.67% when pulling 7d7aefbe40a46e3a9e9b4197e65b14dd29becffd on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.19%) to 91.68% when pulling 0cf5a0a3669506697f74595bc3bec37c0a4d10f3 on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

PatrickNLT commented 9 years ago

Thanks for the feedback.

I updated the doc, and added a flag to allow the user to copy the contents of the application bundle to the custom directory or not. I also added a test.

Does that sound good? :)

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.19%) to 91.68% when pulling 1f8faed7ca57853db6ae9067408d375c8c450273 on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.19%) to 91.68% when pulling 1f8faed7ca57853db6ae9067408d375c8c450273 on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.19%) to 91.68% when pulling 1f8faed7ca57853db6ae9067408d375c8c450273 on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

PatrickNLT commented 9 years ago

Done!

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.19%) to 91.68% when pulling aa89bc4a7e84cf42252b37d7e5db3c39941c1264 on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

coveralls commented 9 years ago

Coverage Status

Coverage increased (+0.19%) to 91.68% when pulling aa89bc4a7e84cf42252b37d7e5db3c39941c1264 on PatrickNLT:master into 133c8cb8a35d5cf502b4775e07ec311b7aeb9b69 on gali8:master.

kevincon commented 9 years ago

Thanks again for your contribution! Merging and closing.

ws233 commented 9 years ago

@PatrickNLT, thank you a lot for the feature. I have a few questions and comments about the feature branch.

Why do you think returning a nil in a such case is a good idea? nil data path means tesseract cannot continue to work from the app bundle? We tried to write code in a such way, that tesseract could work with any input parameters. As I already mentioned above, your copyFilesFromResources adds an ambiguity and wrong expectations. If I specify a nil cachesDataPath, tesseract starts and works. If I use your initializer and specifies nil absoluteDataPath, I don't receive tesseract instance at all. So, as I think a new initializer should be rewritten to support a hole concept of this repo. It should do as much work for user as possible and return nil only if there is really nothing to deal with input parameters.

ws233 commented 9 years ago

@kevincon, I'm really sorry that I've missed this branch. Unfortunately, I was not watching for the repo notifications. So I thought nothing is happening here. But I've changed the settings to watch for all of them, since I'm really interested to participate in its development.

PatrickNLT commented 9 years ago

@ws233 Thanks for the extensive feedback.

Generally speaking, the application does not require cache data to operate properly

But we do need these files to have a good OCR. So I don't consider language files as cache files.

To sum up, I think we could get rid of the boolean parameters as it can be confusing and we could have a default behavior in this case, but the point of this new initializer was to give more flexibility to the developer, so it can also be seen as a nice advanced option. And the tests can of course be updated if you think it's a good idea. :)

ws233 commented 9 years ago

Yup, I agree that we need to allow to safe tessdata to the 'AppSupport' folder. That could be necessary in some situations.

Regarding the form of the initializer. Could you explain more details about this advanced option? How it should work? If it's YES, files should be copied and shouldn't otherwise, right? If that's an idea, the current code doesn't satisfy these expectations in all the possible parameters values, that what I worry about. The code should work in every cases exactly like it's declared.

@kevincon, what is your vision? What do you think about a huge number of cases, where initializing tesseract returns nil?

PatrickNLT commented 9 years ago

Yes we're on the same page about the behavior of the boolean parameter. But in which case is it not working as intended?

ws233 commented 9 years ago

I've mentioned this above. Sorry, if it was not clear. b. not nil absoluteDataPath is specified together with copyFilesFromResources = YES parameter, while absoluteDataPath folder already has tessdata files in place.

In such case, symbolic link are not being recreated, but should be, shouldn't it?

Imagine the case, an app creates the links first. Than the user downloads a new version of some language file and rewrite the link. After some time tesseract starts with copyFilesFromResources = YES, because those downloaded files should be again rewritten with symbolic links back. But that will not happen due to conditions inside the 'moveTessdataToDirectoryIfNecessary' function. That's a line to look at:

        if (![fileManager fileExistsAtPath:destinationFileName]) {
PatrickNLT commented 9 years ago

OK now I see what you meant, thanks.

Well, I don't think symbolic links should be re-created in this case, because it goes against the logic you previously described for the cache: if the user updates the language files, you don't want to overwrite them. But you may want not to copy the files at all (for example if the user needs different languages than the one you embedded in the bundle), and that's when the option is useful.

I really think it's just a matter of clarifying the doc.

ws233 commented 9 years ago

I don't agree here. In such case that's one more ambiguity. From one side you say that YES parameter will copy files, from other side you say, that it will not in some situations. You see, too many different and specific cases. And many people will soon ask questions, why it works in a such way and not in another.

Perhaps, that's just a matter of clarifying the doc, but I think it's better to hide this parameter from the user at all.

Anyway, let's wait for @kevincon.

kevincon commented 9 years ago

On whether copyFilesFromResources should overwrite files

I agree with @PatrickNLT that setting the copyFilesFromResources parameter to YES should not overwrite any existing language files because I think the use case he describes (where an app has downloaded newer versions of language files) is a common one and the expected behavior is that the library should only copy files from the app bundle that do not already exist in the target directory.

However I can see @ws233's point of view about the parameter being ambiguous since copyFilesFromResources by itself sounds like it should overwrite any existing files (e.g. the Unix cp command overwrites by default when copying).

One solution is to just rename copyFilesFromResources to something like copyFilesFromResourcesWithoutOverwriting or copyAbsentFilesFromResources, but I think I agree with both of you that copyFilesFromResources can be removed altogether (see below).

Bigger picture

From a bigger picture point of view, I just think the library should support the following (let me know if you disagree or think more configurations should be supported):

  1. The "tessdata" folder exists in the app bundle and Tesseract is configured to use the app bundle's location for the lifetime of the app.
  2. The "tessdata" folder exists in the app bundle and is copied to a writable location, Tesseract is configured to use this location, the developer can update the language files and/or download new files to this location, and Tesseract can be reinitialized to use the newest versions of the files in the location (without overwriting the new files).
  3. The "tessdata" folder does not exist in the app bundle and instead the developer creates the "tessdata" folder programmatically (perhaps by downloading language files from the Internet at runtime) and Tesseract is configured to use the programmatically created "tessdata" folder for the lifetime of the app.

(1) is satisifed by the convenience initializers, (2) is satisfied by the initializer with the cachesRelatedDataPath parameter, and (3) is satisfied by the designated initializer, so I think the library satisfies these configurations currently.

But I agree that it is unnecessary to have the copyFilesFromResources parameter in the designated initializer since only (2) involves copying files from the app bundle.

So I agree with the proposal to remove copyFilesFromResources from the designated initializer and refactor the code so that the copying of the "tessdata" folder in the app bundle occurs exclusively in the initializer with the cachesRelatedDataPath parameter. How does that sound?

ws233 commented 9 years ago

Looks good to me.

PatrickNLT commented 9 years ago

Actually with your proposal 2. is not satisfied if we need another path than the cache path (my case precisely). We may want to seed the tessdata folder with maybe one language and the data necessary to use OSD. I would change it into:

remove copyFilesFromResources from the designated initializer and refactor the code so that the copying of the "tessdata" folder in the app bundle occurs if a folder (cache or any other location) is specified.

kevincon commented 9 years ago

Ah you're right, I forgot cachesRelatedDatapathroots in the Caches folder. I agree with your modification:

"Remove copyFilesFromResources from the designated initializer and refactor the code so that the copying of the "tessdata" folder in the app bundle occurs if a folder (cache or any other location) is specified."

And I guess (3) can be satisfied if the developer does not have a "tessdata" folder in the app bundle? I think the code should recognize that case and not try to copy anything from the app bundle to the specified folder.

PatrickNLT commented 9 years ago

Great, looks good to me too.

kevincon commented 9 years ago

I'll create a new issue for this task.