desmondmorris / node-tesseract

A simple wrapper for the Tesseract OCR package

Is it possible? #46

Open SPlatten opened 7 years ago

SPlatten commented 7 years ago

For PDFs that contain text I am using pdf2json, which gives me all the text nodes and their PDF coordinates. For PDFs that do not contain text I am using node-tesseract, but it extracts just the text. Is it possible to get the coordinates of the text to go along with the output?

SPlatten commented 7 years ago

I think what I am asking can be achieved by getting tesseract to use the "hocr" option, which causes it to output HTML that includes box coordinates for each text item. Now the question is, can the module pass this option through?
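For anyone wondering what to do with that HTML once it arrives, here is a minimal sketch of pulling word boxes out of hOCR markup. It assumes the words are emitted as `ocrx_word` spans whose `title` attribute starts with `bbox x0 y0 x1 y1`, and that `class` appears before `title`. The function name `parseHocrWords` and the sample markup are made up for illustration; a proper HTML parser would be more robust than a regular expression.

```javascript
// Extract { text, x0, y0, x1, y1 } for each word span in an hOCR string.
// Assumption: word spans look like
//   <span class='ocrx_word' ... title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
function parseHocrWords(hocr) {
  const wordPattern =
    /<span[^>]*class=['"]ocrx_word['"][^>]*title=['"]([^'"]*)['"][^>]*>([\s\S]*?)<\/span>/g;
  const words = [];
  let match;
  while ((match = wordPattern.exec(hocr)) !== null) {
    // The title holds several properties; only the bounding box is needed here.
    const bbox = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(match[1]);
    if (!bbox) continue;
    const [x0, y0, x1, y1] = bbox.slice(1).map(Number);
    // Drop any nested tags (e.g. <strong>) around the word text.
    const text = match[2].replace(/<[^>]+>/g, '').trim();
    words.push({ text, x0, y0, x1, y1 });
  }
  return words;
}

// Hypothetical sample input, not real tesseract output.
const sample =
  "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>";
console.log(parseHocrWords(sample));
```

The coordinates are in image pixels with the origin at the top-left, so they would still need scaling before they can be compared with PDF coordinates.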

SPlatten commented 7 years ago

Ok, I've modified tesseract.js inserting:

    command.push("hocr");

at line 70. This results in the output being HTML with box coordinates for every text item. Is there another way of doing this without modifying tesseract.js?

SPlatten commented 7 years ago

After searching around, it seems the built-in, supported way to do this is to add a 'format' option to the options object, specifying 'hocr' as the value.

[edit]...unfortunately it didn't help...back to using the solution in the previous post.

SPlatten commented 7 years ago

Does anyone maintain this module anymore?

reecefenwick commented 7 years ago

You are honestly better off using a library that has native bindings to tesseract.

Or just replicate what this does. This library doesn't do anything special; in fact you could rewrite it a lot more cleanly with ES6 syntax.
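A rough sketch of what such a rewrite might look like, calling the tesseract binary directly through Node's `child_process`. This is not node-tesseract's actual implementation. The names `buildArgs` and `recognize` are hypothetical, and it assumes Tesseract 4 or later, where the page segmentation flag is `--psm` (3.x releases use `-psm`) and where passing `stdout` as the output base prints the result instead of writing a file.

```javascript
const { execFile } = require('child_process');

// Build the CLI argument list: tesseract <image> <outputbase> [options] [configfile]
function buildArgs(imagePath, { lang = 'eng', psm = 3, hocr = false } = {}) {
  const args = [imagePath, 'stdout', '-l', lang, '--psm', String(psm)];
  if (hocr) args.push('hocr'); // config file names go last
  return args;
}

// Run tesseract and resolve with whatever it prints (plain text or hOCR markup).
function recognize(imagePath, options = {}) {
  const binary = options.binary || 'tesseract';
  return new Promise((resolve, reject) => {
    execFile(
      binary,
      buildArgs(imagePath, options),
      { maxBuffer: 16 * 1024 * 1024 }, // hOCR output can be large
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}

// Usage (requires tesseract to be installed and on the PATH):
// recognize('/test.png', { lang: 'eng', hocr: true })
//   .then(console.log)
//   .catch(console.error);
```

Keeping the argument building separate from the process call makes the flag handling easy to check without having tesseract installed.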

SPlatten commented 7 years ago

@reecefenwick, thank you. I did a search around today, and from what I was able to find, node-tesseract seems to be the best module for Node.js.

I will modify the code tonight and implement "hocr" via the options. I've also ordered a book on ES6 as so far I haven't been familiar with it or what it can do.

gforcelong commented 6 years ago

I think you can first modify the default options at line 22 of tesseract.js:

    options: {
      'l': 'eng',
      'psm': 3,
      'config': null,
      'binary': 'tesseract',
      'hocr': null
    },

Then at line 70, add:

    if (options.hocr !== null) {
      command.push('hocr');
    }

In your code, if you want to get hOCR output, do something like this:

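    var tesseract = require('node-tesseract');

    // Any non-null 'hocr' value enables hOCR output with the patch above.
    var options = {
      l: 'chi_sim+eng',
      psm: 4,
      hocr: 'hocr'
    };

    // With hOCR enabled, `text` is HTML markup containing the box coordinates.
    tesseract.process('/test.png', options, function(err, text) {
      if (err) {
        console.error(err);
      } else {
        console.log('----------------------------');
        console.log(text);
      }
    });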