creatale / node-dv

A node.js library for processing and understanding scanned documents

add set/get variable methods in tesseract module #11

Closed chrox closed 10 years ago

chrox commented 10 years ago

This way we can generate an hOCR file without running recognition.

tesseract.tessedit_make_boxes_from_boxes = true;
tesseract.findText('hocr', 0);

schulzch commented 10 years ago

Nice work! I think this patch could be further improved by publishing Tesseract variables as JavaScript attributes (automatic type conversion, enumeration of all variables in Tesseract objects), like this:

tesseract.tessedit_make_boxes_from_boxes = 1;
// or using some magic:
tesseract.tesseditMakeBoxesFromBoxes = true;
tesseract.findText('hocr', 0);

Do you like this idea?
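
To make the "magic" variant concrete, here is a minimal sketch of how camelCase attributes could be mapped onto Tesseract's snake_case variable names with a JavaScript Proxy. It is not node-dv's actual implementation: createFakeBackend, wrapVariables, and the setVariable/getVariable methods on the fake backend are hypothetical stand-ins for the native binding, and sending booleans as '1'/'0' is an assumption about what the binding would accept.

```javascript
// Hypothetical stand-in for the native binding; it only stores strings in memory.
function createFakeBackend() {
  const store = new Map();
  return {
    setVariable(name, value) { store.set(name, String(value)); },
    getVariable(name) { return store.get(name); },
  };
}

// tesseditMakeBoxesFromBoxes -> tessedit_make_boxes_from_boxes
// Names that are already snake_case pass through unchanged.
function toSnakeCase(name) {
  return name.replace(/[A-Z]/g, (ch) => '_' + ch.toLowerCase());
}

// Assumption: booleans are sent as '1' / '0', everything else as its string form.
function toVariableString(value) {
  if (typeof value === 'boolean') return value ? '1' : '0';
  return String(value);
}

// Wrap a backend so that any attribute access is forwarded as a variable get/set.
function wrapVariables(backend) {
  return new Proxy({}, {
    set(_target, prop, value) {
      backend.setVariable(toSnakeCase(String(prop)), toVariableString(value));
      return true;
    },
    get(_target, prop) {
      if (typeof prop !== 'string') return undefined;
      return backend.getVariable(toSnakeCase(prop));
    },
  });
}

const backend = createFakeBackend();
const vars = wrapVariables(backend);
vars.tesseditMakeBoxesFromBoxes = true;   // camelCase spelling
vars.tessedit_pageseg_mode = 6;           // snake_case spelling also works
```

A Proxy accepts any name, so a typo would silently create a new variable in this sketch; a real binding would presumably reject unknown names.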

chrox commented 10 years ago

It would be really cool to enumerate Tesseract variables and convert them automatically to JavaScript attributes. The only way I can find to list the available variables is to call the Tesseract::PrintVariables API, which requires a file stream (a C FILE pointer) to receive the variable dump. I'm wondering if it's OK to stream the dump to a temporary file, then read it back and parse the variables.
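
For illustration, the read-back-and-parse step might look roughly like the sketch below. It assumes the dump contains one variable per line as tab-separated name, value, and description fields; that layout is an assumption here, and the sample lines (including their placeholder descriptions) are made up for the example rather than copied from real Tesseract output. parseVariableDump is a hypothetical helper, not part of node-dv.

```javascript
// Parse a variable dump into a { name: value } map.
// Assumed line layout: "<name>\t<value>\t<description>".
function parseVariableDump(dump) {
  const variables = {};
  for (const line of dump.split(/\r?\n/)) {
    if (line.trim() === '') continue;            // skip blank lines
    const [name, value = ''] = line.split('\t'); // the description field is ignored
    variables[name] = value;
  }
  return variables;
}

// Made-up sample input in the assumed layout.
const sampleDump = [
  'tessedit_make_boxes_from_boxes\t0\t(description)',
  'tessedit_char_whitelist\t\t(description)',
  '',
].join('\n');

const parsed = parseVariableDump(sampleDump);
```

Note that every value comes back as a string, so the variable's type would still have to be guessed or looked up separately.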

schulzch commented 10 years ago

Streaming and parsing is way too complicated. GlobalParams() can be used for this (see baseapi.cpp:155).

chrox commented 10 years ago

Now all Tesseract variables, including global variables and member variables of the Tesseract class, are automatically exposed as attributes on the JavaScript object.
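
As a rough JavaScript-side analogy of that behavior (the actual patch does this in the native module, which is not shown in this thread), enumerable attributes with automatic type conversion could be sketched as follows. The descriptor list, createFakeBackend, defineVariableAttributes, and the type names 'bool', 'int', 'double', and 'string' are all hypothetical, chosen only for this example.

```javascript
// Hypothetical stand-in for the native binding; values are kept as strings.
function createFakeBackend() {
  const store = new Map();
  return {
    setVariable(name, value) { store.set(name, value); },
    getVariable(name) { return store.get(name); },
  };
}

// Convert the stored string to a JavaScript value according to the declared type.
function fromVariableString(type, raw) {
  if (raw === undefined) return undefined;
  switch (type) {
    case 'bool': return raw === '1';
    case 'int': return parseInt(raw, 10);
    case 'double': return parseFloat(raw);
    default: return raw;
  }
}

// Convert a JavaScript value to the string form kept by the backend.
function toVariableString(type, value) {
  if (type === 'bool') return value ? '1' : '0';
  return String(value);
}

// Define one enumerable accessor per known variable on the target object.
function defineVariableAttributes(target, backend, descriptors) {
  for (const { name, type } of descriptors) {
    Object.defineProperty(target, name, {
      enumerable: true,
      get() { return fromVariableString(type, backend.getVariable(name)); },
      set(value) { backend.setVariable(name, toVariableString(type, value)); },
    });
  }
  return target;
}

// Hypothetical variable list; a real binding would enumerate it natively.
const descriptors = [
  { name: 'tessedit_make_boxes_from_boxes', type: 'bool' },
  { name: 'tessedit_pageseg_mode', type: 'int' },
  { name: 'tessedit_char_whitelist', type: 'string' },
];

const backend = createFakeBackend();
const tesseract = defineVariableAttributes({}, backend, descriptors);
tesseract.tessedit_make_boxes_from_boxes = true;
tesseract.tessedit_pageseg_mode = 6;
```

Because the accessors are enumerable, Object.keys(tesseract) lists every known variable, which covers the enumeration requested earlier in the thread.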

schulzch commented 10 years ago

Thanks a lot! I've applied it with some changes for MSVC.