AiPacino / tesseract-ocr

Automatically exported from code.google.com/p/tesseract-ocr

Use-after-free bug in HOCR output #1197

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Generate HOCR output from any file while running under Valgrind's memory checker.
2. Observe complaints about use of freed memory.
3. Observe corrupted HOCR output (sometimes).

What is the expected output? What do you see instead?

The bug is in calls such as: hocr_str += HOcrEscape(something); where 
HOcrEscape() builds its result in a local STRING and returns that object's 
internal char pointer. The STRING is destroyed when the function returns, so 
the caller receives a pointer to freed memory. If that memory is reused for 
another purpose before the string concatenation occurs, garbage appears in the 
output file.
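
To make the pattern concrete, here is a minimal sketch that uses std::string as a stand-in for Tesseract's STRING class. The names EscapeInto, Escaped and Demo are hypothetical, and the entity table is an assumption (only the '<' case is visible in the patch context below). The dangling variant is left as a comment so that the block stays safe to compile and run.

```cpp
#include <cassert>
#include <string>

// Buggy shape (illustration only, do not use):
//   const char* Escape(const char* text) {
//     std::string ret;
//     ...append to ret...
//     return ret.c_str();  // ret is destroyed here, so the pointer dangles
//   }

// Out-parameter shape, as in the proposed patch: the caller owns the
// destination string, so no temporary buffer outlives its owner.
void EscapeInto(const char* text, std::string& out) {
  for (const char* ptr = text; *ptr; ++ptr) {
    switch (*ptr) {
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;    // assumed entity
      case '&': out += "&amp;"; break;   // assumed entity
      case '"': out += "&quot;"; break;  // assumed entity
      case '\'': out += "&#39;"; break;  // assumed entity
      default: out += *ptr;
    }
  }
}

// Alternative shape: return the string by value, so the result owns its
// own memory and can be appended with += safely.
std::string Escaped(const char* text) {
  std::string out;
  EscapeInto(text, out);
  return out;
}

// Usage sketch: append an escaped file name to partially built output.
void Demo() {
  std::string hocr = "title='image \"";
  EscapeInto("a<b>.png", hocr);
  assert(hocr == "title='image \"a&lt;b&gt;.png");
}
```

Either shape avoids the problem; the out-parameter version also saves one temporary string per call, at the cost of a less convenient call site.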

What version of the product are you using? On what operating system?

Latest SVN on OS X.

Please provide any additional information below.

The issue can be fixed by a patch such as follows:

Index: api/baseapi.cpp
===================================================================
--- api/baseapi.cpp     (revision 1099)
+++ api/baseapi.cpp     (working copy)
@@ -1366,7 +1366,11 @@

   hocr_str.add_str_int("  <div class='ocr_page' id='page_", page_id);
   hocr_str += "' title='image \"";
-  hocr_str += input_file_ ? HOcrEscape(input_file_->string()) : "unknown";
+  if (input_file_) {
+    HOcrEscape(input_file_->string(), hocr_str);
+  } else {
+    hocr_str += "unknown";
+  }
   hocr_str.add_str_int("\"; bbox ", rect_left_);
   hocr_str.add_str_int(" ", rect_top_);
   hocr_str.add_str_int(" ", rect_width_);
@@ -1443,7 +1447,7 @@
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != 0) {
         if (grapheme[1] == 0) {
-          hocr_str += HOcrEscape(grapheme);
+          HOcrEscape(grapheme, hocr_str);
         } else {
           hocr_str += grapheme;
         }
@@ -2568,9 +2572,8 @@
 }

 /** Escape a char string - remove <>&"' with HTML codes. */
-const char* HOcrEscape(const char* text) {
+const void HOcrEscape(const char* text, STRING& ret) {
   const char *ptr;
-  STRING ret;
   for (ptr = text; *ptr; ptr++) {
     switch (*ptr) {
       case '<': ret += "&lt;"; break;
@@ -2581,6 +2584,5 @@
       default: ret += *ptr;
     }
   }
-  return ret.string();
 }
 }  // namespace tesseract.
Index: api/baseapi.h
===================================================================
--- api/baseapi.h       (revision 1099)
+++ api/baseapi.h       (working copy)
@@ -865,7 +865,7 @@
 };  // class TessBaseAPI.

 /** Escape a char string - remove &<>"' with HTML codes. */
-const char* HOcrEscape(const char* text);
+const void HOcrEscape(const char* text, STRING& ret);
 }  // namespace tesseract.

 #endif  // TESSERACT_API_BASEAPI_H__

This is by no means the only invalid memory access that Valgrind reports. I 
get a lot of output like this:

==3991== Invalid read of size 4
==3991==    at 0x5170385: QUAD_COEFFS::y(float) const (quadratc.h:41)
==3991==    by 0x516F8F8: QSPLINE::y(double) const (quspline.cpp:223)
==3991==    by 0x4F9A7BA: ROW::base_line(float) const (ocrrow.h:59)
==3991==    by 0x4FD5277: tesseract::PageIterator::Baseline(tesseract::PageIteratorLevel, int*, int*, int*, int*) const (pageiterator.cpp:485)
==3991==    by 0x4F94309: tesseract::AddBaselineCoordsTohOCR(tesseract::PageIterator const*, tesseract::PageIteratorLevel, STRING*) (baseapi.cpp:1289)

This evidently has something to do with the splines. The program seems to make 
0-byte malloc() calls when creating QSPLINEs, which Valgrind complains about, 
and later accesses memory past the end of the allocated area.

Original issue reported on code.google.com by alank...@bel.fi on 11 May 2014 at 7:35

GoogleCodeExporter commented 9 years ago
Thanks. Fixed in r1100 (and r1101 ;-) )
Please do not copy and paste a patch into a comment; attach it to the issue instead.

Original comment by zde...@gmail.com on 11 May 2014 at 9:26