Use-after-free bug in HOCR output

What steps will reproduce the problem?
1. Generate HOCR output from any file while running under Valgrind's memory checker.
2. Observe complaints about using freed memory.
3. Observe corrupted HOCR output (sometimes).

What is the expected output? What do you see instead?

The bug arises from calls of the form hocr_str += HOcrEscape(something); where 
the memory returned by HOcrEscape() is owned by a local variable that is 
destroyed when the function returns, so the returned pointer refers to freed 
memory. Garbage sometimes appears in the output file if that memory is reused 
for another purpose before the string concatenation occurs.
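
The pattern can be sketched in isolation as follows. This is a minimal illustration, not Tesseract's code: it uses std::string in place of Tesseract's STRING class, EscapeInto is a hypothetical name, and the exact replacement strings for each character are an assumption (the report only shows the '<' case).

```cpp
#include <string>

// Buggy shape, shown as a comment for comparison only:
//
//   const char* EscapeBroken(const char* text) {
//     std::string ret;
//     /* ... append escaped characters to ret ... */
//     return ret.c_str();  // ret is destroyed on return; the pointer dangles
//   }

// Hypothetical stand-in for the patched function: the caller owns the
// destination string, so no pointer into a destroyed local ever escapes.
// The entity strings below are assumed, not taken from the Tesseract source.
void EscapeInto(const char* text, std::string& out) {
  for (const char* ptr = text; *ptr; ++ptr) {
    switch (*ptr) {
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      case '&': out += "&amp;"; break;
      case '"': out += "&quot;"; break;
      case '\'': out += "&#39;"; break;
      default: out += *ptr;
    }
  }
}
```

Appending into a caller-supplied string keeps the buffer's lifetime under the caller's control, which is the same approach the patch below takes.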

What version of the product are you using? On what operating system?

Latest SVN on OS X.

Please provide any additional information below.

The issue can be fixed by a patch such as the following:

Index: api/baseapi.cpp
===================================================================
--- api/baseapi.cpp     (revision 1099)
+++ api/baseapi.cpp     (working copy)
@@ -1366,7 +1366,11 @@

   hocr_str.add_str_int("  <div class='ocr_page' id='page_", page_id);
   hocr_str += "' title='image \"";
-  hocr_str += input_file_ ? HOcrEscape(input_file_->string()) : "unknown";
+  if (input_file_) {
+    HOcrEscape(input_file_->string(), hocr_str);
+  } else {
+    hocr_str += "unknown";
+  }
   hocr_str.add_str_int("\"; bbox ", rect_left_);
   hocr_str.add_str_int(" ", rect_top_);
   hocr_str.add_str_int(" ", rect_width_);
@@ -1443,7 +1447,7 @@
       const char *grapheme = res_it->GetUTF8Text(RIL_SYMBOL);
       if (grapheme && grapheme[0] != 0) {
         if (grapheme[1] == 0) {
-          hocr_str += HOcrEscape(grapheme);
+          HOcrEscape(grapheme, hocr_str);
         } else {
           hocr_str += grapheme;
         }
@@ -2568,9 +2572,8 @@
 }

 /** Escape a char string - remove <>&"' with HTML codes. */
-const char* HOcrEscape(const char* text) {
+const void HOcrEscape(const char* text, STRING& ret) {
   const char *ptr;
-  STRING ret;
   for (ptr = text; *ptr; ptr++) {
     switch (*ptr) {
       case '<': ret += "&lt;"; break;
@@ -2581,6 +2584,5 @@
       default: ret += *ptr;
     }
   }
-  return ret.string();
 }
 }  // namespace tesseract.
Index: api/baseapi.h
===================================================================
--- api/baseapi.h       (revision 1099)
+++ api/baseapi.h       (working copy)
@@ -865,7 +865,7 @@
 };  // class TessBaseAPI.

 /** Escape a char string - remove &<>"' with HTML codes. */
-const char* HOcrEscape(const char* text);
+const void HOcrEscape(const char* text, STRING& ret);
 }  // namespace tesseract.

 #endif  // TESSERACT_API_BASEAPI_H__

This is by no means the only invalid memory access that Valgrind reports. I 
get a lot of output like this:

==3991== Invalid read of size 4
==3991==    at 0x5170385: QUAD_COEFFS::y(float) const (quadratc.h:41)
==3991==    by 0x516F8F8: QSPLINE::y(double) const (quspline.cpp:223)
==3991==    by 0x4F9A7BA: ROW::base_line(float) const (ocrrow.h:59)
==3991==    by 0x4FD5277: tesseract::PageIterator::Baseline(tesseract::PageIteratorLevel, int*, int*, int*, int*) const (pageiterator.cpp:485)
==3991==    by 0x4F94309: tesseract::AddBaselineCoordsTohOCR(tesseract::PageIterator const*, tesseract::PageIteratorLevel, STRING*) (baseapi.cpp:1289)

This evidently has something to do with the splines. The program seems to make 
0-byte malloc() calls when creating QSPLINEs, which Valgrind complains about, 
and later reads memory past the end of the allocated area.
Original issue reported on code.google.com by alank...@bel.fi on 11 May 2014 at 7:35