ericmckean / cld2

Automatically exported from code.google.com/p/cld2
0 stars 0 forks source link

Valgrind errors? #3

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi, thanks for the awesome library. I'm seeing a couple memory errors in 
valgrind when I use it.

The first:

==7805== Conditional jump or move depends on uninitialised value(s)
==7805==    at 0x4C2CB94: strcmp (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==7805==    by 0x43C412: CLD2::DoTLDLookup(char const*, CLD2::TLDLookup const*, 
int) (compact_lang_det_hint_code.cc:1034)
==7805==    by 0x43D705: CLD2::SetCLDTLDHint(char const*, CLD2::CLDLangPriors*) 
(compact_lang_det_hint_code.cc:1452)
==7805==    by 0x40CEB0: CLD2::ApplyHints(char const*, int, bool, 
CLD2::CLDHints const*, CLD2::ScoringContext*) (compact_lang_det_impl.cc:1504)
==7805==    by 0x40DC4F: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1644)
==7805==    by 0x409BAE: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==7805==    by 0x405932: codulus::main(int, char**) 
(test_language_detection.cc:43)
==7805==    by 0x406341: main (test_language_detection.cc:64)
==7805== 

This one seems reasonable to me, DoTLDLookup is using strcmp, but the value of 
'key' passed to it is not null terminated.

The other issue I see is an invalid read of one character past the end of my 
input in a couple places in the code:

==8337== Invalid read of size 1
==8337==    at 0x415932: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:973)
==8337==    by 0x415DAE: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) 
(test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

==8337== Invalid read of size 1
==8337==    at 0x414D3C: CLD2::UTF8OneCharLen(char const*) 
(utf8statetable.h:270)
==8337==    by 0x415A6D: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:991)
==8337==    by 0x415DAE: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) 
(test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

==8337== Invalid read of size 1
==8337==    at 0x41D1A3: 
CLD2::UTF8GenericPropertyTwoByte(CLD2::UTF8StateMachineObj_2 const*, unsigned 
char const**, int*) (utf8statetable.cc:403)
==8337==    by 0x414D24: CLD2::GetUTF8LetterScriptNum(char const*) 
(getonescriptspan.cc:1098)
==8337==    by 0x415A87: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:992)
==8337==    by 0x415DAE: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1074)
==8337==    by 0x40DCE9: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8337==    by 0x40991E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8337==    by 0x405869: codulus::main(int, char**) 
(test_language_detection.cc:42)
==8337==    by 0x4060B1: main (test_language_detection.cc:63)

For now, I'm working around this by passing (input, size - 1) instead of 
(input, size) to cld2. My input is not null terminated, if that makes a 
difference. It seems to happen with every input I try (they are all web pages, 
by the way). Also, I am running this on x64 linux.

Any ideas?

Original issue reported on code.google.com by cha...@gmail.com on 21 Aug 2013 at 6:04

GoogleCodeExporter commented 9 years ago
TLD test fixed
1 byte overshoots all from a missing end test; added
Thanks for the help. Please recheck with valgrind if easy. /dick

Original comment by dsi...@google.com on 24 Aug 2013 at 7:06

GoogleCodeExporter commented 9 years ago
Hi, thanks for working on this.

I'm now seeing a valgrind error at a different spot in getonescriptspan.cc: 

==8417== Invalid read of size 1
==8417==    at 0x430FEA: CLD2::ScriptScanner::GetOneScriptSpan(CLD2::LangSpan*) 
(getonescriptspan.cc:1013)
==8417==    by 0x43123E: 
CLD2::ScriptScanner::GetOneScriptSpanLower(CLD2::LangSpan*) 
(getonescriptspan.cc:1075)
==8417==    by 0x429439: CLD2::DetectLanguageSummaryV2(char const*, int, bool, 
CLD2::CLDHints const*, bool, int, CLD2::Language, CLD2::Language*, int*, 
double*, std::__1::vector<CLD2::ResultChunk, 
std::__1::allocator<CLD2::ResultChunk> >*, int*, bool*) 
(compact_lang_det_impl.cc:1707)
==8417==    by 0x42506E: CLD2::DetectLanguageSummary(char const*, int, bool, 
char const*, int, CLD2::Language, CLD2::Language*, int*, int*, bool*) 
(compact_lang_det.cc:133)
==8417==    by 0x41B188: codulus::GetLanguage(codulus::Slice const&, 
codulus::Slice const&, CLD2::Language*, int*, int*, bool*) 
(language_detection.cc:29)
==8417==    by 0x4056DC: codulus::main(int, char**) 
(test_language_detection.cc:41)
==8417==    by 0x405EE1: main (test_language_detection.cc:63)
==8417==  Address 0x6ac1afd is 0 bytes after a block of size 7,965 alloc'd
==8417==    at 0x4C2B6CD: malloc (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==8417==    by 0x5F9ED9: operator new(unsigned long) (new.cpp:50)
==8417==    by 0x40A842: codulus::Arena::AllocateNewBlock(unsigned long) 
(arena.cc:70)
==8417==    by 0x40A7A2: codulus::Arena::AllocateFallback(unsigned long) 
(arena.cc:36)
==8417==    by 0x41584F: codulus::Arena::Allocate(unsigned long) (arena.h:65)
==8417==    by 0x41F775: codulus::InterchangeValidUTF8(codulus::Slice const&, 
codulus::Arena*, unsigned long*) (print.cc:204)
==8417==    by 0x40567F: codulus::main(int, char**) 
(test_language_detection.cc:33)
==8417==    by 0x405EE1: main (test_language_detection.cc:63)
==8417== 

Maybe the while condition on line 1013 should be:
  'while ((0 < take) && (take < byte_length_) && ((next_byte_[take] & 0xc0) == 0x80))'
?

Also, for the tld hint fix in compact_lang_det_hint_code.cc, might it have an 
issue with two-letter tlds? Our internal fix for this guy was to change the 
strncpy's length argument from 'len' to 'len + 1'.

Original comment by cha...@gmail.com on 24 Aug 2013 at 9:19

GoogleCodeExporter commented 9 years ago
Not fully awake yesterday. Your first suggestion added; second changed to 
actual buffer size, as it should have been all along. /dick

Original comment by dsi...@google.com on 25 Aug 2013 at 5:17

GoogleCodeExporter commented 9 years ago
This looks great and fixes the valgrind errors we are seeing. Thanks!

P.S.
I think it'd be safe to remove the manual null termination at 
compact_lang_det_hint_code.cc:1451 now, if you want.

Original comment by cha...@gmail.com on 26 Aug 2013 at 7:48