adamritter / fastgron

High-performance JSON to GRON (greppable, flattened JSON) converter
MIT License
592 stars 10 forks source link

malloc(): corrupted top size when piping moderately sized json to stdin #16

Closed DrPyser closed 9 months ago

DrPyser commented 1 year ago

Hi. Great tool, thanks for this work!

I'm encountering this error when piping output of pip inspect, which generates a json object of 880412 bytes.

❯ stdbuf -i0 -o0 -e0 python3 -u -m pip inspect | fastgron
malloc(): corrupted top size
fish: Process 295949, 'fastgron' from job 1, 'stdbuf -i0 -o0 -e0 python3 -u -…' terminated by signal SIGABRT (Abort)

Or simplified:

❯ python3 -u -m pip inspect | fastgron
malloc(): corrupted top size
fish: Process 296216, 'fastgron' from job 1, 'python3 -u -m pip inspect | fas…' terminated by signal SIGABRT (Abort)

Any way I can get and provide further debugging information?

adamritter commented 9 months ago

Size shouldn't be a problem (fastgron should be able to handle mostly anything that fits in RAM), for me it works on my computer.

Please try to upload the file

json.environment.platform_version = "Darwin Kernel Version 23.0.0: Fri Sep 15 14:41:34 PDT 2023; root:xnu-10002.1.13~1/RELEASE_ARM64_T8103" json.environment.python_full_version = "3.9.4" json.environment.platform_python_implementation = "CPython" json.environment.python_version = "3.9" json.environment.sys_platform = "darwin"

cozzyd commented 9 months ago

I can reproduce on my system (F38) with gcc 13.2 and running pip inspect on my system.

Here is valgrind trace:

$ valgrind --track-origins=yes ./fastgron pip-inspect.json 
==878965== Memcheck, a memory error detector
==878965== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==878965== Using Valgrind-3.22.0 and LibVEX; rerun with -h for copyright info
==878965== Command: ./fastgron pip-inspect.json
==878965== 
==878965== Invalid write of size 1
==878965==    at 0x484F303: memmove (vg_replace_strmem.c:1410)
==878965==    by 0x41BC44: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:202)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x41B566: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:60)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x405C40: main (fastgron.cpp:554)
==878965==  Address 0x6113b50 is 0 bytes after a block of size 2,000 alloc'd
==878965==    at 0x4844723: operator new[](unsigned long) (vg_replace_malloc.c:725)
==878965==    by 0x41B8A7: reserve_extra (growing_string.hpp:51)
==878965==    by 0x41B8A7: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:167)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x41B566: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:60)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x405C40: main (fastgron.cpp:554)
==878965== 

valgrind: m_mallocfree.c:304 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 2064, hi = 3400000511170344040.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.

host stacktrace:
==878965==    at 0x58044A7A: show_sched_status_wrk (m_libcassert.c:407)
==878965==    by 0x58044BBF: report_and_quit (m_libcassert.c:478)
==878965==    by 0x58044D55: vgPlain_assert_fail (m_libcassert.c:544)
==878965==    by 0x5804ED88: get_bszB_as_is (m_mallocfree.c:304)
==878965==    by 0x5804ED88: is_inuse_block (m_mallocfree.c:332)
==878965==    by 0x5804ED88: vgPlain_describe_arena_addr (m_mallocfree.c:1614)
==878965==    by 0x5803CDAA: vgPlain_describe_addr (m_addrinfo.c:185)
==878965==    by 0x5803B3C7: vgMemCheck_update_Error_extra (mc_errors.c:1444)
==878965==    by 0x5803FFB0: vgPlain_maybe_record_error (m_errormgr.c:824)
==878965==    by 0x5803A240: vgMemCheck_record_address_error (mc_errors.c:917)
==878965==    by 0x10092E09C1: ???
==878965==    by 0x10090B1F0F: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 878965)
==878965==    at 0x41BC4C: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:206)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x41B566: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:60)
==878965==    by 0x41D611: recursive_print_gron(simdjson::fallback::ondemand::value, growing_string&, growing_string&, unsigned int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) (print_gron.cpp:154)
==878965==    by 0x405C40: main (fastgron.cpp:554)
client stack range: [0x1FFEFFD000 0x1FFF000FFF] client SP: 0x1FFEFFEBD0
valgrind stack range: [0x1008FB2000 0x10090B1FFF] top usage: 64144 of 1048576

Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.

Using vgdb didn't seem to help too much with the default build flags, but by changing from O3 to Og, the culprit seems to be a line that is very very long (3239 characters).

Program received signal SIGTRAP, Trace/breakpoint trap.
0x000000000484f303 in _vgr20181ZZ_libcZdsoZa_memmove (dst=0x6113719, src=<optimized out>, len=<optimized out>) at ../shared/vg_replace_strmem.c:1410
1410    ../shared/vg_replace_strmem.c: No such file or directory.
Missing separate debuginfos, use: dnf debuginfo-install cyrus-sasl-lib-2.1.28-9.fc38.x86_64 glibc-2.37-14.fc38.x86_64 keyutils-libs-1.6.1-6.fc38.x86_64 krb5-libs-1.21-3.fc38.x86_64 libbrotli-1.0.9-11.fc38.x86_64 libcom_err-1.46.5-4.fc38.x86_64 libcurl-8.0.1-5.fc38.x86_64 libevent-2.1.12-8.fc38.x86_64 libgcc-13.2.1-4.fc38.x86_64 libidn2-2.3.4-2.fc38.x86_64 libnghttp2-1.52.0-2.fc38.x86_64 libpsl-0.21.2-2.fc38.x86_64 libselinux-3.5-1.fc38.x86_64 libssh-0.10.5-1.fc38.x86_64 libstdc++-13.2.1-4.fc38.x86_64 libunistring1.0-1.0-1.fc38.x86_64 libxcrypt-4.4.36-1.fc38.x86_64 openldap-2.6.6-1.fc38.x86_64
(gdb) bt
#0  0x000000000484f303 in _vgr20181ZZ_libcZdsoZa_memmove (dst=0x6113719, src=<optimized out>, len=<optimized out>) at ../shared/vg_replace_strmem.c:1410
#1  0x000000000040f872 in recursive_print_gron (element=..., path=..., out_growing_string=..., flags=flags@entry=389, filters=std::vector of length 0, capacity 0) at /home/cozzy/Downloads/fastgron/src/print_gron.cpp:202
#2  0x000000000040f26b in recursive_print_gron (element=..., path=..., out_growing_string=..., flags=flags@entry=389, filters=std::vector of length 0, capacity 0) at /home/cozzy/Downloads/fastgron/src/print_gron.cpp:154
#3  0x000000000040f26b in recursive_print_gron (element=..., path=..., out_growing_string=..., flags=flags@entry=389, filters=std::vector of length 0, capacity 0) at /home/cozzy/Downloads/fastgron/src/print_gron.cpp:154
#4  0x000000000040d2cd in recursive_print_gron (element=..., path=..., out_growing_string=..., flags=flags@entry=389, filters=std::vector of length 0, capacity 0) at /home/cozzy/Downloads/fastgron/src/print_gron.cpp:60
#5  0x000000000040f26b in recursive_print_gron (element=..., path=..., out_growing_string=..., flags=389, filters=std::vector of length 0, capacity 0) at /home/cozzy/Downloads/fastgron/src/print_gron.cpp:154
#6  0x000000000040582d in main (argc=<optimized out>, argv=<optimized out>) at /home/cozzy/Downloads/fastgron/src/fastgron.cpp:554
(gdb) frame 1
#1  0x000000000040f872 in recursive_print_gron (element=..., path=..., out_growing_string=..., flags=flags@entry=389, filters=std::vector of length 0, capacity 0) at /home/cozzy/Downloads/fastgron/src/print_gron.cpp:202
202         memcpy(ptr, s.data(), s.size());
(gdb) p out_growing_string 
$1 = (growing_string &) @0x449880: {
  data = 0x6113380 "json = \033[1;34m{}\033[0m\njson.\033[1;34mversion\033[0m = \033[1;32m\"1\"\033[0m\njson.\033[1;34mpip_version\033[0m = \033[1;32m\"23.2.1\"\033[0m\njson.\033[1;34minstalled\033[0m = \033[1;34m[]\033[0m\njson.\033[1;34minstalled\033[0m\033[1;34m[\033[1;32m0\033[1;3"..., len = 815, capacity = 2000}
(gdb) p s.size()
$2 = 3239
(gdb) p s.data()
$3 = (const std::basic_string_view<char, std::char_traits<char> >::value_type *) 0x59b73be "\"<img align=\\\"right\\\" src=\\\"https://uber.github.io/img/h3Logo-color.svg\\\" alt=\\\"H3 Logo\\\" width=\\\"200\\\">\\n\\n# **h3-py**: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in Python\\n\\n<!-- T"...
(gdb) p ptr
$4 = 0x6113719 "\"<img align=\\\"right\\\" src=\\\"https://uber.github.io/img/h3Logo-color.svg\\\" alt=\\\"H3 Logo\\\" width=\\\"200\\\">\\n\\n# **h3-py**: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in Python\\n\\n<!-- T"...
(gdb) p ptr - out_growing_string.data
$5 = 921

I only looked very cursorily, but I don't see anywhere where there's a check that there is enough room for the element. If I apply the attached patch it fixes it for me, though I don't know if the fix should really go somewhere else. 0001-fix-clobbering-the-heap-on-long-lines-by-making-sure.patch.txt

adamritter commented 9 months ago

Wow, thanks for the fix, I introduced this bug in an optimization step in the inner loop. I added your fix and released v0.7.0 as it's an essential upgrade for everybody.