Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

random crash and hang in option --zmax>1 when nested jar has a decompression error #313

Closed mmyjona closed 10 months ago

mmyjona commented 10 months ago

Given the following folder structure, ugrep may segfault or deadlock when run with the -z and -r options.

  --root folder (pwd)
    |-- a.jar
           |-- b.jar (within a.jar) 
    |-- b.jar

When decompressing the jar files, the following warnings indicate a potential race condition:

ugrep: warning: cannot decompress b.jar: a zlib decompression error was detected in the zip compressed data
ugrep: warning: cannot decompress b.jar: EOF detected in the zip compressed data

genivia-inc commented 10 months ago

Looks like a decompression failure for some reason that triggers a more severe problem (a hang) afterwards. I'm not convinced it is a race condition problem.

More information is needed. Which options did you use? Do you use --zmax=2 to search nested zip/jar files? Does the number of threads matter (try -J1 to run with one thread)? Which decompression libraries were linked (ug --version shows the list)?

mmyjona commented 10 months ago

the full command is:

ugrep -a -z --zmax=2 -r "xxx" .

-J1 doesn't make the hang and the segmentation fault go away, so it's not a race condition problem after all.

The version is 4.3.2 for macOS, installed with Homebrew:

ugrep 4.3.2 x86_64-apple-darwin23.0.0 +avx2; -P:pcre2jit; -z:zlib,bzip2,lzma
License: BSD-3-Clause;  ugrep user guide:  https://ugrep.com
Written by Robert van Engelen and others:  https://github.com/Genivia/ugrep
Ugrep utilizes the RE/flex regex library:  https://github.com/Genivia/RE-flex

mmyjona commented 10 months ago

Here is the backtrace when it segfaults (screenshot attached):

genivia-inc commented 10 months ago

Thanks for sharing the details and trace!

I'm surprised that zlib crashes there. After the last inflate() call that returns an error, we still need to call inflateEnd() to release zlib memory (see the zlib documentation). We should not just return from decompress(), since that may leak zlib memory. Note that setting zend_ = true prevents further inflate() calls when we're done decompressing or when an error occurred, upon which we call inflateEnd(). If inflateEnd() crashes, that would be bad.

genivia-inc commented 10 months ago

I've made some progress. I can replicate a hang on a zip/zlib error with --zmax=2 by creating a corrupt nested zip file stored in a zip file:

> ./build.sh CXXFLAGS=-g
> cp tests/Hello.java test.java
> zip test.zip test.java
> head -c80 test.zip > test_corrupt.zip ; rm -f test_err.zip ; zip test_err.zip test_corrupt.zip test.java
> src/ugrep -z --zmax=2 '' test_err.zip
ugrep: warning: cannot decompress test_corrupt.zip: EOF detected in the zip compressed data
(hangs)

Error recovery should not hang. With head -c60 there is an error produced, but no hang. A hang is specific to the way the nested zip is corrupted. There are lots of internal checks to catch errors, but recovery is always a bit of a challenge. We want error recovery and don't want to quit searching after a zip warning.

genivia-inc commented 10 months ago

In Zthread::join() at ugrep.cpp:1021 we should also notify the pipe_ready cv, not only the pipe_zstrm cv. This is needed when something bad, like an error, happens in the decompressor: it recovers and then waits on cv pipe_ready for a new pipe before decompressing again with other data, but now gets stuck on that cv wait. To prevent that, the change to Zthread::join() is:

    if (thread.joinable())
    {
      std::unique_lock<std::mutex> lock(pipe_mutex);
      quit = true;
      if (!waiting)
      {
        // wait until decompression thread closes the pipe
        pipe_close.wait(lock);
      }
      else
      {
        // decompression thread should terminate but it's waiting
        pipe_zstrm.notify_one(); // release thread waiting in close_wait_zstream_open()
        pipe_ready.notify_one(); // release thread waiting in wait_pipe_ready()
      }
      lock.unlock();

      // now wait for the decompression thread to join
      thread.join();
    }

The logic looks good to me, but this is not enough. Some additional work is needed to improve error recovery to make sure progress can be made. Who said error recovery is easy peasy? 😳

genivia-inc commented 10 months ago

In addition, Zthread::start() at ugrep.cpp:814 should do the same:

      if (thread.joinable())
      {
        // wake decompression thread
        pipe_zstrm.notify_one(); // wake thread waiting in close_wait_zstream_open(), there is work to do
        pipe_ready.notify_one(); // wake thread waiting in wait_pipe_ready(), when it encountered an error
      }

This prevents a hang when something went wrong in the decompression thread, which continues to run looking for a new open pipe with other data to decompress.

Edit: adding a test case to demonstrate:

> head -c80 test.zip > test_corrupt.zip ; rm -f test_err.zip ; zip test_err.zip test_corrupt.zip test.java
> ugrep -J1 -z --zmax=2 '' test_err.zip test.zip

Note that -J1 is important here, which means that the Grep object sends the next file test.zip to the decompressor instead of searching in parallel with two threads allocated to search test_err.zip and test.zip.

genivia-inc commented 10 months ago

I've committed a src/ugrep.cpp update to build and test.

mmyjona commented 10 months ago

I pulled your recovery fixes in ugrep.cpp and reverted my early-return modifications. Then I added some logging to see if inflateEnd is called after a negative return code.


          // decompress non-empty zbuf_[] into the given buf[]
          if (z_strm_->avail_in > 0 && z_strm_->avail_out == 0)
          {
            z_strm_->next_out  = buf;
            z_strm_->avail_out = len;

            std::cout << "before inflate444" << pathname_  << num << "\n";
            ret = inflate(z_strm_, Z_NO_FLUSH);
            std::cout << "after inflate444" << pathname_  << num << "\n";

            zcur_ = zlen_ - z_strm_->avail_in;

            if (ret != Z_OK && ret != Z_STREAM_END && ret != Z_BUF_ERROR)
            {
              cannot_decompress(pathname_, "a zlib decompression error was detected in the zip compressed data");
              num = -1;
              zend_ = true;
              if (ret < Z_OK) {
                std::cout << "ret < 0 456" << "\n";
//                return false;
              }
            }
          }

          // read compressed data into zbuf_[] and decompress zbuf_[] into the given buf[]
          if (ret == Z_OK && z_strm_->avail_in == 0 && num < static_cast<int>(len))
          {
            if (read())
            {
              z_strm_->next_in  = zbuf_;
              z_strm_->avail_in = zlen_;

              if (num == 0)
              {
                z_strm_->next_out  = buf;
                z_strm_->avail_out = len;
              }
              std::cout << "before inflate480" << pathname_  << num << "\n";
              ret = inflate(z_strm_, Z_NO_FLUSH);
              std::cout << "after inflate480" << pathname_  << num << "\n";

              zcur_ = zlen_ - z_strm_->avail_in;

              if (ret != Z_OK && ret != Z_STREAM_END && ret != Z_BUF_ERROR)
              {
                cannot_decompress(pathname_, "a zlib decompression error was detected in the zip compressed data");
                num = -1;
                zend_ = true;
                if (ret < Z_OK) {
                  std::cout << "ret < 0 492" << "\n";
//                  return false;
                }
              }

Then I added some logging around the inflateEnd call:

          if (zend_)
          {
            std::cout << "before inflateEnd" << pathname_  << num << "\n";
            ret = inflateEnd(z_strm_);
            std::cout << "after inflateEnd" << pathname_   << num  << "\n";
            if (ret != Z_OK)
            {
              std::cout << "inflateEnd returns" << ret << pathname_   << num  << "\n";
              cannot_decompress(pathname_, "a zlib decompression error was detected in the zip compressed data");
              num = -1;
            }
          }

I can still reproduce the hang and here is the log:

ret < 0 456
before inflateEndb.jar-1
after inflateEndb.jar-1
after inflate480/Users/xxx/src/test_repo/t/t2/b.jar3153
before inflate444/Users/xxx/src/test_repo/t/t2/b.jar0
after inflate444/Users/xxx/src/test_repo/t/t2/b.jar0
before inflateEnd/Users/xxx/src/test_repo/t/t2/b.jar274
after inflateEnd/Users/xxx/src/test_repo/t/t2/b.jar274
ugrep: warning: cannot decompress b.jar: a zlib decompression error was detected in the zip compressed data
(hang)

inflateEnd does sometimes return -2 (Z_STREAM_ERROR), but not in this case.

mmyjona commented 10 months ago

I found where it hung:

ugrep.cpp:1204
            if (is_selected && !drain && write(pipe_fd[1], buf, static_cast<size_t>(len)) < len)

The write call never returns.

genivia-inc commented 10 months ago

OK. That's a more obvious problem to look into: the other end of the pipe is still open but no longer reading, probably because of an early termination on a zip error and/or when trying to recover from that error.

Both ends close each time a zip-compressed file is processed, then reopen a pipe for the next archive part to process until the entire zip was processed. Closing one end causes a broken pipe if data is pending and write() terminates with < len data sent, which is what we want in this case. The logic at ugrep.cpp:1284 skips files that are not selected to be searched (is_selected) and when we're not draining the input that should not be sent to the pipe (!drain).

This may seem complex to search zip files, but keep in mind we can handle nested zip this way and also skip over parts we don't want to search or that we cannot search. Tarballs are processed differently, where we just decompress them and then break up the stream of tar contents to search without using pipes and just --zmax=1. When tarballs are located inside zip, then the zip thread does the same thing to allow searching tarballs in zip (with --zmax=2).

genivia-inc commented 10 months ago

I noticed that the placement of the new "wake pipe_ready" may cause another issue, since the pipe fd is not -1 (not closed). It should be moved up to the beginning of start():

    // reset pipe descriptors, pipe is closed
    pipe_fd[0] = -1;
    pipe_fd[1] = -1;

    // wake decompression thread waiting in wait_pipe_ready(), when it encountered an error
    if (thread.joinable() && waiting)
      pipe_ready.notify_one();

This still needs work though. I'm not 100% convinced that error recovery with nested zip can't cause problems down the pipe.

It would be great if you can share your a.jar and b.jar so I can replicate your observations then my tests are more complete. Testing with artificially introduced errors isn't great.

mmyjona commented 10 months ago

> I noticed that the placement of the new "wake pipe_ready" may cause another issue, since the pipe fd is not -1 (not closed). It should be moved up to the beginning of start():
>
>     // reset pipe descriptors, pipe is closed
>     pipe_fd[0] = -1;
>     pipe_fd[1] = -1;
>
>     // wake decompression thread waiting in wait_pipe_ready(), when it encountered an error
>     if (thread.joinable() && waiting)
>       pipe_ready.notify_one();
>
> This still needs work though. I'm not 100% convinced that error recovery with nested zip can't cause problems down the pipe.
>
> It would be great if you can share your a.jar and b.jar so I can replicate your observations then my tests are more complete. Testing with artificially introduced errors isn't great.

I can confirm that adding if (waiting) to start() fixes the hang. Now the exception simply throws SIGABRT to the top and "libc++abi: terminating" is output to stderr.

genivia-inc commented 10 months ago

This only addresses synchronization, not whether an undefined variable leads to a crash. The path(s) on which that happens are not yet clear to me, which is harder to find without a debug trace that I can run myself. All pipe open/close calls should also be paired without a double close (which will crash), which might be the issue here when threads get out of sync.

Also, the waiting thread in wait_pipe_ready() may still observe pipe_fd[1] != -1 when the notifying thread races ahead of the waking thread, leading to a race condition. To fix that, we need to reuse waiting as a state to check instead of pipe_fd[1], or we need another condition variable. I prefer to reuse waiting to communicate this state change:

  bool wait_pipe_ready()
  {
    if (pipe_fd[1] == -1)
    {
      // signal close and wait until a new zstream pipe is ready
      std::unique_lock<std::mutex> lock(pipe_mutex);

      pipe_close.notify_one();
      waiting = true;
      pipe_ready.wait(lock);

      // the receiver did not create a new pipe in close_file()
      if (!waiting)
      {
        lock.unlock();
        return false;
      }
      waiting = false;
      lock.unlock();
    }

    return true;
  }

Note that I often explicitly call unlock() despite the implicit unlock when the lock object is released. An explicit unlock is maintainable: e.g. when we change the code later, we don't end up with code inside the lock before the return that should not be there. An explicit unlock makes the intent clear.

Then we set waiting = false in Zthread::join() by the notifying thread to let Zthread::wait_pipe_ready() return false (no new pipe):

    if (thread.joinable())
    {
      std::unique_lock<std::mutex> lock(pipe_mutex);

      // decompression thread should quit to join
      quit = true;

      if (!waiting)
      {
        // wait until decompression thread closes the pipe
        pipe_close.wait(lock);
      }
      else
      {
        // wake decompression thread waiting in close_wait_zstream_open(), there is no more work to do
        pipe_zstrm.notify_one();

        // wake decompression thread waiting in wait_pipe_ready(), when it encountered an error
        waiting = false;
        pipe_ready.notify_one();
      }

      lock.unlock();

      // now wait for the decompression thread to join
      thread.join();
    }

and in Zthread::start() also:

      if (thread.joinable())
      {
        // wake decompression thread waiting in close_wait_zstream_open(), there is work to do
        pipe_zstrm.notify_one();

        // wake decompression thread waiting in wait_pipe_ready(), when it encountered an error
        waiting = false;
        pipe_ready.notify_one();
      }

This is a step forward to remove the randomness in clearing the hang on a cv wait. However, there is still some behavior that doesn't look right to me, e.g. when searching the next zip file it is skipped instead, which indicates that the error state isn't cleared yet and the next zip ends up down the wrong pipe perhaps? That could also explain a potential crash if a pipe is used twice and closed twice. Just my 2c.

genivia-inc commented 10 months ago

OK, I've found the problem in the Zthread::decompress() logic flow when --zmax exceeds 1: the logic would jump out of a loop when nothing was decompressed (e.g. on an error), which skipped over some code and caused the threads to de-synchronize. This de-synchronization can lead to all sorts of problems, with a hang on a cv being the most obvious case, because there is no notify (early loop exit). A crash is unlikely; I have not seen any crash with any of my negative test cases to reproduce the problem. The only possible crash scenario would be a double close on the pipe fds when the threads de-synchronize and the double fd close calls happen at around the same time (highly unlikely, since each closing end checks before closing).

The jump out of the loop is fine when --zmax=1 (default). Since --zmax was added later, I had overlooked this as a potential problem that would occur when a zip or decompressor has an internal error. Oops.

The earlier suggested changes help, but those should not be needed when the threads run properly side-by-side.

Now, it is still too early to declare success when testing and analysis are still needed to make sure everything works as it should. Testing negative cases (errors in the input) is always harder than testing correct input. I have to rely on creating "fake" negative test cases. However, with the cleaner logic I'm more confident that I don't need to generate a lot of them. We will see...

genivia-inc commented 10 months ago

I've committed my progress on this issue. I rolled back the previous changes to notify pipe_ready and instead updated Zthread::decompress() to perform the necessary state changes that were jumped over. Some other minor changes (mostly cosmetic and for consistency) are included.