eteran / nedit-ng

A Qt5 port of NEdit using modern C++14
GNU General Public License v2.0

core dump when opening gz file #354

Closed marilmanen closed 5 months ago

marilmanen commented 5 months ago

I'm using the following command to see what's inside a gz file:

  nedit-ng -do 'replace_range(0,0,shell_command("gzip -cd test1.gz", ""))'

If the uncompressed file is 608 Mbytes with 19M lines, everything works fine, but with a file size of 725 Mbytes and 25M lines I get:

terminate called after throwing an instance of 'std::bad_alloc'
  what(): std::bad_alloc
Abort (core dumped)

There is no issue with the bigger file if it's first uncompressed to a file and I then open that file with nedit-ng. I have also tested the old NEdit editor and it has no issues.

eteran commented 5 months ago

that's very interesting. Does it matter which .gz file you try to open? Or is it one in particular? (If so, any chance you can somehow send me the offending file?).

I'll take a look, often a std::bad_alloc on a 64-bit system means that something was given a negative size somewhere :-/

marilmanen commented 5 months ago

I modified the content of the SPEF file (replaced all word characters with x) and it had no impact, so it looks like the only thing that matters is the size of the file. I also created a dummy file by duplicating the following sequence until the uncompressed file was >728 Mbytes, and with it I get the crash. After a couple of iterations it looks like the limit for the crash is very close to 715 Mbytes.

Here is the content that I used

*xxx

*xxxxx *xxxxxxx x.xxxxxxxxxx //xxxxxx x.xxx xxxxxx x.xxxxxxxxxx xx

*xxxx
*x *xxxxxx:xx x *x x *x xx.xxx xxx.xxx
*x *xxxxxx:x x *x x.xxxxxxxxx *x xx.xxx xxx.xxx
*x *xxxxxxx:x *x xx.xxx xxx.xxx
*x *xxxxxxx:x *x xx.xxx xxx.xxx
*x *xxxxxxx:x *x xx.xxx xxx.xxx

*xxx
x *xxxxxx:xx xx-xx
x *xxxxxx:x x.xxxxxxxxxx
x *xxxxxxx:x xx-xx
x *xxxxxxx:x x.xxxxxxxxxx
x *xxxxxxx:x x.xxxxxxx-xx
xx *xxxxxx:x *xxxxxxx:xx x.xxxxxxx-xx
xx *xxxxxx:x *xxxxxxx:xx x.xxxxxx-xx
xx *xxxxxx:x *xxxxxx:x x.xxxxx-xx
xx *xxxxxx:x *xxxxxx:x x.xxxxxxx-xx
xx *xxxxxx:x *xxxxxxx:x x.xxxxxx-xx
eteran commented 5 months ago

So, I made a script like this:

#!/bin/bash

IN=test.txt
OUT=test
FILE=test.gz
MINSIZE=728000000
COUNT=0

truncate -s 0 "$OUT"
while true; do
    echo "$COUNT"

    # append the source file 81920 times
    cat $(yes "$IN" | head -n 81920) >> "$OUT"

    gzip -f -c "$OUT" > "$FILE"

    SIZE=$(wc -c <"$FILE")
    if [ "$SIZE" -ge "$MINSIZE" ]; then
        echo "size is over $MINSIZE bytes"
        exit 0
    fi

    COUNT=$((COUNT+1))
done

to try to test it, and I have a couple of questions:

  1. Is this generally what you meant?
  2. How big is the source file? Because a repeated pattern like that compresses particularly well, I've gotten the uncompressed file over 1GB and it only results in a 3MB .gz file!
eteran commented 5 months ago

OK, never mind that last comment. I misread it and thought that the .gz file needed to be a certain size, not the source file. I've replicated the issue and will see if I can fix it ASAP :-)

eteran commented 5 months ago

This is an interesting situation. It may not be obvious at first, but this is actually a case of hitting the memory limit of what can be held in a QString.

I was able to reproduce this with a very trivial Qt application that looks like this:

#include <QByteArray>
#include <QFile>
#include <QString>
#include <QtDebug>

int main() {

    QFile file(QLatin1String("test.txt"));
    if (file.open(QIODevice::ReadOnly)) {
        QByteArray bytes = file.readAll();
        QString text = QString::fromLocal8Bit(bytes.data(), bytes.size());
        qDebug() << text;
    }
}

with a test.txt that is 1854668800 bytes big. Fundamentally, QString is limited to ~2GB of storage, and the number of characters is at best half of that because it uses UTF-16 (it can be less than half due to combining characters and similar). I know you triggered it with a smaller file, but I think it's essentially the same issue, because some QString operations require even more space temporarily.

So, back to nedit-ng. Fortunately, we don't actually use QString for file data that often, but we do currently use it for capturing stdout and stderr of subprocesses. In this case, I read the results of the command you ran (in this case, gzip) into a byte array, and then, because it could be UTF-8, I decode it into a QString (this is where it blows up); finally, if all goes well, I convert it to a character buffer as needed.

I'll have to refactor the code to use a different approach since QString has this limitation. I'll update when I have it worked out.
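(Editorial aside: one possible shape for such a refactor, purely a sketch of the chunking idea and not the code from the eventual PR, is to convert the raw bytes in bounded pieces so that no single temporary ever approaches the ~2 GB ceiling.)

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: walk the subprocess output in fixed-size chunks and
// append each converted chunk to the destination buffer, instead of
// materializing one huge intermediate UTF-16 string. Here the per-chunk
// "conversion" is an identity copy to keep the example dependency-free;
// in nedit-ng it would be the real UTF-8 decoding step.
std::string convertInChunks(const std::vector<char> &raw,
                            std::size_t chunkSize = std::size_t{1} << 20) {
    std::string out;
    out.reserve(raw.size());
    for (std::size_t off = 0; off < raw.size(); off += chunkSize) {
        const std::size_t n = std::min(chunkSize, raw.size() - off);
        out.append(raw.data() + off, n); // stand-in for per-chunk conversion
    }
    return out;
}
```

The key property is that the temporary working set is bounded by `chunkSize` rather than by the total output size.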

eteran commented 5 months ago

@marilmanen I believe that this PR should fix the issue, if it does, please let me know and I'll merge it into master. Thanks!

https://github.com/eteran/nedit-ng/pull/355

marilmanen commented 5 months ago

I tested with a couple of large files and saw no issues, so it looks like you have fixed it. Great!