ThomasDickey / original-mawk

bug-reports for mawk (originally on GoogleCode)

http://invisible-island.net/mawk/mawk.html

Memory leak processing large number of files? #44

Closed kmatt closed 7 years ago

kmatt commented 7 years ago

This example script combines a few thousand source data files into a few dozen, based on a date column within each file:

BEGIN { FS="-"; OFS="," }
{
    gsub(/\|/, ",", $7)  # pipe to csv
    gsub(/,[^0-9]|,$/, ",0", $7)  # null measures to zero
    print $1"-"$2"-"$3" "$4":00", $5, $6, $7 >> "out_" $1 $2 $3 ".csv"
}

It is called with mawk -f crush.awk *.dat. However, memory usage for the Mawk process climbs continuously until the script finishes or RAM is exhausted (128GB in one case), with a maximum of 31 open files (one month's worth of source dates).

Is there an issue with the script, or is Mawk 1.3.4 leaking memory? I don't see this behavior with Gawk 4, but Mawk is much faster for this type of job.

kmatt commented 7 years ago

fflush() for each line did not show any noticeable memory savings.

Using the find command to invoke the Mawk interpreter once per file addresses the memory consumption, but it would be instructive to understand why memory usage differs between Mawk and Gawk in this case.
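For readers who want to try the per-file workaround, here is a minimal sketch. It assumes the script is saved as crush.awk, as in the original invocation; the scratch directory, the AWK variable, and the two sample .dat lines are made up for illustration. The output file name is parenthesized only for portability across awk implementations.

```shell
#!/bin/sh
# Sketch: run awk once per input file, so no single process sees the whole file list.
set -eu
AWK=${AWK:-awk}            # assumption: set AWK=mawk to exercise mawk specifically
workdir=$(mktemp -d)       # hypothetical scratch directory
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# The script from the report, with the output name parenthesized for portability.
cat > crush.awk <<'EOF'
BEGIN { FS = "-"; OFS = "," }
{
    gsub(/\|/, ",", $7)             # pipe to csv
    gsub(/,[^0-9]|,$/, ",0", $7)    # null measures to zero
    print $1"-"$2"-"$3" "$4":00", $5, $6, $7 >> ("out_" $1 $2 $3 ".csv")
}
EOF

# Two made-up input files that share one date.
printf '%s\n' '2016-09-01-13-siteA-s1-1.5|2|' > a.dat
printf '%s\n' '2016-09-01-14-siteB-s2-3|4'   > b.dat

# One awk process per file; ">>" lets each run append to the shared output.
find . -name '*.dat' -exec "$AWK" -f crush.awk {} \;

# find does not guarantee an order, so sort before displaying.
sort out_20160901.csv
```

Because each process exits after one file, whatever memory it accumulated is returned to the system. The cost is one process start per input file.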

ThomasDickey commented 7 years ago

Offhand, issue #20 is the only interesting one with memory problems that I recall. But that is a buffer-management issue (no apparent relationship to the number of files).

Is this something that can be scaled down, e.g., to see the difference in handling a few files of a megabyte or so versus a few dozen?

kmatt commented 7 years ago

> Is this something that can be scaled down, e.g., to see the difference in handling a few files of a megabyte or so versus a few dozen?

I am currently feeding Mawk (1.3.4 20160615) via GNU Parallel, thus one Mawk "run" per source file, and this keeps memory manageable.

It seems that when given a continuous stream via cat, or a very large list of files, memory grows unchecked.

kmatt commented 7 years ago

From another thread:

mawk keeps memory around until it isn't "used" (by keeping reference counts)

Is it safe to say, then, that the issue I am seeing is not a memory leak but an expected effect of how Mawk manages memory? (See #20.)

roscoewuce commented 7 years ago

It is a memory leak caused by a 1.3.4 coding bozon. Your program will execute without leak if you use mawk 1.3.3 or https://github.com/mikebrennan000/mawk-2

kmatt commented 7 years ago

@roscoewuce "coding bozon" is not a term I am familiar with or can find a definition for. Explain?

ThomasDickey commented 7 years ago

That's jargon used by Mike Brennan (which you might notice in comments in the code). Offhand, if you just s/bozo/blunder/ that might be close enough.

mikebrennan000 commented 7 years ago

$ (ulimit -v 100000 ; mawk134 'BEGIN{while(1) printf "" > 325}' )
mawk134: run time error: out of memory
        FILENAME="" FNR=0 NR=0

Above is a simpler program that runs out of memory for the same reason.
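As background on why that one-liner only ever has a single output file: in awk, the first write with `>` to a given name opens (and truncates) the file, and later writes to the same name reuse the already-open stream until close() is called. The sketch below illustrates that standard behavior with a bounded loop. It does not reproduce the leak, and the AWK and out variable names are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: repeated "print > name" inside one awk program writes to one open stream.
set -eu
AWK=${AWK:-awk}        # assumption: set AWK=mawk to exercise mawk specifically
out=$(mktemp)          # hypothetical scratch file
trap 'rm -f "$out"' EXIT

"$AWK" -v out="$out" 'BEGIN {
    # The file is truncated once, on the first write; later writes append to it.
    for (i = 1; i <= 3; i++)
        print "line " i > out
    close(out)         # explicitly release the stream
}'

cat "$out"
```

All three lines end up in the file, which shows that the loop in the reproduction keeps returning to the same output stream rather than opening new files.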

ThomasDickey commented 7 years ago

ok - just another mawk 1.3.3 bug not yet addressed. (If it had been due to 1.3.4.x changes, I would point out which snapshot introduced the problem, and provide a workaround).

mikebrennan000 commented 7 years ago

$ cat bz.awk
BEGIN {
    x = ARGV[1]
    while(x-- > 0) printf "" > 134
    print ARGV[0], ARGV[1]
}

$ ( ulimit -v 100000
mawk133 -f bz.awk 10000000
mawk134 -f bz.awk 10000000 )
mawk133 10000000
mawk134: run time error: out of memory
        FILENAME="" FNR=0 NR=0

I'm glad you appreciate the work I've done for you by reducing this to a small program. You're welcome.

ThomasDickey commented 7 years ago

This one I can fix. The previous one is, as I pointed out, an issue which is in 1.3.3 (and both are cases different from the issue originally reported).

mikebrennan000 commented 7 years ago

Are you now talking about bug #20 ?

ThomasDickey commented 7 years ago

I uploaded a fix for the file-opening issue you pointed out; it might be what kmatt is seeing.

kmatt commented 7 years ago

Is either v1.3.x or 1.9.x/2.x development hosted in a public repo? Currently all I see are the release bundles. I am interested in seeing the individual commits for these changes, for edification.

ThomasDickey commented 7 years ago

This might help: https://github.com/ThomasDickey/mawk-snapshots

mikebrennan000 commented 7 years ago

Kmatt, for this bug, there were no changes in 1.9.9.x, as it derives from 1.3.3, which never had the bug.

ThomasDickey commented 7 years ago

http://invisible-island.net/mawk/CHANGES.html#t20160905