Closed: kmatt closed this issue 7 years ago
fflush() for each line did not show any noticeable memory savings.
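For reference, the flush-per-line attempt had roughly this shape (a sketch only; the actual crush.awk script and its routing logic are not shown in this thread, so the first-column key here is hypothetical):

```shell
printf 'a 1\nb 2\na 3\n' > input.txt   # toy input
awk '{
    out = $1 ".out"     # hypothetical routing key: first column
    print $0 > out
    fflush(out)         # flush after every line -- no noticeable memory savings
}' input.txt
```

Flushing only empties stdio buffers; it does not release whatever per-file or per-value state the interpreter is holding, which is consistent with it having no effect here.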
Using the find command to invoke the Mawk interpreter once per file keeps memory consumption in check, but it would be instructive to understand the difference in memory usage between Mawk and Gawk in this case.
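The find-based workaround amounts to one interpreter process per file, so all memory is returned to the OS when each process exits. An invocation of that shape (crush.awk and the *.dat pattern are assumed names from the report, not confirmed in this thread):

```shell
# One mawk process per data file; memory is freed at each process exit.
find . -maxdepth 1 -name '*.dat' -exec mawk -f crush.awk {} \;
```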
Offhand, issue #20 is the only interesting one with memory problems that I recall. But that is a buffer-management issue (no apparent relationship to the number of files).
Is this something that can be scaled down, e.g., to see the difference in handling a few files of a megabyte or so versus a few dozen?
I am currently feeding Mawk (1.3.4 20160615) via GNU Parallel, thus one Mawk "run" per source file, and this keeps memory manageable.
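The GNU Parallel setup described above would look something like this (invocation reconstructed from the description; crush.awk and *.dat are assumed names):

```shell
# One mawk run per source file via GNU Parallel, keeping memory bounded
# by the size of a single input file.
parallel mawk -f crush.awk ::: *.dat
```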
It seems that when given a continuous stream via cat, or a very large list of files, memory grows unchecked.
From another thread:
mawk keeps memory around until it isn't "used" (by keeping reference counts)
Is it safe to say that in this case the issue I am seeing is not a memory leak, but an expected effect of how Mawk manages memory? #20
It is a memory leak caused by a 1.3.4 coding bozon. Your program will execute without leak if you use mawk 1.3.3 or https://github.com/mikebrennan000/mawk-2
@roscoewuce "coding bozon" is not a term I am familiar with or can find a definition for. Explain?
That's jargon used by Mike Brennan (which you might notice in comments in the code). Offhand, if you just s/bozo/blunder/ that might be close enough.
$ (ulimit -v 100000 ; mawk134 'BEGIN{while(1) printf "" > 325}' )
mawk134: run time error: out of memory
FILENAME="" FNR=0 NR=0
Above is a simpler program that runs out of memory for the same reason.
ok - just another mawk 1.3.3 bug not yet addressed. (If it had been due to 1.3.4.x changes, I would point out which snapshot introduced the problem, and provide a workaround).
$ cat bz.awk
BEGIN { x = ARGV[1]
	while(x-- > 0) printf "" > 134
print ARGV[0], ARGV[1]
}
$ ( ulimit -v 100000
mawk133 -f bz.awk 10000000
mawk134 -f bz.awk 10000000
)
mawk133 10000000
mawk134: run time error: out of memory
FILENAME="" FNR=0 NR=0
I'm glad you appreciate the work I've done for you by reducing this to a small program. You're welcome.
This one I can fix. The previous one is, as I pointed out, an issue which is in 1.3.3 (and both are cases different from the issue originally reported).
Are you now talking about bug #20?
I uploaded a fix for the file-opening issue you pointed out; it might be what kmatt is seeing.
Is either the v1.3.x or the 1.9.x/2.x development hosted in a public repo? Currently all I see are the release bundles. I am interested in seeing the individual commits for these changes, for edification.
This might help: https://github.com/ThomasDickey/mawk-snapshots
kmatt, for this bug there were no changes needed in 1.9.9.x, as it derives from 1.3.3, which never had the bug.
This example script combines a few thousand source data files into a few dozen, based on a date column within each file:
It is called with mawk -f crush.awk *.dat. However, memory usage for the Mawk process climbs continuously until the script finishes or RAM is exhausted (128GB in one case), despite a maximum of 31 open files (one month's worth of source dates).
Is there an issue with the script, or is Mawk 1.3.4 leaking memory? I don't see this behavior with Gawk 4, but Mawk is much faster for this type of job.