ZOSOpenTools / utils

Deprecated - Moved to https://github.com/ZOSOpenTools/meta
https://github.com/ZOSOpenTools/meta
Apache License 2.0
3 stars 2 forks source link

Only use regex for files with spaces in the name #90

Closed dougburns closed 1 year ago

dougburns commented 1 year ago

From the comments in https://github.com/ZOSOpenTools/utils/commit/28f4dee93f0e1156f54e168e56a2537a66015227#commitcomment-87668713%3E the regex substitution introduced by that change causes performance problems on some ports (presumably based on the number of files that need tagging). This change uses the old (simpler) mechanism for files without a space, and just resorts to the regex method when the filename does have a space. I did try some alternatives using substrings and indexes, but I wasn't convinced that they would handle all cases.

dougburns commented 1 year ago

@MikeFultonDev can you try this with php and see if it solves the performance issue?

MikeFultonDev commented 1 year ago

Another thought... See if the new 'gawkport' does any better? I'm ok with you checking this in if you want too - I expect it is faster than the previous version.

dougburns commented 1 year ago

There's not much point in merging the current change if it doesn't help much. I'll set up a meaningful test so that I can measure the performance of this chain of commands. I'm not sure that awk is the issue here. I suspect that it is the xargs in the middle of the chain. If I'm reading the manual correctly, xargs with -I will create a new set of processes for the remainder of the pipe for every argument it receives, so a large number of files will result in a large number of processes. Ideally we want a pipe that handles all the files with a single set of processes.

dougburns commented 1 year ago

Finally got round to doing some real testing on this. I wrote test scripts to set up a directory with a variety of tagged and untagged files, including files and subdirs containing spaces. The script also copies in the php source tree to give it a lot of work to do. It then executes a command similar to the file tagging from zopen-build, but it writes the untagged filenames to an output file. The time taken to run this command is measured (to the nearest second).

Tested commands:

(cd "tagTestDir" && find . ! -type d ! -type l | xargs -I {} chtag -qp {} | awk '{ if ($1 == "-") { sub(/^[ ]*([^ ]+ +){3}/, ""); print $0; }}' | xargs -I {} echo {} >>$OUTFILE)
(cd "tagTestDir" && find . ! -type d ! -type l | xargs -I {} chtag -qp {} | awk '{ if ($1 == "-") { if (NF == 4) { print $4 } else { sub(/^[ ]*([^ ]+ +){3}/, ""); print $0; }}}' | xargs -I {} echo {} >>$OUTFILE)
(cd "tagTestDir" && chtag -qpR . | awk '{ if ($1 == "-") { sub(/^[ ]*([^ ]+ +){3}/, ""); print $0; }}' | xargs -I {} echo {} >>$OUTFILE)
(cd "tagTestDir" && chtag -qpR . | awk '{ if ($1 == "-") { if (NF == 4) { print $4 } else { sub(/^[ ]*([^ ]+ +){3}/, ""); print $0; }}}' | xargs -I {} echo {} >>$OUTFILE)

The timings (in the order above) were 160s, 97s, 1s, 1s.

I think this shows that the original change performs a little better, but the xargs is the real problem and getting rid of it sorts the performance problem. All 4 tests produce the same list of files in the output, so the recursive chtag seems to do what we need. The one minor down-side that I've found is that chtag reports a warning if it comes across a symbolic link where the target doesn't exist. Processing continues, but it would clutter up the log if any of the projects have broken links (hopefully not common!).

Also tested zopen build for m4port, makeport and perlport to check it doesn't break builds.