Genivia / ugrep

NEW ugrep 7.1: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

ugrep -H --xml doesn't escape XML special chars in //file/@name attribute #430

Closed by stephentalley 1 month ago

stephentalley commented 1 month ago

When ugrep creates XML output, file names with XML special characters cause the XML to be invalid:

% file=/tmp/'foo & bar'.txt
% echo hello > "$file"

% ugrep -H --xml hello "$file"
<grep>
    <file name="/tmp/foo & bar.txt">
        <match>hello</match>
    </file>
</grep>

% ugrep -H --xml hello "$file" | xmllint --format -
-:2: parser error : xmlParseEntityRef: no name
<file name="/tmp/foo & bar.txt">

These characters should be escaped when part of the file name:

"   &quot;
'   &apos;
<   &lt;
>   &gt;
&   &amp;

This, for example, would be valid XML:

<grep>
    <file name="/tmp/foo &amp; bar.txt">
        <match>hello</match>
    </file>
</grep>
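
For illustration, the escaping described above can be sketched in a few lines of Python. This is only a sketch of the idea, not ugrep's implementation: `escape_attr` and `file_element` are hypothetical helper names, and the only library calls used are `xml.sax.saxutils.escape` and `xml.etree.ElementTree` from the standard library.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

# escape() always handles &, < and >; the two quote characters are passed as extras.
ATTR_ENTITIES = {'"': "&quot;", "'": "&apos;"}


def escape_attr(value: str) -> str:
    """Escape the five XML special characters for use inside an attribute value."""
    return escape(value, ATTR_ENTITIES)


def file_element(path: str, match: str) -> str:
    """Build a document shaped like the output in this report (hypothetical helper)."""
    return (
        "<grep>\n"
        f'    <file name="{escape_attr(path)}">\n'
        f"        <match>{escape(match)}</match>\n"
        "    </file>\n"
        "</grep>\n"
    )


if __name__ == "__main__":
    doc = file_element("/tmp/foo & bar.txt", "hello")
    print(doc)
    # A conforming parser accepts the document and decodes the entity
    # back to the original file name.
    root = ET.fromstring(doc)
    print(root.find("file").get("name"))
```

Parsing the result back with `ET.fromstring` is a cheap round-trip check: if the escaping is right, the decoded `name` attribute equals the original path.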

Not sure if attributes other than name should be considered as well?

Thanks again for your work on this tool!

genivia-inc commented 1 month ago

OK, but it makes little to no sense to encode these characters as entities in the XML output, because the general consensus is to avoid them at all costs in filenames (pathnames) to prevent tool and portability problems. Some tools and operating systems forbid them in pathnames. As such, I never deemed it necessary to treat these as special in the XML output.

See also: https://en.wikipedia.org/wiki/Filename

stephentalley commented 1 month ago

Users don't always control the names of the files they need to grep.

But more importantly, if it is a legal file name for the file system, then it should probably be supported by the tool.

genivia-inc commented 1 month ago

IMO only the & and the quote " should be escaped in XML attributes. XML is forgiving when < and > are used in attributes, which rarely if ever leads to interoperability issues as there aren't any tags in XML attributes.
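
For reference, how a strict parser treats each character can be checked directly. The sketch below uses Python's standard `xml.etree.ElementTree` (expat-based); `parses` is a hypothetical helper name, and the result says nothing about how lenient other XML tools may be. Note that the XML 1.0 grammar excludes a literal < (as well as a bare &) from attribute values, while > is permitted.

```python
import xml.etree.ElementTree as ET


def parses(attr_value: str) -> bool:
    """Return True if the parser accepts attr_value verbatim inside name="..."."""
    try:
        ET.fromstring(f'<file name="{attr_value}"/>')
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    samples = ["a & b", "a < b", "a > b", 'a " b', "a ' b", "a &amp; b", "a &lt; b"]
    for raw in samples:
        # Report whether the raw text is well-formed as a double-quoted attribute value.
        print(f"{raw!r}: {'ok' if parses(raw) else 'rejected'}")
```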

genivia-inc commented 1 month ago

OK, I've added new fields %i and %I in the upcoming ugrep release to output pathnames in XML.

The --xml output format will use %I instead of %H, as follows:

--format-begin='<grep>%~'
--format-open='  <file%["]$%[ name="]I>%~'
--format='    <match%["]$%[ line="]N%[ column="]K%[ offset="]B>%X</match>%~%u'
--format-close='  </file>%~'
--format-end='</grep>%~'