Closed dangoodman closed 6 years ago
This seems like a good idea, but I think it doesn't go far enough. May I suggest something like this for escaping column values:
function escapeLogValue($input) {
$regex = '/[[:^graph:]]/';
//preg_replace_callback will return NULL if the input contains invalid Unicode sequences,
//so only enable the Unicode flag if the input encoding looks valid.
if ( mb_check_encoding($input, 'UTF-8') ) {
$regex = $regex . 'u';
}
$result = str_replace('\\', '\\\\', $input);
$result = preg_replace_callback(
$regex,
function($matches) {
$length = strlen($matches[0]);
$escaped = '';
for($i = 0; $i < $length; $i++) {
//Convert the character to a hexadecimal escape sequence.
$hexCode = dechex(ord($matches[0][$i]));
$escaped .= '\x' . strtoupper(str_pad($hexCode, 2, '0', STR_PAD_LEFT));
}
return $escaped;
},
$result
);
return $result;
}
This should escape not only spaces and line breaks but also all non-printable characters, NULL-bytes, and invalid Unicode sequences.
As for padding, I think it would be better to just move the code that adds padding so that it runs after filterLogInfo
/escapeLogInfo
. Adding, removing and then re-adding padding seems inelegant.
I was looking to the [:print:]
character class but then skipped it since it works differently on windows and linux. Namely, on windows the Tab (\t)
character is considered as printable, while on linux it's non-printable.
The [:graph:]
class doesn't seem to have that issue as per my tests. Looks good to me.
As of the unicode-related stuff, to be honest my brain isn't big enough to keep all the details about it required to handle it correctly and safely. All those invalid sequences, 5 and 6 byte length chars, normalization forms, etc. So it's ok to me if it's ok to you ;)
Agree about re-adding padding. My intention was to keep escaping the last step in the chain to make sure no breaking chars pass by. If we add a protected method, say, formatLogIngo()
, supposed to pad fields or decorate them in other way, somebody will forget about the requirement to produce output-safe values.
I don't know much about Unicode either. That's why I'm relying on PCRE's Unicode support and mbstring
instead of trying to implement more advanced validation. Presumably/hopefully, the people who wrote those extensions know this stuff better than we do.
(That reminds me: we should probably check if mb_check_encoding
exists before calling it because mbstring
is apparently a non-default extension.)
If we add a protected method, say,
formatLogIngo()
, supposed to pad fields or decorate them in other way, somebody will forget about the requirement to produce output-safe values.
Hmm, I suppose that's possible. I would keep the padding code in logRequest
for now and just move it down. For example (untested code):
$columns = $this->filterLogInfo($columns, $request);
$columns = $this->escapeLogInfo($columns);
if ( isset($columns['ip'] ) {
$columns['ip'] = str_pad($columns['ip'], 15, ' ');
}
if ( isset($columns['http_method'] ) {
$columns['http_method'] = str_pad($columns['http_method'], 4, ' ');
}
Here it is.
This escapes special chars in logged data to avoid breaking the log file with something like
?installed_version=%0D%0A%09+
in url. The purpose of the added methodescapeLogInfo()
is similar tomysql[_real]_escape_string()
,htmlspecialchars()
,escapeshellarg()
, etc.