Open wguest-rsol opened 3 years ago
Thanks for reporting this and offering such a thorough report. We have had a lot of trouble with DELIMS as Splunk's CSV parsing doesn't follow RFC 4180. I appreciate your suggested solutions. Perhaps we could extract everything else with DELIMS and extract just the misc field with a regex instead. We'll look into this using the sample events you shared.
I need to bring this back up. I don't think Splunk's DELIMS is intended to be a full-featured CSV parser, and it doesn't claim to be one. The Splunk spec file for transforms states that DELIMS splits on a delimiter and that escape characters are backslashes. It specifically states that if another escape character is needed, you shouldn't use DELIMS. From the spec file: "* IMPORTANT: If a value may contain an embedded unescaped double quote character, such as "foo"bar", use REGEX, not DELIMS. An escaped double quote (\") is ok. Non-ASCII delimiters also require the use of REGEX."
So although DELIMS doesn't follow RFC 4180 for CSV, that seems beside the point, since it doesn't claim RFC compliance or even CSV support specifically. There are serious implications to this mismatch: some log messages are completely misinterpreted.
This is what I'm coming up with for now.
[pan_log]
SEDCMD-pan_sane_quotes10 = s/(?<=,)""(?=,)/REQUOTEME/g
SEDCMD-pan_sane_quotes20 = s/\\(?=[",\\])/\\\\/g
SEDCMD-pan_sane_quotes30 = s/""/\\"/g
SEDCMD-pan_sane_quotes40 = s/REQUOTEME/""/g
Step one: find any ,"", and replace the "" with REQUOTEME.
Step two: find any \ followed by a " (boundary), a , (delimiter), or a \ (another backslash) and escape it with what Splunk considers an escape character (a backslash).
Step three: find any remaining "" (in this case, ones that were not surrounded by commas) and escape them.
Step four: put the "" back wherever REQUOTEME is holding our place.
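To make the four steps easier to reason about, here is a minimal Python sketch that replays the same substitutions with `re.sub`. It assumes the rules run in the order their numeric suffixes suggest, and the `sample` fragment is hypothetical, not taken from a real log. It only models the text rewriting, not Splunk itself.

```python
import re

# (pattern, replacement) pairs mirroring the four SEDCMD rules, in suffix order.
SED_STEPS = [
    (r'(?<=,)""(?=,)', 'REQUOTEME'),  # 10: park empty quoted fields (,"",)
    (r'\\(?=[",\\])', r'\\\\'),       # 20: double a backslash before " , or \
    (r'""', r'\\"'),                  # 30: RFC 4180 "" becomes \"
    (r'REQUOTEME', '""'),             # 40: restore the parked empty fields
]


def sanitize(raw: str) -> str:
    """Apply the substitutions in order, like a chain of sed s///g commands."""
    for pattern, replacement in SED_STEPS:
        raw = re.sub(pattern, replacement, raw)
    return raw


# Hypothetical fragment: an empty field, a path ending in a backslash,
# and a value containing RFC 4180 doubled quotes.
sample = r'a,"",b,"C:\Users\",c,"say ""hi""",d'
print(sanitize(sample))
```

With this input, the empty `""` field survives untouched, the trailing backslash of the path is doubled, and each doubled quote inside the last value becomes `\"`.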
Let me know if this can be improved or if there's another better solution.
@lumpymilk, I have the same issue. Thank you for the proposed solution! But can you explain it a bit to me?
I see that in my case this issue happens only when the misc field ends with \". The presence of an unescaped double quote in its value doesn't cause this parsing issue, despite the information about DELIMS in the spec file. Maybe Splunk Cloud (version 9.0.2303.202, with Splunk_TA_paloalto 7.1.0) behaves a bit differently, or something else in our setup helps to avoid this issue. I don't know for sure.
Also, I'm not sure I fully understand the following modifications and would appreciate it if you could explain them:
Why does SEDCMD-pan_sane_quotes20 not only replace \", with \\", but also check for , and \? I mean, why not something like SEDCMD-pan_sane_quotes20 = s/\\(?=",)/\\\\/g?
Why replace "" with \" instead of \"\" in your SEDCMD-pan_sane_quotes30?
Thank you!
"The presence of an unescaped double quote in its value doesn't cause the parsing issue" So far, I haven't seen an unescaped double quote. That is, pan is very consistent with escaping " according to RFC 4180, where a literal " must be preceded by another ", making "". I think the confusing part for us is that we are so used to seeing \ as an escape character that when we look at the pan data in RFC 4180 CSV, it looks like there are escapes where there really aren't.
On SEDCMD-pan_sane_quotes20: the assumption is that, according to RFC 4180, the backslash is NEVER escaping anything at all and is just literally a backslash. The following hypothetical segment can occur: ,"data\ending\with\two\backslashes\\",5,10,"domain\user", As you can see, the first populated column is "data\ending\with\two\backslashes\\" and the last populated field is "domain\user". s/\\(?=",)/\\\\/g doubles only the backslash directly in front of the closing quote and will give the following result: ,"data\ending\with\two\backslashes\\\",5,10,"domain\user", and then DELIMS would see that first field as "data\ending\with\two\backslashes\\\",5,10," because the closing quote on that first field is now escaped (according to DELIMS).
SEDCMD-pan_sane_quotes20 would convert the string with this result: ,"data\ending\with\two\backslashes\\\\",5,10,"domain\user", In this case the first field remains "data\ending\with\two\backslashes\\\\", which looks weird in raw but looks just right in the field rendering.
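The difference between the narrower pattern and the posted one can be checked with a short Python sketch. It assumes the hypothetical segment's first field ends in two literal backslashes, and it treats an odd run of backslashes before a quote as "escaped", which is only a simplified reading of backslash escaping and not Splunk's actual implementation. The names `NARROW`, `FULL`, and `closing_quote_is_escaped` are made up for this illustration.

```python
import re

NARROW = r'\\(?=",)'    # only a backslash directly before ",
FULL = r'\\(?=[",\\])'  # pan_sane_quotes20 as posted: before " or , or \

# Hypothetical segment; the first field ends in two literal backslashes.
segment = r',"data\ending\with\two\backslashes\\",5,10,"domain\user",'

narrow_out = re.sub(NARROW, r'\\\\', segment)
full_out = re.sub(FULL, r'\\\\', segment)


def closing_quote_is_escaped(text: str) -> bool:
    """Count the backslashes right before the quote that should close the
    first field; an odd count means a backslash-escape reader sees \\"."""
    quote_index = text.index('",5')
    count = 0
    i = quote_index - 1
    while i >= 0 and text[i] == '\\':
        count += 1
        i -= 1
    return count % 2 == 1


print(narrow_out, closing_quote_is_escaped(narrow_out))
print(full_out, closing_quote_is_escaped(full_out))
```

The narrow pattern leaves three backslashes before the closing quote (an odd run), while the posted pattern leaves four, so only the latter keeps the quote unescaped under this reading.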
For pan_sane_quotes30, the reason I don't replace "" with \"\" is that pan is following RFC 4180 in their raw log. So according to pan, "" is already a literal ". All I am doing is translating that for DELIMS, which always expects \ as the escape character. So RFC 4180 "" == \" in DELIMS.
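As a quick illustration of the quote-doubling convention, Python's standard `csv` module can serve as a stand-in reader. Its default dialect doubles quotes and has no escape character, which matches the behavior described here, though it is not a strict RFC 4180 implementation. The line below is a hypothetical fragment, not a real event.

```python
import csv
import io

# Hypothetical fragment: a path ending in a backslash and a value with "" inside.
line = r'alert,"C:\Users\user1\Documents\",Microsoft Word,"say ""hi"""'

# Default dialect: delimiter ',', quotechar '"', doublequote=True, no escapechar.
fields = next(csv.reader(io.StringIO(line)))

for field in fields:
    print(field)
```

Read this way, the backslashes are ordinary characters, the path field ends at its closing quote, and `""` collapses to a single literal `"`.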
Let me know what you think and thanks for sharing your perspective.
Thank you for the explanations!
I see now that SEDCMD-pan_sane_quotes20 should help to avoid any issues with DELIMS if something like \\, \" or \, occurs in the value of a field. And SEDCMD-pan_sane_quotes30 is completely clear too.
Describe the bug
The "Palo Alto Networks Add-on for Splunk" does not correctly parse certain logs for the pan:threat sourcetype due to DELIMS behavior. The [extract_threat] stanza in transforms.conf does not work with certain values in the "misc" column.
Expected behavior
The add-on should complete the extraction of pan:threat events as long as the values within the columns are valid and expected.
Current behavior
The "misc" column/field for the pan:threat sourcetype can contain a Windows directory path. When the value for misc ends with a backslash, such as "C:\Users\", parsing of the remaining fields breaks. This is because the trailing backslash escapes the quotation mark that is supposed to end the field, so Splunk's DELIMS extraction fails for everything after it. For example, sequence_number, severity, direction, etc. are not present when searching events.
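The failure mode can be reproduced with a toy splitter. This is a minimal sketch that assumes a backslash escapes a following quote or backslash; it is not Splunk's DELIMS implementation, and `split_backslash_escaped` is a hypothetical helper written only for this illustration.

```python
def split_backslash_escaped(line, delim=',', quote='"'):
    """Split on delim outside quotes, treating \\" and \\\\ as escapes."""
    fields, buf, in_quotes = [], [], False
    i = 0
    while i < len(line):
        ch = line[i]
        nxt = line[i + 1] if i + 1 < len(line) else ''
        if ch == '\\' and nxt in (quote, '\\'):
            buf.append(nxt)  # escaped character is kept literally
            i += 2
            continue
        if ch == quote:
            in_quotes = not in_quotes  # an unescaped quote toggles quoting
        elif ch == delim and not in_quotes:
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    fields.append(''.join(buf))
    return fields


# Hypothetical fragments: same columns, with and without a trailing backslash.
broken = split_backslash_escaped(r'alert,"C:\Users\",Microsoft Word,low,client-to-server')
intact = split_backslash_escaped(r'alert,"C:\Users",Microsoft Word,low,client-to-server')
print(len(broken), len(intact))
```

In this model the trailing backslash swallows the closing quote, so every later column is absorbed into the misc value and the field count collapses.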
Possible solution
Steps to reproduce
Jan 7 09:58:26 192.168.1.10 <13>Jan 7 09:58:26 PA01.company.com 1,2021/01/07 09:58:25,000000000001,THREAT,file,2049,2021/01/07 09:58:25,10.0.0.47,1.1.1.1,0.0.0.0,0.0.0.0,External Traffic Outbound,production\user1,,web-browsing,vsys1,Drop_Extranet2lan,Drop_Extranet2fw,ethernet1/6,ethernet1/5,SplunkForwarder,2021/01/07 09:58:25,37500209,1,57931,443,0,0,0x1102000,tcp,alert,"C:\Users\user1\OneDrive - Company\Documents\Studio 2019\",Microsoft Word 2007 DOCX File(52022),computer-and-internet-info,low,client-to-server,6885125841316127911,0x2000000000000000,10.0.0.0-10.255.255.255,Canada,0,,0,,,4,,,,,,,,0,19,0,0,0,,PA01,company.com/files/270945/big_step.js,,,,0,,0,,N/A,unknown,AppThreat-8362-6491,0x0,0,4294967295,company.com/files/270945/big_step.js
Jan 7 09:58:26 192.168.1.10 <13>Jan 7 09:58:26 PA01.company.com 1,2021/01/07 09:58:25,000000000001,THREAT,file,2049,2021/01/07 09:58:25,10.0.0.47,1.1.1.1,0.0.0.0,0.0.0.0,External Traffic Outbound,production\user2,,web-browsing,vsys1,Drop_Extranet2lan,Drop_Extranet2fw,ethernet1/6,ethernet1/5,SplunkForwarder,2021/01/07 09:58:25,37500209,2,57931,443,0,0,0x1102000,tcp,alert,"C:\Users\user2\OneDrive - Company\Documents\Studio 2019\",Microsoft Word 2007 DOCX File(52140),computer-and-internet-info,low,client-to-server,6885125841316127910,0x2000000000000000,10.0.0.0-10.255.255.255,Canada,0,,0,,,4,,,,,,,,0,19,0,0,0,,PA01,company.com/files/270945/big_step.js,,,,0,,0,,N/A,unknown,AppThreat-8362-6491,0x0,0,4294967295,company.com/files/270945/big_step.js
Jan 7 09:58:26 192.168.1.10 <13>Jan 7 09:58:26 PA01.company.com 1,2021/01/07 09:58:25,000000000001,THREAT,file,2049,2021/01/07 09:58:25,10.0.0.47,1.1.1.1,0.0.0.0,0.0.0.0,External Traffic Outbound,production\user1,,web-browsing,vsys1,Drop_Extranet2lan,Drop_Extranet2fw,ethernet1/6,ethernet1/5,SplunkForwarder,2021/01/07 09:58:25,37500209,2,57931,443,0,0,0x1102000,tcp,alert,"C:\Users\user1\OneDrive - Company\Documents\Studio 2019\",Microsoft MSOFFICE(52033),computer-and-internet-info,low,client-to-server,6885125841316127909,0x2000000000000000,10.0.0.0-10.255.255.255,Canada,0,,0,,,4,,,,,,,,0,19,0,0,0,,PA01,company.com/files/270945/big_step.js,,,,0,,0,,N/A,unknown,AppThreat-8362-6491,0x0,0,4294967295,company.com/files/270945/big_step.js
Jan 7 09:58:26 192.168.1.10 <13>Jan 7 09:58:26 PA01.company.com 1,2021/01/07 09:58:25,000000000001,THREAT,file,2049,2021/01/07 09:58:25,10.0.0.47,1.1.1.1,0.0.0.0,0.0.0.0,External Traffic Outbound,production\user2,,web-browsing,vsys1,Drop_Extranet2lan,Drop_Extranet2fw,ethernet1/6,ethernet1/5,SplunkForwarder,2021/01/07 09:58:25,37500209,11,57931,443,0,0,0x1102000,tcp,alert,"C:\Users\user2\OneDrive - Company\Documents\Studio 2019\",ZIP(52004),computer-and-internet-info,low,client-to-server,6885125841316127908,0x2000000000000000,10.0.0.0-10.255.255.255,Canada,0,,0,,,4,,,,,,,,0,19,0,0,0,,PA01,company.com/files/270945/big_step.js,,,,0,,0,,N/A,unknown,AppThreat-8362-6491,0x0,0,4294967295,company.com/files/270945/big_step.js
Screenshots
N/A
Context
Doing sequence number analysis is impacted as this field is not extracted for all events.
Your Environment
Confirmed this issue exists in two environments: