Cyb3rWard0g opened this issue 5 years ago
I was able to access the Spark executor error logs, and I got more information:
19/06/21 02:51:26 ERROR TaskContextImpl: Error in TaskCompletionListener
java.lang.IllegalArgumentException: Failed to parse filter: {"bool":{"should":[{"match":{"module_loaded":"c:\windows\system32\samlib.dll c:\windows\system32\hid.lab"}}]}}
at org.elasticsearch.hadoop.rest.query.QueryUtils.parseFilters(QueryUtils.java:74)
at org.elasticsearch.hadoop.rest.RestService.createReader(RestService.java:453)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.reader$lzycompute(AbstractEsRDDIterator.scala:49)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.reader(AbstractEsRDDIterator.scala:42)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.close(AbstractEsRDDIterator.scala:81)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.closeIfNeeded(AbstractEsRDDIterator.scala:74)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply$mcV$sp(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator$$anonfun$1.apply(AbstractEsRDDIterator.scala:54)
at org.elasticsearch.spark.rdd.CompatUtils$1.onTaskCompletion(CompatUtils.java:112)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:117)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:117)
at org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:130)
at org.apache.spark.TaskContextImpl$$anonfun$invokeListeners$1.apply(TaskContextImpl.scala:128)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:128)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.jackson.JsonParseException: Unrecognized character escape 'w' (code 119)
at [Source: java.io.StringReader@63fe6467; line: 1, column: 51]
It seems to be more of a syntax issue.
If I escape each backslash with another backslash (two in total, not three as in my initial queries above):
module_loaded = spark.sql(
'''
SELECT event_id,
host_name,
process_name,
module_loaded
FROM sysmon_events
WHERE event_id = 7
AND module_loaded IN ("c:\\windows\\system32\\samlib.dll","c:\\windows\\system32\\hid.lab")
'''
)
I get the following logical plan. As you can see, the backslashes are dropped and the module path is shown as one long string with all the path segments run together:
== Parsed Logical Plan ==
'Project ['event_id, 'host_name, 'process_name, 'module_loaded]
+- 'Filter (('event_id = 7) && 'module_loaded IN (c:windowssystem32samlib.dll,c:windowssystem32hid.lab))
+- 'UnresolvedRelation `sysmon_events`
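A quick plain-Python check shows the first layer of unescaping (an illustrative snippet, not from my notebook):

# Illustrative only: Python's own parser turns each doubled backslash into one.
fragment = 'AND module_loaded IN ("c:\\windows\\system32\\samlib.dll")'
print(fragment)
# AND module_loaded IN ("c:\windows\system32\samlib.dll")
# Spark's SQL string-literal parser then consumes those remaining single
# backslashes as escapes, which is why the plan shows
# c:windowssystem32samlib.dll with no backslashes left.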
@Cyb3rWard0g it's tricky and painful, and not related to Elasticsearch.
Since Spark and Elasticsearch are written in Scala and Java respectively (both being JVM languages), the PySpark code first gets converted to Scala code. Now, in Java (and Scala) "\" is an escape character, so if you were writing the last code directly in Scala or Java, i.e.
module_loaded = spark.sql(
'''
SELECT event_id,
       host_name,
       process_name,
       module_loaded
FROM sysmon_events
WHERE event_id = 7
AND module_loaded IN ("c:\\windows\\system32\\samlib.dll","c:\\windows\\system32\\hid.lab")
'''
)
then it would have resulted in
+- 'Filter (('event_id = 7) && 'module_loaded IN (c:\windows\system32\samlib.dll,c:\windows\system32\hid.lab))
as the first "\" will be consumed in escaping the second one.
However, things get interesting when the code gets converted from Python to Scala. It works this way.
The input characters (taking just a fragment of the code) handed to Scala would be:
AND module_loaded IN ("c:\\windows\\system32\\samlib.dll","c:\\windows\\system32\\hid.lab")
which, when converted to Scala code (i.e. when the Scala code is generated), will become (considering the explanation above about the first "\" being an escape character):
AND module_loaded IN ("c:\windows\system32\samlib.dll","c:\windows\system32\hid.lab")
which, when converted to Java bytecode, will become:
AND module_loaded IN ("c:windowssystem32samlib.dll","c:windowssystem32hid.lab")
and that's what is happening.
So if the end result you want is a single "\", use four "\" in your Python code; and if you want the end result to have two "\" (i.e. "\\"), use eight "\".
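For example, applied to the query from above (a minimal sketch, assuming the same sysmon_events view; only the number of backslashes changes):

module_loaded = spark.sql(
    '''
    SELECT event_id,
           host_name,
           process_name,
           module_loaded
    FROM sysmon_events
    WHERE event_id = 7
    AND module_loaded IN ("c:\\\\windows\\\\system32\\\\samlib.dll",
                          "c:\\\\windows\\\\system32\\\\hid.lab")
    '''
)
# Python keeps two of the four backslashes, Spark's SQL parser keeps one,
# so the filter that is pushed down contains the literal single "\".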
Hope it helps!!
@Cyb3rWard0g Also, if possible, rather than storing paths with "\" in Elasticsearch, store them with forward slashes, unless there is a stringent requirement to have them with "\".
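Something along these lines at ingest (or query) time would do it (a sketch; the helper is hypothetical, not part of any library):

# Hypothetical helper: normalize Windows paths to forward slashes before
# indexing, so no SQL-level backslash escaping is ever needed.
def normalize_path(p: str) -> str:
    return p.replace("\\", "/").lower()

print(normalize_path("c:\\windows\\system32\\samlib.dll"))
# c:/windows/system32/samlib.dll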
I have a similar problem with Scala code.
Elasticsearch version (bin/elasticsearch --version): 6.3.1
Plugins installed: []
JVM version (java -version): 1.8
OS version (uname -a if on a Unix-like system): Linux
Description of the problem including expected versus actual behavior:
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.es.index.auto.create", "true")
  .set("spark.es.nodes", "nodes")
  .set("spark.es.port", "9200")
  .set("spark.es.resource", "index_name")
  .set("spark.es.nodes.wan.only", "true")
  .set("spark.master", "local[4]")

val essessionDataFrame = spark.sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .option("inferSchema", "true")
  .load("index_name/type")
essessionDataFrame.createOrReplaceTempView(hiveTableName)
essessionDataFrame.show()
spark.catalog.refreshTable(hiveTableName)
val sql = s"SELECT DISTINCT memid FROM $hiveTableName WHERE town_no IN ('21400','23500')"
spark.sql(sql).rdd.count()
What kind of an issue is this?
Issue description
I am trying to use a basic SQL IN statement to match a field value, module_loaded, from security event logs against a list of values.
SECURITY EVENT SAMPLE (ONE)
When I run the following query in Kibana:
event_id:7 AND process_name:"powershell.exe" AND module_loaded:*samlib.dll
I get to the event I want.
DATA SAMPLE (TWO)
When I run the following query in Kibana:
event_id:7 AND process_name:"powershell.exe" AND module_loaded:*hid.dll
I get to the event I want:
Now I want to reproduce something similar with Apache Spark SQL via PySpark. I start by initializing the SparkSession and registering a SQL table mapped to the index I am using to find the two records above.
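It looks roughly like this (a reconstruction for illustration; the node address and index pattern are placeholders, not the exact values from my environment):

from pyspark.sql import SparkSession

# Placeholders: es.nodes and the index pattern are illustrative only.
spark = SparkSession.builder \
    .appName("sysmon-hunt") \
    .config("spark.jars.packages", "org.elasticsearch:elasticsearch-hadoop:7.1.0") \
    .getOrCreate()

sysmon_df = spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "localhost") \
    .option("es.port", "9200") \
    .load("sysmon-*")

sysmon_df.createOrReplaceTempView("sysmon_events")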
Now I can run SQL queries on it. I want to use the SQL IN statement and replicate the first query I ran earlier.
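The single-value version looked something like this (a reconstruction; as noted above, my initial queries used three backslashes):

single_module = spark.sql(
    '''
    SELECT event_id,
           host_name,
           process_name,
           module_loaded
    FROM sysmon_events
    WHERE event_id = 7
    AND module_loaded IN ("c:\\\windows\\\system32\\\samlib.dll")
    '''
)
# Three backslashes collapse to a single literal "\" after Python and
# Spark's SQL parser each consume an escape layer.
single_module.show()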
That works. However, what if I want to have both values, c:\windows\system32\samlib.dll and c:\windows\system32\hid.dll, inside of the IN statement? It unfortunately fails.
Version Info
OS: Official Elasticsearch Docker Image 7.1.0 (CentOS Linux 7 (Core))
JVM: OpenJDK 1.8.0_191 (Spark)
Hadoop/Spark: 2.4.3
ES-Hadoop: 7.1.0
ES: 7.1.0