SuperCowPowers / zat

Zeek Analysis Tools (ZAT): Processing and analysis of Zeek network data with Pandas, scikit-learn, Kafka and Spark
MIT License

Zeek to Parquet With Spark Gives Different Timestamp #147

Closed yksee closed 1 year ago

yksee commented 1 year ago

When I do a zeek_to_parquet_with_spark conversion on ssl.log from https://data.kitware.com/#collection/58d564478d777f0aef5d893a/folder/58ebc49a8d777f16d095fd00, the timestamp starts from 2012, 03, 16, 20, 30, 33, 873408. I followed the code from https://nbviewer.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Parquet.ipynb.

This is slightly different from what is in the ssl.log, where the ts starts from 1331901000.000000. Doing a date -d @1331901000.000000 gives Fri 16 Mar 2012 08:30:00 PM +08

brifordwylie commented 1 year ago

Hi @yksee,

Hmmm... looking at the two timestamps... the first is UTC and the second is a timezone-aware datetime... so they look the 'same', meaning the UTC one gives hour 16 and the timezone-aware one gives 8 + 08 = 16. In general, storing date/time in UTC is almost always the right thing to do...

Perhaps I'm missing something more subtle... if so please respond and we'll take a deeper look at the issue.
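For reference, the raw epoch from ssl.log rendered both in UTC and at the +08 offset can be checked directly in Python (a minimal sketch; variable names are illustrative):

```python
from datetime import datetime, timezone, timedelta

epoch = 1331901000.0  # the first 'ts' value from ssl.log

# Same instant, rendered two ways
as_utc = datetime.fromtimestamp(epoch, tz=timezone.utc)
as_plus8 = datetime.fromtimestamp(epoch, tz=timezone(timedelta(hours=8)))

print(as_utc)            # 2012-03-16 12:30:00+00:00
print(as_plus8)          # 2012-03-16 20:30:00+08:00  (matches `date -d` above)
print(as_utc == as_plus8)  # True -- identical instants
```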

yksee commented 1 year ago

Thanks for the reply. Let me elaborate with this conn.log:

#separator \x09
#set_separator ,
#empty_field (empty)
#unset_field -
#path conn
#open 2022-04-01-22-44-05
#fields ts uid id.orig_h id.orig_p id.resp_h id.resp_p proto service duration orig_bytes resp_bytes conn_state local_orig local_resp missed_bytes history orig_pkts orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
#types time string addr port addr port enum string interval count count string bool bool count string count count count count set[string]
1647504456.038126 CLZwfq2hHnVH2XKvQg 192.168.50.100 50527 20.197.71.89 443 tcp - - - - SH - - 0 F 1 40 0 0 -
1647504460.020920 CQ9qtc4cGl2PUgVkma 192.168.50.243 5353 224.0.0.251 5353 udp dns - - - S0 - - 0 D 1 89 0 0 -
1647504479.519793 Ch22k1203r7uCUhZ6a 192.168.50.100 50567 52.182.143.212 443 tcp ssl 1.235504 6226 5256 SF - - 0 ShADdaFf 12 6726 10 5668 -

I got all 3 entries with the same timestamp: 2022, 3, 17, 16, 8, 11, 806720. This is different from what I get using date -d:

Thu 17 Mar 2022 04:07:36 PM +08
Thu 17 Mar 2022 04:07:40 PM +08
Thu 17 Mar 2022 04:07:59 PM +08

brifordwylie commented 1 year ago

Thanks for the details and example conn log, this will really help me understand what's going on. I'll take a look soon and try to make a fix so that the datetimes are correct.

yksee commented 1 year ago

Thanks! For your info, log_to_dataframe seems to give the correct datetimes.

ralphholz commented 1 year ago

Just stumbled upon this issue, which is a bit of a blocker if the problem still persists. Is it confirmed and/or fixed? Otherwise, I can offer to run the conversion code on a couple of files here, too, to gather extra info.

brifordwylie commented 1 year ago

@ralphholz yeah, I let this slip through the cracks... my past self left me some notes... I think pyarrow used to have a couple of issues with timestamps which are now gone, so I can clean those up. I'll try to fit this in today...

Notes from past self:

# Note: We're going to import pyarrow but it currently has an open issue around supporting time deltas
#    - https://issues.apache.org/jira/browse/ARROW-6780 so we have to convert timedelta fields to str
# Also see:
#   - https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error
#
brifordwylie commented 1 year ago

Okay.. so just capturing some info here...

1) examples/zeek_to_parquet.py works fine...

                                              uid       id.orig_h  id.orig_p  ... resp_pkts  resp_ip_bytes tunnel_parents
ts                                                                            ...
2022-03-17 08:07:36.038126080  CLZwfq2hHnVH2XKvQg  192.168.50.100      50527  ...         0              0           None
2022-03-17 08:07:40.020920064  CQ9qtc4cGl2PUgVkma  192.168.50.243       5353  ...         0              0           None
2022-03-17 08:07:59.519793152  Ch22k1203r7uCUhZ6a  192.168.50.100      50567  ...        10           5668           None

2) examples/zeek_to_parquet_with_spark.py seems to do some weird rounding

In [10]: df
Out[10]:
                   ts                 uid       id_orig_h  id_orig_p  ... orig_ip_bytes  resp_pkts resp_ip_bytes tunnel_parents
0 2022-03-17 08:08:32  CLZwfq2hHnVH2XKvQg  192.168.50.100      50527  ...            40          0             0              -
1 2022-03-17 08:08:32  CQ9qtc4cGl2PUgVkma  192.168.50.243       5353  ...            89          0             0              -
2 2022-03-17 08:08:32  Ch22k1203r7uCUhZ6a  192.168.50.100      50567  ...          6726         10          5668              -
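For what it's worth, the identical timestamps look consistent with the 'ts' field being parsed as a 32-bit float: near 1.6e9, a float32 has a granularity of 2**7 = 128 seconds, so all three epoch values collapse onto the same representable value. A quick check (my addition, assuming numpy is available):

```python
import numpy as np
from datetime import datetime, timezone

# The three 'ts' values from the conn.log above
ts_values = [1647504456.038126, 1647504460.020920, 1647504479.519793]

# Near 1.6e9 the float32 step (ulp) is 128 seconds, so all three values
# round to the same representable float32
as_float32 = {float(np.float32(ts)) for ts in ts_values}
print(as_float32)  # {1647504512.0} -- a single value

print(datetime.fromtimestamp(1647504512.0, tz=timezone.utc))
# 2022-03-17 08:08:32+00:00 -- the rounded timestamp seen in the output above
```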
brifordwylie commented 1 year ago

Ah... found the issue... the TLDR is that we needed to read the Zeek 'ts' field as a double (instead of a float) and then convert it to a Spark timestamp type.

❯ python zeek_to_parquet_with_spark.py /Users/briford/work/zat/data/conn_date_issue.log foo.parquet
23/01/26 14:37:37 WARN Utils: Your hostname, Bosco.local resolves to a loopback address: 127.0.0.1; using 192.168.50.116 instead (on interface en0)
23/01/26 14:37:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/26 14:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/Users/briford/work/zat/data/conn_date_issue.log --> foo.parquet
+--------------------------+------------------+--------------+---------+--------------+---------+-----+-------+--------+----------+----------+----------+----------+----------+------------+--------+---------+-------------+---------+-------------+--------------+
|ts                        |uid               |id_orig_h     |id_orig_p|id_resp_h     |id_resp_p|proto|service|duration|orig_bytes|resp_bytes|conn_state|local_orig|local_resp|missed_bytes|history |orig_pkts|orig_ip_bytes|resp_pkts|resp_ip_bytes|tunnel_parents|
+--------------------------+------------------+--------------+---------+--------------+---------+-----+-------+--------+----------+----------+----------+----------+----------+------------+--------+---------+-------------+---------+-------------+--------------+
|2022-03-17 02:07:36.038126|CLZwfq2hHnVH2XKvQg|192.168.50.100|50527    |20.197.71.89  |443      |tcp  |-      |0.0     |0         |0         |SH        |null      |null      |0           |F       |1        |40           |0        |0            |-             |
|2022-03-17 02:07:40.02092 |CQ9qtc4cGl2PUgVkma|192.168.50.243|5353     |224.0.0.251   |5353     |udp  |dns    |0.0     |0         |0         |S0        |null      |null      |0           |D       |1        |89           |0        |0            |-             |
|2022-03-17 02:07:59.519793|Ch22k1203r7uCUhZ6a|192.168.50.100|50567    |52.182.143.212|443      |tcp  |ssl    |1.235504|6226      |5256      |SF        |null      |null      |0           |ShADdaFf|12       |6726         |10       |5668         |-             |
+--------------------------+------------------+--------------+---------+--------------+---------+-----+-------+--------+----------+----------+----------+----------+----------+------------+--------+---------+-------------+---------+-------------+--------------+

So I've released a new PyPI version (0.4.6) with this fix.. please try it out and let me know if it's working for you.

brifordwylie commented 1 year ago

Okay, hopefully the fix in the new release is working for @yksee and @ralphholz. I'm going to close this ticket, but please let me know if you have any issues and we can reopen.

yksee commented 1 year ago

@brifordwylie Thanks! Things are working well now.

ralphholz commented 1 year ago

@yksee beat me to it. I just tried this on a very large data set, taking 10 random samples, and all timestamps matched those in the input files, also taking time zones into account. Many thanks!