Closed: yksee closed this issue 1 year ago
Hi @yksee,
Hmmm... looking at the two timestamps.. the first is UTC and the second is a timezone-aware datetime... so they look the 'same', meaning the UTC one gives hour 16 and the timezone-aware one gives 8 + 08 = 16. In general, storing date/time in UTC is almost always the right thing to do...
Perhaps I'm missing something more subtle... if so please respond and we'll take a deeper look at the issue.
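For reference, a quick Python check (using the first ts value from the conn.log posted below) shows that a UTC rendering and a +08 timezone-aware rendering are the same instant:

```python
from datetime import datetime, timedelta, timezone

ts = 1647504456.038126  # first 'ts' from the conn.log
utc = datetime.fromtimestamp(ts, tz=timezone.utc)
plus8 = utc.astimezone(timezone(timedelta(hours=8)))

print(utc)    # hour 8 in UTC
print(plus8)  # hour 16 in +08
assert utc == plus8  # same instant: 8 + 8 = 16
```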
Thanks for the reply. Let me elaborate with this conn.log:
1647504456.038126 CLZwfq2hHnVH2XKvQg 192.168.50.100 50527 20.197.71.89 443 tcp - - - - SH - - 0 F 1 40 0 0 -
1647504460.020920 CQ9qtc4cGl2PUgVkma 192.168.50.243 5353 224.0.0.251 5353 udp dns - - - S0 - - 0 D 1 89 0 0 -
1647504479.519793 Ch22k1203r7uCUhZ6a 192.168.50.100 50567 52.182.143.212 443 tcp ssl 1.235504 6226 5256 SF - - 0 ShADdaFf 12 6726 10 5668 -
I got all 3 entries with the same timestamp: 2022, 3, 17, 16, 8, 11, 806720. This is different from what I get using date -d:
Thu 17 Mar 2022 04:07:36 PM +08
Thu 17 Mar 2022 04:07:40 PM +08
Thu 17 Mar 2022 04:07:59 PM +08
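The expected values (matching the date -d output above) can be reproduced in Python, assuming the local zone is UTC+8:

```python
from datetime import datetime, timedelta, timezone

tz_plus8 = timezone(timedelta(hours=8))  # assumed local zone, per the +08 in the output
for ts in (1647504456.038126, 1647504460.020920, 1647504479.519793):
    # render like `date -d @<epoch>`
    print(datetime.fromtimestamp(ts, tz=tz_plus8).strftime("%a %d %b %Y %I:%M:%S %p +08"))
# Thu 17 Mar 2022 04:07:36 PM +08
# Thu 17 Mar 2022 04:07:40 PM +08
# Thu 17 Mar 2022 04:07:59 PM +08
```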
Thanks for the details and example conn log, this will really help me understand what's going on. I'll take a look soon and try to make a fix so that the datetimes are correct.
Thanks! For your info, log_to_dataframe seems to give the correct datetimes.
Just stumbled upon this issue, which is a bit of a blocker if the problem still persists - is it confirmed and/or fixed? Otherwise, I can run the conversion code on a couple of files here, too, to gather extra info.
@ralphholz yeah I let this slip through the cracks.. my past self left me some notes... I think pyarrow used to have a couple of issues with timestamps which are now gone.. so I can clean those up. I'll try to fit this in today...
Notes from past self:
# Note: We're going to import pyarrow but it currently has an open issue around supporting time deltas
# - https://issues.apache.org/jira/browse/ARROW-6780 so we have to convert timedelta fields to str
# Also see:
# - https://stackoverflow.com/questions/53893554/transfer-and-write-parquet-with-python-and-pandas-got-timestamp-error
#
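The workaround those notes describe might look roughly like this (a sketch using pandas only; the column name is illustrative):

```python
import pandas as pd

# Zeek 'duration' fields parse naturally into timedelta64 columns
df = pd.DataFrame({"duration": pd.to_timedelta([1.235504], unit="s")})

# Workaround from the notes: convert timedelta columns to str before
# handing the frame to pyarrow for parquet writing
for col in df.select_dtypes(include=["timedelta64[ns]"]).columns:
    df[col] = df[col].astype(str)

print(df["duration"].iloc[0])  # e.g. "0 days 00:00:01.235504"
```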
Okay.. so just capturing some info here...
1) examples/zeek_to_parquet.py works fine...
uid id.orig_h id.orig_p ... resp_pkts resp_ip_bytes tunnel_parents
ts ...
2022-03-17 08:07:36.038126080 CLZwfq2hHnVH2XKvQg 192.168.50.100 50527 ... 0 0 None
2022-03-17 08:07:40.020920064 CQ9qtc4cGl2PUgVkma 192.168.50.243 5353 ... 0 0 None
2022-03-17 08:07:59.519793152 Ch22k1203r7uCUhZ6a 192.168.50.100 50567 ... 10 5668 None
2) examples/zeek_to_parquet_with_spark.py seems to do some weird rounding
In [10]: df
Out[10]:
ts uid id_orig_h id_orig_p ... orig_ip_bytes resp_pkts resp_ip_bytes tunnel_parents
0 2022-03-17 08:08:32 CLZwfq2hHnVH2XKvQg 192.168.50.100 50527 ... 40 0 0 -
1 2022-03-17 08:08:32 CQ9qtc4cGl2PUgVkma 192.168.50.243 5353 ... 89 0 0 -
2 2022-03-17 08:08:32 Ch22k1203r7uCUhZ6a 192.168.50.100 50567 ... 6726 10 5668 -
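That "weird rounding" is consistent with the ts field having been read as a 32-bit float: around 1.6e9 seconds, adjacent float32 values are 128 seconds apart, so all three timestamps snap to the same representable value. A quick demonstration (assuming numpy is available):

```python
import numpy as np
from datetime import datetime, timezone

ts_values = [1647504456.038126, 1647504460.020920, 1647504479.519793]
for ts in ts_values:
    rounded = float(np.float32(ts))  # what survives a float32 read
    print(datetime.fromtimestamp(rounded, tz=timezone.utc))
# all three print 2022-03-17 08:08:32+00:00
```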
Ah.. found the issue... so the TLDR is that we needed to read the Zeek 'ts' field as a double (instead of float) and then convert to a Spark TimeStamp type.
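The fix can be sketched in plain Python, where a float is already a 64-bit double: parsing ts as a double keeps the microsecond precision, and the value then converts cleanly to a timestamp (in Spark terms, read the field as DoubleType and cast it to TimestampType):

```python
from datetime import datetime, timezone

raw = "1647504456.038126"  # Zeek 'ts' field as it appears in the log
ts = float(raw)            # Python float is a 64-bit double
dt = datetime.fromtimestamp(ts, tz=timezone.utc)
print(dt)  # 2022-03-17 08:07:36 UTC, with microseconds intact
```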
❯ python zeek_to_parquet_with_spark.py /Users/briford/work/zat/data/conn_date_issue.log foo.parquet
23/01/26 14:37:37 WARN Utils: Your hostname, Bosco.local resolves to a loopback address: 127.0.0.1; using 192.168.50.116 instead (on interface en0)
23/01/26 14:37:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/26 14:37:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
/Users/briford/work/zat/data/conn_date_issue.log --> foo.parquet
+--------------------------+------------------+--------------+---------+--------------+---------+-----+-------+--------+----------+----------+----------+----------+----------+------------+--------+---------+-------------+---------+-------------+--------------+
|ts |uid |id_orig_h |id_orig_p|id_resp_h |id_resp_p|proto|service|duration|orig_bytes|resp_bytes|conn_state|local_orig|local_resp|missed_bytes|history |orig_pkts|orig_ip_bytes|resp_pkts|resp_ip_bytes|tunnel_parents|
+--------------------------+------------------+--------------+---------+--------------+---------+-----+-------+--------+----------+----------+----------+----------+----------+------------+--------+---------+-------------+---------+-------------+--------------+
|2022-03-17 02:07:36.038126|CLZwfq2hHnVH2XKvQg|192.168.50.100|50527 |20.197.71.89 |443 |tcp |- |0.0 |0 |0 |SH |null |null |0 |F |1 |40 |0 |0 |- |
|2022-03-17 02:07:40.02092 |CQ9qtc4cGl2PUgVkma|192.168.50.243|5353 |224.0.0.251 |5353 |udp |dns |0.0 |0 |0 |S0 |null |null |0 |D |1 |89 |0 |0 |- |
|2022-03-17 02:07:59.519793|Ch22k1203r7uCUhZ6a|192.168.50.100|50567 |52.182.143.212|443 |tcp |ssl |1.235504|6226 |5256 |SF |null |null |0 |ShADdaFf|12 |6726 |10 |5668 |- |
+--------------------------+------------------+--------------+---------+--------------+---------+-----+-------+--------+----------+----------+----------+----------+----------+------------+--------+---------+-------------+---------+-------------+--------------+
So I've released a new PyPI version (0.4.6) with this fix.. please try it out and let me know if it's working for you.
Okay hopefully the fix/new release are working for @yksee and @ralphholz. I'm going to close this ticket but please let me know if you have any issues and we can reopen.
@brifordwylie Thanks! Things are working well now.
@yksee beat me to it. I just tried this on a very large data set, taking 10 random samples, and all timestamps matched those in the input files, also taking time zones into account. Many thanks!
When I do a zeek_to_parquet_with_spark conversion on ssl.log from https://data.kitware.com/#collection/58d564478d777f0aef5d893a/folder/58ebc49a8d777f16d095fd00, the timestamp starts from 2012, 03, 16, 20, 30, 33, 873408. I followed the code from https://nbviewer.org/github/SuperCowPowers/zat/blob/main/notebooks/Zeek_to_Parquet.ipynb.
This is slightly different from what there is in the ssl.log, where the ts starts from 1331901000.000000. Doing a date -d @'1331901000.000000' gives Fri 16 Mar 2012 08:30:00 PM +08