Velocidex / velociraptor

Digging Deeper....
https://docs.velociraptor.app/

FR: Add timestamp formatting function #3594

Open predictiple opened 1 week ago

predictiple commented 1 week ago

It would be nice to have access to the predefined formats and custom output formatting that Go's time.Format provides (https://pkg.go.dev/time#pkg-constants).

We can currently build custom formats using format(), but it can get messy. For example, if you need a timestamp string to pass to an external system that demands milliseconds, some string manipulation is required because only nanoseconds are available. And if the original parsed timestamp had no fractional seconds, .Nanosecond will be int 0 (not Null), which breaks the formatting. Compensating for this requires messy VQL like this:

LET eventDate = format(
    format="%v.%v",
    args=[parse_string_with_regex(string=Timestamp.String, regex='''(^.+)Z''').g1, if(
      condition=Timestamp.Nanosecond,
      then=parse_string_with_regex(
        string=str(
          str=Timestamp.Nanosecond),
        regex='''(^\d\d\d)''').g1,
      else="000")])

If we could specify something like Timestamp.Format.StampMilli that would be more elegant.

scudette commented 1 week ago

I'm not sure I understand the problem with the nanosecond -> microsecond conversion. Isn't it just multiplying by 1000000?

predictiple commented 1 week ago

Hopefully this example will help.

Say you have timestamps like this 2024-02-02T04:42:00Z and like this 2024-02-02T04:42:00.123456Z.

They could originate in the same log or could come from separate sources which need to be post-processed to a consistent format so that you can send them to an API, as strings, which requires them to be formatted consistently like this 2024-02-02T04:42:00.000 and this 2024-02-02T04:42:00.123

Parsing them with timestamp() means the Nanosecond component will be int 0 and 123456000 respectively. Multiplication can help with the second one but not the first because it won't pad it out to become "000". Truncating them, as I did in the above example using regex, will also work for the second timestamp but not the first.

Being able to specify a string format should do the necessary padding of the 1st timestamp and also the trimming of the 2nd timestamp.
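
For illustration, here is a minimal Go sketch of the padding and trimming described above, using the standard time package directly rather than VQL. The helper name formatMillis and the constant milliLayout are made up for this example. It assumes the inputs are RFC 3339 strings.

```go
package main

import (
	"fmt"
	"time"
)

// milliLayout is a Go reference-time layout with exactly three fractional
// digits. The ".000" form pads missing digits and truncates extra ones.
const milliLayout = "2006-01-02T15:04:05.000"

// formatMillis is a hypothetical helper: it parses an RFC 3339 timestamp
// and re-emits it in UTC with a fixed three-digit fraction.
func formatMillis(s string) (string, error) {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		return "", err
	}
	return t.UTC().Format(milliLayout), nil
}

func main() {
	inputs := []string{
		"2024-02-02T04:42:00Z",        // no fractional seconds
		"2024-02-02T04:42:00.123456Z", // microsecond precision
	}
	for _, s := range inputs {
		out, err := formatMillis(s)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

With these two inputs the sketch prints 2024-02-02T04:42:00.000 and 2024-02-02T04:42:00.123, which is the consistent shape the API expects.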

scudette commented 1 week ago

Integers and floats are always formatted according to the format string directive in format(). The example you gave with 0 milliseconds can be zero-padded both before and after the decimal point as required. For example, to pad to 2 digits before the point and 3 digits after:

LET T <= timestamp(epoch="2024-02-02T04:42:00Z")

SELECT format(format="%d!%02d-%02dT%02d:%02d:%06.3fZ", args=[
  T.Year, T.Month, T.Day, T.Hour, T.Minute, T.Nanosecond / 1000000000.0
]) FROM scope()

Gives 2024!02-02T04:42:00.000Z (Note that if you try this in the notebook you need to make it look different from an ISO timestamp or the GUI will reformat it according to the user's timezone preferences - that's why I have the ! in there).

predictiple commented 1 week ago

Thanks, that's good to know. I didn't know that, which highlights the point I'm making about such an approach being messy. I don't think we should expect users to be proficient with that formatting syntax.

The time.Time layouts are much more comprehensible:

LET T <= timestamp(epoch="2024-02-02T04:42:00Z")
LET L <= "2006-01-02T15:04:05.000"
SELECT format_time(layout=L, time=T) FROM scope()  -- or something like that would be awesome

Maybe it could even be extra functionality within timestamp() or format() rather than a separate function?

predictiple commented 1 week ago

For anyone dealing with milliseconds/microseconds/nanoseconds in future, this is what's currently needed:

LET T <= timestamp(epoch="2024-02-02T04:42:05.26Z")

SELECT format(format="%d!%02d-%02dT%02d:%02d:%02d.%03d", args=[
  T.Year, T.Month, T.Day, T.Hour, T.Minute, T.Second, int(int=T.Nanosecond / 1000000)
]) FROM scope()

The int() conversion is needed because division always returns a float (surprisingly!), even when an int such as T.Nanosecond is the dividend.

edit: Alternatively you can use this unintuitive %03.f float syntax in the format specification:

LET T <= timestamp(epoch="2024-02-02T04:42:05.26Z")

SELECT format(format="%d!%02d-%02dT%02d:%02d:%02d.%03.f", args=[
  T.Year, T.Month, T.Day, T.Hour, T.Minute, T.Second, T.Nanosecond / 1000000
]) FROM scope()
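
As a point of comparison, the two variants can be sketched in plain Go. This assumes VQL's format() passes its directives to Go's fmt package largely unchanged, which is not confirmed in this thread. The helper names are made up for illustration. One difference worth noting is that integer division truncates, whereas a float verb with zero decimals rounds.

```go
package main

import (
	"fmt"
	"time"
)

// mustParse is a small illustrative helper that panics on bad input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// formatComponents builds the string field by field. Integer division
// truncates the nanoseconds down to whole milliseconds.
func formatComponents(t time.Time) string {
	return fmt.Sprintf("%d-%02d-%02dT%02d:%02d:%02d.%03d",
		t.Year(), t.Month(), t.Day(),
		t.Hour(), t.Minute(), t.Second(),
		t.Nanosecond()/1000000)
}

// roundedMillis mirrors the "%03.f" variant: the milliseconds are a float
// printed with zero decimals, so the value is rounded, not truncated.
func roundedMillis(t time.Time) string {
	return fmt.Sprintf("%03.f", float64(t.Nanosecond())/1e6)
}

func main() {
	for _, s := range []string{
		"2024-02-02T04:42:05.26Z",
		"2024-02-02T04:42:05.9996Z", // rounds up to four digits with %03.f
	} {
		t := mustParse(s)
		fmt.Println(formatComponents(t), roundedMillis(t))
	}
}
```

In Go the second input shows the catch: the truncating version yields .999 while the rounding version yields 1000, which no longer fits in three digits.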

scudette commented 1 week ago

I don't necessarily agree that the timestamp formatting system in the time package is any more readable than the standard formatting directives. It's certainly not as powerful in the type of formatting it can provide.

Perhaps we just need a bunch of vql functions to export things into standard formats to emulate the formatting constants in the time package?

I'm actually also not sure that this comes up very frequently in the real world, as most modern systems should be using ISO format. We also don't want to encourage people to format their times, because we want to keep times in ISO format as much as possible.

If we were to expose a set of common functions, what would be a reasonably useful set? That would help define the need and show how flexible this has to be.

predictiple commented 1 week ago

This is definitely a usability feature rather than a technical limitation because string formatting is very powerful, as you've pointed out. But for most people, including me, this means spending some time looking for examples on stackoverflow.

I think there are at least a few popular REST APIs that insist on milliseconds or microseconds. In my case the former, when dealing with IRIS. Even though it's basically accepting ISO format it expects the milliseconds to be explicitly 3 digits. So for example these won't be accepted by the API even though they are valid ISO-8601/RFC-3339:

So maybe we can get by with formatting constants for milli/micro/nano variants of RFC-3339? I see that the time package already has an RFC3339Nano layout constant which I could have used with a simple regex to extract up to the millisecond.

Possibly some others that might get used in printed reports:

predictiple commented 1 week ago

The argument in favour of supporting timestamp "layout" formatting is that if you google "golang format timestamp" most of the results, like this one, will show examples/tutorials of it being done that way. Only ancient examples use string formatting.

We also use it for parsing unusual timestamp strings, so it's consistent to also use it for output formatting.

I don't know if it adds a lot of complexity or bloat. Maybe it does. I hoped it would be something we could get relatively "for free" from the time package.

scudette commented 1 week ago

I don't think we should be supporting or encouraging any of the timestamp formats you mentioned in your comments (especially not the US style ones). We should only support RFC-3339 timestamps out of the box and make it harder to produce other timestamps.

I think this issue is actually highlighting a different problem - You stated that the system you use to consume the RFC 3339 timestamps requires exactly 3 significant decimal digits. This is not a requirement specified in the RFC as far as I can see so it is likely that that system actually does not support RFC 3339 timestamps correctly and we should file a bug against that upstream. I would be very surprised if the upstream system does not accept properly formatted RFC 3339 timestamps which may or may not include fractional second - likely if they use a standard library to parse the times it should work out of the box.

Even if we allowed RFC3339Nano to be used to format the times, you would still need to massage them with a regex to comply with this weird API requirement, so this is a hack at best.

To me this requirement is very weird - it is like saying that your API will only accept numbers with exactly 3 digits of precision, no more and no less (e.g. 1.000, 2.000 etc.) - I just can't think of a valid reason to add this restriction.
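
To make this concrete, here is a small Go sketch of what the standard RFC3339Nano constant produces. Its layout uses 9s for the fraction, so trailing zeros are dropped and the fraction length varies with the input. The constant fixedMilliLayout is an assumed custom layout for comparison, not something defined by the time package.

```go
package main

import (
	"fmt"
	"time"
)

// fixedMilliLayout is an assumed custom layout: RFC 3339 with exactly
// three fractional digits. It is not a constant from the time package.
const fixedMilliLayout = "2006-01-02T15:04:05.000Z07:00"

// mustParse is a small illustrative helper that panics on bad input.
func mustParse(s string) time.Time {
	t, err := time.Parse(time.RFC3339, s)
	if err != nil {
		panic(err)
	}
	return t
}

// asNano formats with the standard constant; trailing zeros are removed.
func asNano(s string) string {
	return mustParse(s).Format(time.RFC3339Nano)
}

// asFixedMilli formats with the fixed three-digit layout.
func asFixedMilli(s string) string {
	return mustParse(s).Format(fixedMilliLayout)
}

func main() {
	for _, s := range []string{
		"2024-02-02T04:42:00Z",
		"2024-02-02T04:42:05.26Z",
		"2024-02-02T04:42:00.123456Z",
	} {
		fmt.Printf("nano=%s milli=%s\n", asNano(s), asFixedMilli(s))
	}
}
```

For a timestamp without fractional seconds, RFC3339Nano emits no fraction at all, so a regex that extracts three digits has nothing to match. A layout with a fixed number of 0 digits avoids the post-processing.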

predictiple commented 1 week ago

Yes it is weird, but I confirmed it experimentally and it matches up with their API examples. I didn't look into their code to find the reason and they don't seem to document their timestamp requirements anywhere. The Iris timeline artifact now works robustly (previously it would not work with any artifact result sets having timestamps without fractional seconds) and my interest ends there. In the past I've come across other systems with similarly weird timestamp requirements but we could handle those too with string formatting.

The other use cases are not related to data-interchange but rather for human consumption. For example someone may want to write out times to a markdown table for a report. It could be that some Americans want to see US time format in the GUI. Maybe the US military uses Velociraptor and would like to see their weird US military formats. They are not my use cases but they'd seem reasonable if they were to come up.

It doesn't come up often (at least that we are aware of) but there are examples on Discord here and here which show that working with timestamp components and formatting is not as easy as it could be. It looks to me like this is the exact reason why the Go layout formatting exists: It spares the user from having to understand timestamp components (as per 2nd example) and advanced string formatting (as per 1st example). The format constants are a nice-to-have bonus but not essential.

As I mentioned it's about usability, not capability. A convenience feature. Maybe the demand for it is low, maybe the implementation cost is high - I really don't know. I don't think that having it would encourage people to go wild with timestamp formats.

scudette commented 1 week ago

There are two ways this can be solved:

  1. The question of what time formats are acceptable applies to all times everywhere. This is similar to the timezone choice: it just influences the way the JSON serializer marshals the time object to JSON. This applies automatically to all timestamps within the query.

  2. The second option is to allow a timestamp_format() function which uses the alternative Go-specific formatting specification. It is not very standard itself, but it can be used as a one-off for a column and, as you say, it is really for convenience (although it may be more complex to use than the more standard format() function). The standard format() is available in all languages I am aware of and works exactly the same in each, while the timestamp layout specification is specific to Go and is actually confusing to use (the meaning of each number is very specific and it is hard to actually use).

I think these solve different problems - the first applies automatically to all timestamps while the second is just for a specific timestamp. I am not sure if IRIS has other timestamps that need massaging as well (I still think it is worth filing an issue against that project if they don't properly support RFC 3339 timestamps).

predictiple commented 1 week ago

Yes you're right, they are 2 separate problems/solutions although the 2nd can be an ad-hoc substitute in the absence of the 1st.

The 2nd is what this FR is about. Timestamps are a core aspect of DFIR, arguably THE core one. We want users to be able to work with them as easily as possible. It's unreasonable (I think) to expect users to have a deep understanding of timestamp objects and string formatting. That example on Discord where the formatting verb equates to a method on the Month type will be a complete surprise to 99.99% of users. We have to ask ourselves: how are users supposed to know that? Unless they are developers I don't see how they would even know where to start looking for solutions.

With layout formatting users don't need to know any of that. It sidesteps the hard part of the problem. And as I mentioned there are plenty of examples and tutorials whereas for the string formatting approach there aren't, probably because in Go it's considered antiquated to do it that way.

If there are shortcomings to the layouts approach then we still have format() to fall back on, or people could still use it if they prefer it. We already use layouts for parsing custom formats. We could have used something like "yyyy-MM-dd’T’HH:mm:ss" for parsing but the layouts approach is definitely better and easier on the eye. So this is just the reverse situation. Users only need to learn 1 thing to do both.
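
The symmetry between parsing and formatting can be sketched in Go as follows. The helper name convert and the Apache-style input layout are chosen purely for illustration. The same reference-time notation describes both the unusual input and the desired output.

```go
package main

import (
	"fmt"
	"time"
)

// convert is a hypothetical helper: it parses value using inLayout and
// re-emits it in UTC using outLayout. Both are Go reference-time layouts.
func convert(inLayout, outLayout, value string) (string, error) {
	t, err := time.Parse(inLayout, value)
	if err != nil {
		return "", err
	}
	return t.UTC().Format(outLayout), nil
}

func main() {
	// An access-log style timestamp, used here only as an example input.
	const in = "02/Jan/2006:15:04:05 -0700"
	const out = "2006-01-02T15:04:05.000"

	s, err := convert(in, out, "02/Feb/2024:04:42:00 +0000")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(s)
}
```

The example input comes out as 2024-02-02T04:42:00.000, using one notation in both directions.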

There's no urgency to this FR. I was just reminded of it due to the (now solved) issue with Iris, but they have bigger issues like not even allowing the timestamp format to be changed in their UI - it's always American :roll_eyes: