brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

Creating record fields from regex named captures #4899

Open zmajeed opened 10 months ago

zmajeed commented 10 months ago

Is there a way to add a field with RE2 named capture groups?

zq -Z '/total (?P<total>\S+)/' - <<EOF
{
  msg: "total 27"
}
EOF

Output is

{
  msg: "total 27"
}

But I'd like to have

{
  msg: "total 27",
  total: "27"
}

And not need to do

zq -Z 'total:=regexp(/total (\S+)/, msg)[1]' - <<EOF
{
  msg: "total 27"
}
EOF

{
  msg: "total 27",
  total: "27"
}
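
For illustration only, the behavior requested above can be sketched outside Zed. The Python below is not Zed code and does not reflect how zq behaves today; `add_named_captures` is a hypothetical helper name, and the sketch assumes the field being matched holds a string.

```python
import re

def add_named_captures(record, field, pattern):
    """Return a copy of record extended with the named captures found in record[field].

    If the pattern does not match, the copy is returned without new fields.
    """
    match = re.search(pattern, record.get(field, ""))
    if match is None:
        return dict(record)
    # groupdict() maps each named group to the text it captured.
    return {**record, **match.groupdict()}

record = {"msg": "total 27"}
result = add_named_captures(record, "msg", r"total (?P<total>\S+)")
print(result)  # {'msg': 'total 27', 'total': '27'}
```

The original fields come first and the captured fields are appended, which mirrors the desired output shown above.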
philrz commented 10 months ago

@zmajeed: Thanks for your interest in Zed!

By chance are you familiar with Grok? We actually have a PR set to merge soon (#4827) where we'll introduce a Grok function in Zed as an easier way for creating new fields based on regex matches. Grok is capable of some sophisticated parsing tasks since it's often used to derive structure from detailed text logs, but it can just as easily be used to do the kind of simple regex match you're doing here. For instance, using the branch from that PR, here's how I can use Grok to create your named field.

$ echo '{msg: "total 27"}' | zq -Z 'yield grok("total %{NOTSPACE:total}", msg)' -
{
    total: "27"
}

That NOTSPACE is just a reference to one of the built-in regex patterns that can be used to construct other patterns:

https://github.com/brimdata/zed/blob/9bb3b2f7a54d7c1a9b89f02a7e17bf0ef4f5cd6c/pkg/grok/grok-patterns#L13

If I want to combine that with the original input record, I can use the spread operator.

$ echo '{msg: "total 27"}' | zq -Z 'yield {...grok("total %{NOTSPACE:total}", msg),...this}' -
{
    total: "27",
    msg: "total 27"
}

Grok can be kind of a topic all its own so once the feature merges I expect to add more docs and a blog post and we'll be happy to help folks with creating their parsers if they need assistance getting comfortable with the syntax. Let me know what you think.

zmajeed commented 10 months ago

Let me begin with saying Zed is the best open-source language I've found for log querying so far and thanks for all your hard work - because you may not like what follows!

Sorry I've never seen Grok before but it seems overly complicated and opaque for what I described - I'm sure it's very powerful but I'd prefer not having to learn any more sigils and syntax if possible

One reason I like Zed so far is its simplicity and minimally decorative approach to querying and transforming data - I really cannot think of anything simpler than using named captures to create new fields by pattern matching - and Zed already supports the syntax just not the semantics

philrz commented 10 months ago

@zmajeed: Yes, I understand Grok can seem kind of off-putting, particularly at first. Since you mentioned that log querying is a primary use case for you, at some point you may find it worthwhile to climb the learning curve. While it does involve applying a couple of additional concepts beyond just regular expressions, it can ultimately make complex log parsing logic a little easier to read/maintain. But I also respect that for a one-shot like what you showed it would feel like overkill.

Indeed, between what's there currently with regexp() and what we have planned with the Grok function, I can see how what you've proposed with creating new fields directly from named capture groups feels like a desirable "middle path". However, also due to the fact that we'll soon have the two approaches to choose from, the core Dev team that works on Zed is unlikely to add a third approach in the near future. We'll certainly hold the issue open as a reminder for when we have time and to collect interest if other users find themselves wanting the same.

Looking forward, I also want to solicit your feedback on a likely implementation. As you may have gleaned from the search expression docs, the reason your opening example returned the input value is that the regexp match resolved to true, and the behavior of search is to act as a filter: it drops each input value for which the expression evaluates to false or to an error, and outputs the values for which it evaluates to true. So while I understand the rationale for your "But I'd like to have...", having a search expression switch to returning a newly-constructed record value when capture groups are present in a regular expression wouldn't graft well onto the current language design. Therefore whatever we implement is likely to take the form of a new function (or an enhancement to the existing regexp() one), returning a record containing the named fields and matches rather than the array of matches regexp() returns today. Could you see yourself being comfortable using that, or would you find it as off-putting as you're finding the grok() function?
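
To make the distinction concrete, here is a minimal Python sketch (not Zed code) of the two behaviors described above: a search that only filters, and a function that returns a record of named matches. The names `search` and `captures` are hypothetical and chosen only for this illustration; they are not Zed functions.

```python
import re

PATTERN = re.compile(r"total (?P<total>\S+)")

def search(records, pattern, field="msg"):
    """Filter semantics: keep matching records unchanged, drop the rest."""
    return [r for r in records if pattern.search(r.get(field, ""))]

def captures(pattern, text):
    """Function semantics (hypothetical): return the named captures as a record."""
    m = pattern.search(text)
    return m.groupdict() if m else None

records = [{"msg": "total 27"}, {"msg": "no match here"}]

kept = search(records, PATTERN)
print(kept)  # [{'msg': 'total 27'}]

extracted = [captures(PATTERN, r["msg"]) for r in kept]
print(extracted)  # [{'total': '27'}]
```

The filter never changes the shape of a value, while the function always produces a new value, which is why the two are hard to fold into a single expression form.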

zmajeed commented 10 months ago

Thanks for providing color - creating new fields is one of the most common transformations I've needed in log queries - yes - an enhanced regexp() could be like JavaScript's RegExp.exec() and return a record with fields for named captures plus an array of indexed matches - but I'd quickly transfer its fields to the input record - it's nice to work with fields instead of indexes but even nicer if Zed could elide this artifact in the first place

Also would be nice if nested fields could be created from named captures - so I could have

1234:75 received getMembers request from host55 pid 962

and

/^(?<request.server.pid>\d+):(?<request.server.tid>\d+) received (?<request.name>\S+) request from (?<request.client.host>\S+) pid (?<request.client.pid>\d+)/

give me

{
  request: {
    name: "getMembers",
    server: {
      pid: 1234,
      tid: 75
    },
    client: {
      host: "host55",
      pid: 962
    }
  }
}

philrz commented 10 months ago

@zmajeed: Thanks for the pointer to JavaScript's RegExp.exec(). Indeed, we may use that for inspiration.

Regarding the nested fields, in case you bump into this elsewhere, note that there's a nest_dotted function that can help here. Example usage:

zq -Z 'nest_dotted()' - <<EOF
{
  "request.server.pid": 1234,
  "request.server.tid": 75,
  "request.name": "getMembers",
  "request.client.host": "host55",
  "request.client.pid": 962
}
EOF

Output is:

{
    request: {
        server: {
            pid: 1234,
            tid: 75
        },
        name: "getMembers",
        client: {
            host: "host55",
            pid: 962
        }
    }
}

I point this out because we have the same thing to consider in the Grok implementation. Some Grok implementations reject field names containing dots altogether, and we don't want to have that limitation. We could assume dots always imply nesting, but since field names containing dots are legal in JSON, we're currently leaning toward letting them pass through with dots intact, since applying nest_dotted() afterward can nest them if that's the desired result. So we may end up doing the same here as well.
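
As a rough illustration of the "pass through, then nest on request" approach, the following Python sketch mimics what the nest_dotted example above does to a flat record. It is an approximation written for this thread, not the Zed implementation, and it assumes no key conflicts (for example, both "a" and "a.b" present).

```python
def nest_dotted(record):
    """Turn keys such as "a.b.c" into nested dicts; keys without dots pass through."""
    nested = {}
    for key, value in record.items():
        *parents, leaf = key.split(".")
        target = nested
        for part in parents:
            # Create each intermediate level the first time it is seen.
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested

flat = {
    "request.server.pid": 1234,
    "request.server.tid": 75,
    "request.name": "getMembers",
    "request.client.host": "host55",
    "request.client.pid": 962,
}
nested = nest_dotted(flat)
print(nested)
```

Because the dotted keys remain ordinary field names until the function is applied, records that legitimately use dots in their names are left alone unless nesting is asked for.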

zmajeed commented 9 months ago

Tried nest_dotted() on an example that assumes new captured fields are in groups

zq -Z 'nest_dotted(groups)' - <<EOF
{
  "groups": {
    "request.server.pid": 1234,
    "request.server.tid": 75,
    "request.name": "getMembers",
    "request.client.host": "host55",
    "request.client.pid": 962
  }
}
EOF
{
  groups: {
    "request.server.pid": 1234,
    "request.server.tid": 75,
    "request.name": "getMembers",
    "request.client.host": "host55",
    "request.client.pid": 962
  }
}

Could nest_dotted() be avoided if a parameter were set to nest the returned captured fields - like regexp(/(?P<request.server.pid>\d+)/, nest_captures)?

{
  request: {
    server: {
      pid: 1234
    }
  }
}
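
A flag of this kind could behave roughly as in the Python sketch below, which is only an illustration and not a proposal for Zed's actual signature. The `captures` function, its `nest_captures` parameter, and the `sep` parameter are all hypothetical. Since Python's re module requires group names to be valid identifiers, the sketch writes "__" in group names where the proposal uses dots. Captured values stay strings here; converting them to numbers would be a separate step.

```python
import re

def captures(pattern, text, nest_captures=False, sep="__"):
    """Return named captures as a record, flat by default or nested on request."""
    m = re.search(pattern, text)
    if m is None:
        return None
    # Map the identifier-safe separator back to dots: request__name -> request.name
    flat = {name.replace(sep, "."): value for name, value in m.groupdict().items()}
    if not nest_captures:
        return flat
    nested = {}
    for key, value in flat.items():
        *parents, leaf = key.split(".")
        target = nested
        for part in parents:
            target = target.setdefault(part, {})
        target[leaf] = value
    return nested

line = "1234:75 received getMembers request from host55 pid 962"
pattern = (
    r"^(?P<request__server__pid>\d+):(?P<request__server__tid>\d+) "
    r"received (?P<request__name>\S+) request from "
    r"(?P<request__client__host>\S+) pid (?P<request__client__pid>\d+)"
)

flat = captures(pattern, line)
nested = captures(pattern, line, nest_captures=True)
print(flat)
print(nested)
```

With the flag off, the dotted names pass through as ordinary fields; with it on, the result has the nested shape requested earlier in the thread.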
philrz commented 9 months ago

@zmajeed: Thanks for reporting your nest_dotted(groups) not seeming to have any effect. It turns out that's a new bug, so a fix for that is being tracked via #4914. Someone's already working on a fix.

Thinking ahead to when that's fixed, when a developer implements the original feature captured in this issue, I'm not sure if they'll make the design decision to let the dotted field names exist, with the expectation that nest_dotted() could be applied, or offer some kind of flag to immediately create them in nested form like you proposed. It's effectively a language design question. I previously predicted the approach of relying on a downstream nest_dotted() because it's where I've seen the language design headed over time, but that could change.

philrz commented 9 months ago

Update: The initial grok() support has been merged (#4827), but it doesn't yet support Grok's own named capture syntax (such as is shown in the first syntax for custom patterns in the Logstash Grok filter doc). If/when we get to adding support in Zed for the general non-Grok regexp use case, it might be good to see if we could add support in Zed's grok() at the same time.