brimdata / zed

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License

search_replace function #5186

Open philrz opened 1 month ago

philrz commented 1 month ago

tl;dr

A community zync user had created a switch-based Zed program for a large search-and-replace task in their data transformation pipeline. They asked for some assistance in simplifying it for maintainability. We came up with some improvements using existing Zed building blocks, but in the end @mccanne had the following thought for something new we could write to nail this directly:

What if we used capture groups in Go's regexp library to create a search_replace function? You could give it the map and we would translate the patterns to capture groups then use the capture group index to select the replacement string.
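To make the capture-group idea concrete, here is a minimal Go sketch, not a proposed implementation. The `Change` and `Replacer` types and the `NewReplacer`/`SearchReplace` names are hypothetical. It uses an ordered slice rather than a Go map, since Go maps have no defined iteration order. Each pattern is wrapped in its own capture group, the groups are joined with `|`, and the index of the group that participated in the match selects the replacement.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Change pairs a regexp pattern with its replacement (hypothetical type).
type Change struct {
	Pattern     string
	Replacement string
}

// Replacer holds the combined regexp plus, for each change, the index of
// the capture group that wraps its pattern (hypothetical type).
type Replacer struct {
	re      *regexp.Regexp
	groups  []int
	changes []Change
}

// NewReplacer wraps every pattern in a capture group and joins the groups
// with "|". Patterns may contain capture groups of their own, so the
// wrapper index of each pattern is tracked via NumSubexp.
func NewReplacer(changes []Change) (*Replacer, error) {
	parts := make([]string, 0, len(changes))
	groups := make([]int, len(changes))
	next := 1 // group 0 is the whole match
	for i, c := range changes {
		sub, err := regexp.Compile(c.Pattern)
		if err != nil {
			return nil, fmt.Errorf("pattern %q: %w", c.Pattern, err)
		}
		groups[i] = next
		next += 1 + sub.NumSubexp()
		parts = append(parts, "("+c.Pattern+")")
	}
	re, err := regexp.Compile(strings.Join(parts, "|"))
	if err != nil {
		return nil, err
	}
	return &Replacer{re: re, groups: groups, changes: changes}, nil
}

// SearchReplace returns the replacement whose wrapper group took part in
// the match, or def when nothing matches.
func (r *Replacer) SearchReplace(s, def string) string {
	loc := r.re.FindStringSubmatchIndex(s)
	if loc == nil {
		return def
	}
	for i, g := range r.groups {
		if loc[2*g] >= 0 { // -1 means the group did not participate
			return r.changes[i].Replacement
		}
	}
	return def
}

func main() {
	r, err := NewReplacer([]Change{
		{`foo|bar`, "The Foobar category"},
		{`a.*\-z.*`, "The A-to-Z category"},
	})
	if err != nil {
		panic(err)
	}
	for _, desc := range []string{"It's foo time", "a is the first letter-z is last", "Something else"} {
		fmt.Println(r.SearchReplace(desc, "The default category"))
	}
}
```

One design point to be aware of: a single combined regexp finds the leftmost match in the input, so when two patterns both match a string, the one that matches earliest in the string wins rather than the one listed first. For example, `"a-z foo"` would select the A-to-Z replacement here even though the Foobar pattern is listed first and also matches.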

Details

At the time this issue is being filed, Zed is at commit 9766d17.

The original question from the community zync user was posed as:

do you know if there is way to declare a const with a regular expression ? ex: const myregexp=/myr.+n/1

Indeed, this is not currently possible in Zed. There are other examples of places where a user might expect to be able to use regexps but can't (#4917).

To illustrate the use case, the user shared their program, but it's confidential and can't be pasted here. However, here's a simplified program switch-grep.zed that uses their approach:

switch (
  case (grep(/foo|bar/, desc)) => category := "The Foobar category"
  case (grep(/a.*\-z.*/, desc)) => category := "The A-to-Z category"
  default => category := "The default category"
)

With input data descriptions.zson:

{"desc": "It's foo time"}
{"desc": "a is the first letter-z is last"}
{"desc": "Something else"}

Running it:

$ zq -version
Version: v1.17.0-1-g9766d17d

$ zq -I switch-grep.zed descriptions.zson 
{desc:"It's foo time",category:"The Foobar category"}
{desc:"Something else",category:"The default category"}
{desc:"a is the first letter-z is last",category:"The A-to-Z category"}

The user's program actually had about 70 case statements.

In addition to its sheer size, a couple other challenges with maintainability are evident here:

  1. There's a lot of code repeated in support of the actual strings that form the search/replace pairs (i.e., case (grep(..., desc)) => category := ...)
  2. Per the user's point, if the regexps could be defined as const, they could be more easily re-used in other contexts (e.g., define them in a separate file that's included with zq -I ... to be invoked in many programs).

After a little hacking, I found this could be improved to this switch-regexp.zed:

const changes = |{
  "foo|bar": "The Foobar category",
  "a.*\\-z.*": "The A-to-Z category"
}|

category := coalesce((over changes with desc | switch (case regexp(key, desc) != null => yield value | head 1)), "The default category")

Running it produces the same records we saw before, this time in input order:

$ zq -I switch-regexp.zed descriptions.zson 
{desc:"It's foo time",category:"The Foobar category"}
{desc:"a is the first letter-z is last",category:"The A-to-Z category"}
{desc:"Something else",category:"The default category"}

Things to highlight:

  1. By using over we're able to avoid the repeat of all the case clauses
  2. By using the regexp function instead of grep we're able to leverage the former's unique ability to take a regular expression that's defined as a string, and strings can be defined via const. However, this does come with one caveat: Some escape sequences that aren't valid for a string but are for regex (such as the \- in the second pattern) now need a double backslash.
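The double-backslash caveat is not unique to Zed. As an illustration only, the same thing happens in Go, whose regexp library is mentioned above: an interpreted string literal needs `\\` to carry a single backslash through to the regexp engine, while a raw string literal does not. The constant names and the `matches` helper are made up for this sketch.

```go
package main

import (
	"fmt"
	"regexp"
)

// escapedPattern is an interpreted string literal: the backslash must be
// doubled so that one backslash reaches the regexp engine.
const escapedPattern = "a.*\\-z.*"

// rawPattern is a raw string literal: the backslash is taken as written.
const rawPattern = `a.*\-z.*`

// matches reports whether pattern matches anywhere in s. An invalid
// pattern is treated as a non-match to keep the sketch short.
func matches(pattern, s string) bool {
	ok, err := regexp.MatchString(pattern, s)
	return err == nil && ok
}

func main() {
	// Both literals denote the same pattern text.
	fmt.Println(escapedPattern == rawPattern)
	fmt.Println(matches(rawPattern, "a is the first letter-z is last"))
}
```

A first-class regexp literal in Zed would presumably play the role that the raw string plays here.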

The user was satisfied with this improvement. However, having watched this unfold, @mccanne had the idea quoted above for a purpose-built search_replace function, which would spare the user from needing to know or look up the coalesce/over/switch/regexp combination shown here. If regexps in their / / form also became first-class concepts at the same time, that would surely be convenient as well: the user could avoid the "double backslash" overhead when creating the map and hence more easily re-use regexps from other tools without modification. This seems orthogonal, though.

philrz commented 1 month ago

Building further off what was shown previously, in the absence of the proposed purpose-built search_replace function written in Go as a core part of the Zed language, it could be written as a user-defined function such as this search-replace-func.zed.

func search_replace (s, change_map, default): (
  coalesce((over change_map with s | switch (case regexp(key, s) != null => yield value | head 1)), default)
)
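For readers more familiar with Go than Zed, the logic of this function (try each pattern, take the first replacement whose pattern matches, otherwise fall back to the default) can be sketched roughly as below. This is an analogy, not how Zed evaluates the expression. The `rule` type and `firstMatch` name are hypothetical, and the sketch uses an ordered slice so that "first" is well defined.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule pairs a regexp pattern with its replacement (hypothetical type).
type rule struct {
	pattern     string
	replacement string
}

// firstMatch returns the replacement of the first rule whose pattern
// matches s, or def when no rule matches. An invalid pattern is reported
// as an error.
func firstMatch(s string, rules []rule, def string) (string, error) {
	for _, r := range rules {
		ok, err := regexp.MatchString(r.pattern, s)
		if err != nil {
			return "", fmt.Errorf("pattern %q: %w", r.pattern, err)
		}
		if ok {
			return r.replacement, nil
		}
	}
	return def, nil
}

func main() {
	rules := []rule{
		{`foo|bar`, "The Foobar category"},
		{`a.*\-z.*`, "The A-to-Z category"},
	}
	for _, desc := range []string{"It's foo time", "a is the first letter-z is last", "Something else"} {
		category, err := firstMatch(desc, rules, "The default category")
		if err != nil {
			panic(err)
		}
		fmt.Println(category)
	}
}
```

This version recompiles each pattern on every call, which keeps the sketch short but would be wasteful for a map of about 70 patterns applied to many records. Compiling the patterns once up front would be the obvious refinement.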

I've also broken out the map of pairs for search/replace in this changes.zed.

const changes = |{
  "foo|bar": "The Foobar category",
  "a.*\\-z.*": "The A-to-Z category"
}|

Putting it all together at the command line:

$ zq -I search-replace-func.zed -I changes.zed 'category := search_replace(desc, changes, "The default category")' descriptions.zson 
{desc:"It's foo time",category:"The Foobar category"}
{desc:"a is the first letter-z is last",category:"The A-to-Z category"}
{desc:"Something else",category:"The default category"}

This shows simple modularity at the CLI. We've also discussed the idea of more formally introducing "Zed modules" (#2599) and related concepts like having a standard library of functionality written as Zed to augment the rest of the core functionality implemented in Go, community users being able to share their own modules/libraries easily such as via GitHub, and so forth. This opens the door to being able to deliver enhancements in the short term as Zed that might be hacky at times but can be used like a black box. If perf problems or functional gaps in such an implementation limit usability, usage in its initial form would help validate the use cases for more ambitious core functionality later written in Go.

philrz commented 1 month ago

Also related: The original user later expressed an interest in being able to pull the search/replace pairings from a pool rather than defining them in a const. I was able to hack something together that does this using existing building blocks, but it's super ugly. A better approach would be if we had what's described in #3201.

Separately, @mattnibs has looked over everything shown here and had the impression that a "cross-product join" would be the way he'd approach this problem.