brimdata / super

A novel data lake based on super-structured data
https://zed.brimdata.io/
BSD 3-Clause "New" or "Revised" License
1.39k stars 64 forks

Converting multiple pools to one pool (virtual) #4582

Open AFgh24 opened 1 year ago

AFgh24 commented 1 year ago

Hello

I have a lot of pools, and it is possible to search them all. But I have a small problem: I have a problem to search only in ip columns, because in all pool the name of the desired column is different and not just ip.

Is it possible to define a virtual pool? And it is possible to change the names of the columns and add the list of columns in a virtual pool

For example, these are the pools I want to search, with their differently named columns:

pool1=iplogin
pool2=Fristip
pool3=lostip
pool4=ipcountry
====> PoolALL=Allip

In this case, the search will be easy and I will only search the desired column and not all the data. The pool is a virtual interface and does not store any data in it.

To search among all defined columns, I only search allip

philrz commented 1 year ago

@AFgh24: Indeed there may not be a straightforward way to achieve precisely what you're describing. However some of what you've shown points to ideas we've had in the past that we've not yet implemented, so perhaps you can help us better understand your goals and based on that determine which of the planned designs would suit your needs.

(Also, I've moved this issue into the Zed repository. Since you opened it in the Zui repository I assume you're working in the Zui app. But any features here would need to exist at the Zed layer before they're available in the app. In the interest of simplicity I'm going to start with examples using zq at the command line, so hopefully you're familiar with that.)

First I'd like to focus on just this statement:

I have a problem to search only in ip columns, because in all pool the name of the desired column is different and not just ip

When you speak of "ip columns", are you referring specifically to columns that hold values of Zed's ip type as described in the table at https://zed.brimdata.io/docs/formats/zed#1-primitive-types? If so, this concept of "type-specific searches" is something that's been on our to-do list for a while. For instance, imagine this input data:

$ cat testdata.zson 
{ActualIP: 192.168.0.1}
{LogMessage: "The request came from IP address 192.168.0.1"}

As described here, currently a non-string search for an IP address will find both ip-type values that match exactly as well as where it appears as a string inside of string values. Therefore this search returns both:

$ zq '192.168.0.1' testdata.zson 
{ActualIP:192.168.0.1}
{LogMessage:"The request came from IP address 192.168.0.1"}

Type-specific searches are possible if the field name is known.

$ zq 'ActualIP==192.168.0.1' testdata.zson 
{ActualIP:192.168.0.1}

However, this returns nothing because "192.168.0.1" is a string value and therefore is not a match against the ip-type value in the field called ActualIP.

$ zq 'ActualIP=="192.168.0.1"' testdata.zson
[no output]
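
To make the distinction between the three searches above concrete, here is a rough Python model of the two behaviors. It is only an illustration and not how Zed is implemented: the helper names `keyword_search` and `field_equals` are made up for this sketch, and Python's `ipaddress.IPv4Address` stands in for Zed's ip type.

```python
from ipaddress import IPv4Address, ip_address

# Stand-in for testdata.zson: one ip-typed value, one string that mentions the address.
records = [
    {"ActualIP": IPv4Address("192.168.0.1")},
    {"LogMessage": "The request came from IP address 192.168.0.1"},
]

def keyword_search(recs, term):
    """Model of a bare search term: match ip-typed values exactly, or substrings of strings."""
    ip = ip_address(term)
    hits = []
    for rec in recs:
        for value in rec.values():
            if value == ip or (isinstance(value, str) and term in value):
                hits.append(rec)
                break
    return hits

def field_equals(recs, field, value):
    """Model of `field==value`: the comparison is type-aware, so a string never equals an ip."""
    return [rec for rec in recs if field in rec and rec[field] == value]

print(len(keyword_search(records, "192.168.0.1")))                         # 2
print(len(field_equals(records, "ActualIP", IPv4Address("192.168.0.1"))))  # 1
print(len(field_equals(records, "ActualIP", "192.168.0.1")))               # 0
```

The last call returns nothing for the same reason the quoted `"192.168.0.1"` search does: the value being compared is a string, while the field holds an ip.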

As mentioned above, we've known for some time that we're missing this concept of "type-specific searches" that would work across fields of any name. #1428 is the open issue, though it's fairly old so some of the examples in it may seem confusing. But when we implement that, it should ultimately allow you to do something like from * | 192.168.0.1 and return everything that contains the ip-type value 192.168.0.1 in it. Let me know if this is maybe what you had in mind.
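
As a way to picture what such a type-specific search would return, here is a small Python sketch. It is not Zed and does not reflect any design discussed in #1428; `typed_search` is a hypothetical helper, the third record is invented for illustration, and `ipaddress.IPv4Address` stands in for the ip type.

```python
from ipaddress import IPv4Address

records = [
    {"ActualIP": IPv4Address("192.168.0.1")},
    {"LogMessage": "The request came from IP address 192.168.0.1"},
    {"iplogin": IPv4Address("192.168.0.1"), "note": "different field name, same type"},
]

def typed_search(recs, wanted):
    """Return records where any field, whatever its name, holds an equal value of the same type."""
    return [
        rec for rec in recs
        if any(type(v) is type(wanted) and v == wanted for v in rec.values())
    ]

matches = typed_search(records, IPv4Address("192.168.0.1"))
print([sorted(rec) for rec in matches])  # [['ActualIP'], ['iplogin', 'note']]
```

The record that only mentions the address inside a string is skipped, which is the behavior a type-specific search is meant to provide.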

You then went on to suggest a specific enhancement:

Is it possible to define a virtual pool? And it is possible to change the names of the columns and add the list of columns in a virtual pool

When you described this, what came to mind is the concept of "views" in SQL such as what are described here. Is that an idea you're familiar with?

When you said you want to "search the desired column and not all the data", do you mean that you want the search to return only the matching values from the named columns? If so, here's an example of something you could do today. First I'll prep some test data with the field names you described and load them into pools.

$ cat pool1.zson 
{iplogin: 192.168.0.1}
{nonipfield1: "hello1"}

$ cat pool2.zson 
{Fristip: 192.168.0.2}
{nonipfield2: "hello2"}

$ cat pool3.zson 
{lostip: 192.168.0.3}
{nonipfield3: "hello3"}

$ cat pool4.zson 
{ipcountry: 192.168.0.4}
{nonipfield4: "hello4"}
{ipfield5: 192.168.0.5}

$ cat load.sh 
#!/bin/sh
zed create -use pool1
zed load pool1.zson
zed create -use pool2
zed load pool2.zson
zed create -use pool3
zed load pool3.zson
zed create -use pool4
zed load pool4.zson

$ sh load.sh 
pool created: pool1 2PZpSNDMEeRAGMsGmouAoH2C8Uu
(11/1) 47B/47B 47B/s 100.00%
2PZpSPfvkLaNKw1GJSkM9tArDUZ committed
pool created: pool2 2PZpSJafavCG1B4OEoTmXRbGjR6
(11/1) 47B/47B 47B/s 100.00%
2PZpSJANMZknsccXt3WxOHjy4an committed
pool created: pool3 2PZpSNZLhfZQAMi9MNmtgodaBoA
(11/1) 46B/46B 46B/s 100.00%
2PZpSIdQvW6AR5ZE77lBF8YVVDZ committed
pool created: pool4 2PZpSOyL9Ye36m8cSU3P2g669YP
(12/1) 73B/73B 73B/s 100.00%
2PZpSPD98SWKvbHs9OIRH8IENqs committed

Now here's some Zed that would isolate the named fields in each respective pool and merge them all into a single output.

$ cat PoolAll.zed 
from (
  pool pool1 => has(iplogin) | cut iplogin
  pool pool2 => has(Fristip) | cut Fristip
  pool pool3 => has(lostip) | cut lostip
  pool pool4 => fork (
    => has(ipcountry) | cut ipcountry
    => has(ipfield5) | cut ipfield5
  )
)

$ zed query -I PoolAll.zed
{ipfield5:192.168.0.5}
{ipcountry:192.168.0.4}
{iplogin:192.168.0.1}
{lostip:192.168.0.3}
{Fristip:192.168.0.2}

So now to search for a given IP, you could extend the pipeline with a search term:

$ zed query -I PoolAll.zed '| 192.168.0.3'
{lostip:192.168.0.3}

Note that I added the additional field ipfield5 to the last pool to show how to use fork when you've got multiple named fields you want to isolate.

Like before, these are not type-specific searches, but since it sounds like maybe you want to isolate based on the names of fields that you believe contain IP addresses you want to search against, I assume this is ok.
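
For readers who find it easier to follow in a general-purpose language, the logic of PoolAll.zed can be approximated in Python as below. This is only a sketch under several assumptions: pools are modeled as in-memory lists of dicts, IP values are kept as plain strings, the final search is simplified to an exact match, and `POOL_FIELDS` and `pool_all` are hypothetical names. The output order is not meant to match what `zed query` prints.

```python
# In-memory stand-ins for the four pools loaded by load.sh.
pools = {
    "pool1": [{"iplogin": "192.168.0.1"}, {"nonipfield1": "hello1"}],
    "pool2": [{"Fristip": "192.168.0.2"}, {"nonipfield2": "hello2"}],
    "pool3": [{"lostip": "192.168.0.3"}, {"nonipfield3": "hello3"}],
    "pool4": [
        {"ipcountry": "192.168.0.4"},
        {"nonipfield4": "hello4"},
        {"ipfield5": "192.168.0.5"},
    ],
}

# Which fields to isolate per pool; pool4 lists two, like the fork in PoolAll.zed.
POOL_FIELDS = {
    "pool1": ["iplogin"],
    "pool2": ["Fristip"],
    "pool3": ["lostip"],
    "pool4": ["ipcountry", "ipfield5"],
}

def pool_all(pools, pool_fields):
    """Yield one-field records, mimicking `has(f) | cut f` for each listed field."""
    for name, fields in pool_fields.items():
        for rec in pools[name]:
            for field in fields:
                if field in rec:
                    yield {field: rec[field]}

merged = list(pool_all(pools, POOL_FIELDS))
print(len(merged))  # 5

# Rough equivalent of appending '| 192.168.0.3' to the query.
found = [rec for rec in merged if "192.168.0.3" in rec.values()]
print(found)  # [{'lostip': '192.168.0.3'}]
```

Adding another field to search then means adding one entry to the mapping, much like adding one more branch to the Zed.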

Since I know you're using Zui, I've attached a video that shows how you could use the same Zed inside the app. By making it into a saved query, it becomes easier to call it up when you need it and also make changes as you think of more fields.

https://github.com/brimdata/zed/assets/5934157/8423d3d3-2127-443a-a9e5-31326fcd181f

In conclusion, this doesn't yet give you a handle like allip you could reference with from like a virtual pool, but it does provide a building block upon which you can perform a single query and have it target all those named fields. And referencing such building blocks should become easier once we finish another enhancement #4152 that's in progress.
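
To illustrate what a handle like allip could mean conceptually, the Python sketch below maps each pool-specific field name onto one common name and produces records lazily, so nothing is stored. This is purely a thought experiment: it is not an existing or planned Zed feature, and `FIELD_ALIASES`, `virtual_pool`, and the added `pool` field are hypothetical.

```python
# Small in-memory stand-ins for two of the pools, with IP values kept as strings.
pools = {
    "pool1": [{"iplogin": "192.168.0.1"}, {"nonipfield1": "hello1"}],
    "pool3": [{"lostip": "192.168.0.3"}, {"nonipfield3": "hello3"}],
}

# Hypothetical view definition: per pool, source field name -> common field name.
FIELD_ALIASES = {
    "pool1": {"iplogin": "allip"},
    "pool3": {"lostip": "allip"},
}

def virtual_pool(pools, aliases):
    """Lazily yield records with aliased fields renamed; records without them are skipped."""
    for name, mapping in aliases.items():
        for rec in pools[name]:
            renamed = {mapping[k]: v for k, v in rec.items() if k in mapping}
            if renamed:
                # Keep the originating pool so a match can be traced back.
                yield {"pool": name, **renamed}

hits = [rec for rec in virtual_pool(pools, FIELD_ALIASES) if rec["allip"] == "192.168.0.3"]
print(hits)  # [{'pool': 'pool3', 'allip': '192.168.0.3'}]
```

The search itself only ever mentions allip; the per-pool names live in the mapping alone.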

Could you let me know if any of the examples I've shown above cover what you're trying to achieve? If not, could you let me know where they differ from your goals?

philrz commented 11 months ago

A community user @SoftTools59654 recently asked about similar functionality in a new issue, https://github.com/brimdata/zui/issues/2889, opened in the Zui repo. I'm copying their original inquiry below to consolidate the discussion. I'll encourage them to read what's above to see how close it gets to covering their immediate needs, and we'll take their feedback into account when deciding when we might implement something more sophisticated tailored to this specific use case.


Creating a virtual pool

I have a problem checking the logs of different software

With the increase in the number of logs due to the different title of the logs as shown in the image below

I encounter problems in some searches. I forget that the titles of some fields are different, because the number of pools also increases.

Is it possible to create a main virtual pool and then define some main field titles? During data import, the user would then have the possibility to choose the connection between each imported field and the main virtual pool, so that the search is applied only to the main pool rather than using multiple field titles for multiple pools.

Connecting different pools to a specific pool for easy searching

SoftTools59654 commented 11 months ago

Thanks for the complete explanation

It's really great that you put the explanations with the video (I saw in other explanations that you put the problems or possibilities with the video, it's really great)

I needed such a feature. The only difference that I wanted was a graphical interface, which makes the work easier, but the output result is the same as what I wanted.

This is also a good option. But the graphical interface reduces the possibility of mistakes and is faster

I think searching in this style would be more efficient and take less time.

Thank you for your hard work on this tool

philrz commented 11 months ago

@SoftTools59654: Thanks for the feedback and confirming that the Zed example shown in the video could act as a functional stopgap measure for you. Indeed, I agree that a graphic interface could be useful for this use case, both in terms of configuring the field relationships and also presenting such a virtual pool as something to be queried directly. Based on your comment I can see you're sympathetic to the fact that the core Dev team currently has other priorities so building such a GUI is unlikely to happen in the near future, but I'll hold this issue open to continue to accumulate interest from the community and as a reminder for us to take it up when we can. In the meantime please do speak up if you need help improving the stopgap Zed approach.