amazon-ion / ion-cli

Apache License 2.0
31 stars 15 forks source link

Add subcommands to support data stream analysis. #104

Closed linlin-s closed 1 month ago

linlin-s commented 4 months ago

Issue #, if available:

https://github.com/amazon-ion/ion-cli/issues/42

Description of changes:

This PR adds options to analyze the input binary Ion data stream, allowing users to:

Additionally, the existing count option has been reorganized under the analyze command. The usage pattern has been updated from: ion beta count [File1, File2 ...] to ion beta analyze count [File1, File2 ...].

Here is the overview of the command structure before and after the change. Before

ion
└─ beta
    └─ count (The number of total top-level values)

After

ion
└─ beta
    └─ analyze
          └─ count (The number of total top-level values)
          └─ size (Return min/max/mean size and the size distribution of the top-level values)
          └─ depth (The maximum depth of input ion stream)

For symbol table analysis, we added two options, here is the command structure before and after this change. Before:

ion
└─ beta
    └─ symtab
          └─ filter (Filters user data out of a binary Ion stream, leaving only the symbol table(s) behind.)

After:

ion
└─ beta
    └─ symtab
          └─ filter (Filters user data out of a binary Ion stream, leaving only the symbol table(s) behind.)
          └─ count (The number of local symbol tables of the input ion stream)
          └─ symbol_count (The number of symbols of the input ion stream)

Testing:

Examples: Request the maximum depth of input ion stream > ion beta analyze depth test.10n Output: > The maximum depth is 4

Request the size information of the top level values > ion beta analyze size test.10n Output:

image

Request the number of local symbol table of input ion steam > ion beta symtab count test.10n Output: > The number of symbol tables is 387

Request the number of symbols of the input ion stream > ion beta symtab symbol_count test.10n Output: > The number of symbols is 67808444

These examples above show what you should expect when you execute these new options. All the unit tests for these functionalities are added into the tests/cli.rs file.

Note:


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

linlin-s commented 4 months ago

I have a question about API design. Would someone ever want to get depth, count, etc. values at the same time? Do you think there's any use case where you absolutely must have only one of these values?

My gut reaction is that it would make sense to have a single ion beta analyze command that combines analyze count, analyze size, analyze depth, symtab count and symtab symbol_count into one command.

Thanks for the suggestion. I agree with the idea of putting all these analytical options in one subcommand. The count information is already included in the returned value of the size subcommand, so there's no need to keep the count. I'm wondering if keeping all the symbol table related subcommands under symtab could be an option as well? e.g., ion beta symtab analyze will return the number of local symbol tables and the number of symbols.

linlin-s commented 4 months ago

These new features look really useful!

I agree with Matt's comment that it would be nice if we had the option to get all the data at once (maybe something like ion beta analyze all). Additionally, having an option to receive the output in a structured format (Ion!) may be useful, e.g. {value_count: 59155, symbol_table_count: 386, ...}.

Other stats that may be useful:

* How many of the symbol tables in the stream are "brand new" vs "append"?

* Stats / histogram (like you have for value sizes) for number of symbols introduced by the symbol tables. E.g. what is the average number of new symbols introduced, and how does that change throughout the stream?

* Ability to show (i.e., dump) the largest, smallest, or average-size value in the stream.

I mention these not because I expect you to implement them now, but just in case thinking of expansion possibilities influences how you choose to structure the commands.

Thanks for providing these new ideas. This would be a great reference for the further expansion.

jobarr-amzn commented 4 months ago

For what it's worth I would prefer stat or stats as the command here, it matches another CLI which people are likely familiar with (git).

I think we should have the same "porcelain" vs. "plumbing" distinction in ion commands (see Git Internals Plumbing and Porcelain), rather than a beta subtree where we push everything because we regard that which is not beta as "plumbing".

linlin-s commented 3 months ago

For the API design, I agree with combining analyze count, analyze size, analyze depth, symtab count, and symtab symbol_count into one command. All the information about top-level data size and the symbol table can be retrieved from SystemReader. However, to get the maximum depth, we need to explore elements from the user reader recursively(current implementation). We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth. In this case, combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

LazySystemReader could be an option to achieve this, and we can explore LazyValue to avoid step_in/step_out operations when getting the maximum depth of the nested input data stream. However, there are no public methods in lazy_system_reader to get information about value length under the current version of ion-rust. This feature might be supported in the future, but for now, here is the proposed solution:

Combine all of the subcommands into one except depth.

> ion beta stats 

Samples = 59155; Min = 21; Max = 1519
Average = 374; Variance = 195565.635; STD = 442.228
Each ∎ represents a count of 426

[  21 ..  396] [38823] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[ 396 ..  770] [ 1137] ∎∎
[ 770 .. 1144] [17845] ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
[1144 .. 1519] [ 1350] ∎∎∎
The number of symbols is 67808444 
The number of local symbol tables is 387 
> ion beta depth 
The maximum depth is 4

Merge ion beta depth to ion beta stats when LazyValue supports accessing the underlying stream data.

popematt commented 3 months ago

We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth.

Using Element also has the same number of step in/out—it's just a higher level of abstraction that hides the reader from you.

combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the beta namespace.

Samples = 59155; Min = 21; Max = 1519 Average = 374; Variance = 195565.635; STD = 442.228 Each ∎ represents a count of 426

Can we indicate the units for some of these?

linlin-s commented 3 months ago

We can also explore readers on different levels, but it will involve a lot of step_in and step_out, which is not efficient for getting the depth.

Using Element also has the same number of step in/out—it's just a higher level of abstraction that hides the reader from you.

combining depth with the rest of the subcommands would require using two different readers to retrieve all information at once, which could be expensive when the data size is significantly large.

How expensive? If its already 5 seconds for a 20 MB file, and combining the two of them makes it take a few more seconds, I think that's fine. We can always improve the efficiency of it later, and this is something that (a) should not be in a performance critical path, and (b) is still in the beta namespace.

Samples = 59155; Min = 21; Max = 1519 Average = 374; Variance = 195565.635; STD = 442.228 Each ∎ represents a count of 426

Can we indicate the units for some of these?

If performance is not the concern here I can make them all in one command. We can always optimize it when the LazySystemReader supports accessing underlying stream data. I will add comment on the final report to indicates the units. plot::Histogramitself doesn't support adding units for the data.

linlin-s commented 1 month ago

This PR has been updated in #134