European-XFEL / h5glance

Explore HDF5 files in terminal & HTML views
BSD 3-Clause "New" or "Revised" License

Add option to fold groups #33

Open tdegeus opened 1 year ago

tdegeus commented 1 year ago

I have a file with the following structure:

/data/u/1
/data/u/2
/data/u/3
/data/u/...
/a
/b

The output of h5glance is therefore rather long and difficult to read. It would be good to have an option --fold, such that:

$ h5glance --fold "/data/u" foo.h5
foo.h5
├a  [uint64: 238] (1 attributes)
├b [enum (FALSE, TRUE): 238] (1 attributes)
├data
│ ├u {folded: 12345 datasets}

takluyver commented 1 year ago

We've got a --depth option - obviously that's not quite as flexible as what you're suggesting, but it would work for the example you've described (which may obviously be a bit different from your real files).

$ h5glance example.h5 --depth 2
example.h5
└group1
  ├subgroup1    (2 children)
  └subgroup2    (1 children)

tdegeus commented 1 year ago

Thanks. I noticed the depth. Indeed the proposed --fold is only partly overlapping. Due to the way I designed my data I do use --fold a lot in my own tool. Would you be open to supporting it?

takluyver commented 1 year ago

I'm open to doing something for that scenario, but I'd like to think about what's the best approach, and especially if there's a way it can be automatically useful without needing an option. Are you able to share one of the files you're talking about, or just the real structure of it (I assume your datasets aren't really called a and b 🙂)?

One idea would be to elide the list, especially if all the datasets are similar in terms of dtype and shape:

└data
  └u
    ├1 [int64; 200 x 200]
    ├2 [int64; 200 x 200]
    ├3 [int64; 200 x 200]
    └and 4567 similar datasets
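
A minimal sketch of that elision idea, independent of h5glance's actual code: `elide_similar`, its `keep` parameter, and the `(name, dtype, shape)` tuples are all hypothetical names chosen for illustration. It collapses a run of entries sharing dtype and shape once the run is longer than `keep + 1`.

```python
from itertools import groupby


def elide_similar(entries, keep=3):
    """Collapse runs of entries that share dtype and shape.

    entries: list of (name, dtype, shape) tuples, already in display order.
    Returns a list of display strings (hypothetical format).
    """
    lines = []
    for (dtype, shape), run in groupby(entries, key=lambda e: (e[1], e[2])):
        run = list(run)
        dims = " x ".join(map(str, shape)) or "scalar"
        detail = f"[{dtype}: {dims}]"
        # Short runs are printed in full; eliding a single entry saves nothing.
        if len(run) <= keep + 1:
            shown, hidden = run, 0
        else:
            shown, hidden = run[:keep], len(run) - keep
        lines.extend(f"{name}\t{detail}" for name, _, _ in shown)
        if hidden:
            lines.append(f"and {hidden} similar datasets")
    return lines
```

With 4570 datasets of identical dtype and shape this yields three entries followed by "and 4567 similar datasets", as in the mock-up above.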

We could maybe even recognise sequences of consecutive numbers and treat them specially:

└data
  └u
    ├1 [int64; 200 x 200]
    ├...
    └4570 [int64; 200 x 200]

Although that raises further questions, e.g. do you also recognise A-1, A-2 ...? What about sequences that are regular but not consecutive, e.g. 10, 20, 30...?
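
One possible answer to both questions, as a sketch only: treat names as "shared prefix + trailing integer" and accept any constant positive step. The function name `regular_sequence` and the minimum length of 3 are assumptions made here, not anything h5glance does.

```python
import re

# Lazy prefix followed by a trailing run of ASCII digits, e.g. "A-" + "12".
_TRAILING_INT = re.compile(r"^(.*?)([0-9]+)$")


def regular_sequence(names):
    """Return (prefix, start, stop, step) if names form a regular integer
    sequence with one shared prefix, else None."""
    parsed = []
    for name in names:
        m = _TRAILING_INT.match(name)
        if m is None:
            return None
        parsed.append((m.group(1), int(m.group(2))))
    prefixes = {p for p, _ in parsed}
    if len(prefixes) != 1 or len(parsed) < 3:
        return None
    numbers = sorted(n for _, n in parsed)
    step = numbers[1] - numbers[0]
    # A zero step means duplicates such as "1" and "01"; reject those.
    if step <= 0:
        return None
    if any(b - a != step for a, b in zip(numbers, numbers[1:])):
        return None
    return prefixes.pop(), numbers[0], numbers[-1], step
```

Under these assumptions `["A-1", "A-2", "A-3"]` and `["10", "20", "30"]` are both recognised, while `["1", "2", "4"]` is not.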

Yet another option is to say that with the automatic paging, long output doesn't matter if the bits you're interested in are all towards the top, so maybe sorting entries with fewer children first helps. I'm not so sold on this one, just throwing it in to see if it prompts more ideas.

tdegeus commented 1 year ago

I find it a very good idea to do this automatically. One could use the following procedure:

  1. Is it a list of numbers? Yes -> 2. No -> stop.
  2. Is it a sequential list?

    • Yes:
      ├1 [int64; 200 x 200]
      ├...
      └4570 [int64; 200 x 200]
    • No:
      ├1 [int64; 200 x 200]
      ├2 [int64; 200 x 200]
      ├3 [int64; 200 x 200]
      └and 4567 similar datasets

Then indeed one could discuss how to detect a list of numbers. Zero-padding should probably always be stripped, and one can argue about a common prefix. (I use neither, so I am not strongly opinionated.)
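
The two-step check above could be sketched roughly as follows. This is illustrative only: `summarise_children`, `detail` and `keep` are hypothetical names, all datasets in the group are assumed to share one description string, and prefixes are ignored. Zero-padding is stripped implicitly by comparing names through `int()`.

```python
import re


def summarise_children(names, detail, keep=3):
    """Apply the two-step check to the dataset names of one group.

    names: dataset names; detail: description shared by all of them.
    """
    # Step 1: is it a list of numbers? If not (or if it is short), stop.
    numeric = all(re.fullmatch(r"[0-9]+", n) for n in names)
    if len(names) <= keep + 1 or not numeric:
        return [f"{n}\t{detail}" for n in names]
    ordered = sorted(names, key=int)
    first, last = int(ordered[0]), int(ordered[-1])
    # Step 2: sequential means every integer from first to last occurs once.
    if len(set(map(int, ordered))) == len(ordered) == last - first + 1:
        return [f"{ordered[0]}\t{detail}", "...", f"{ordered[-1]}\t{detail}"]
    shown = [f"{n}\t{detail}" for n in ordered[:keep]]
    return shown + [f"and {len(ordered) - keep} similar datasets"]
```

For names "0" to "173" this returns the first entry, "...", and the last entry; for a numeric list with gaps it falls back to the "and N similar datasets" form.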


Here is one of my files (truncated, but clearly showing why I would like this option):

QuasiStatic/id=0000.h5
├QuasiStatic
│ ├inc  [uint64: 174] (1 attributes)
│ ├kick [enum (FALSE, TRUE): 174] (1 attributes)
│ ├u
│ │ ├0  [float64: 8192]
│ │ ├1  [float64: 8192]
│ │ ├10 [float64: 8192]
│ │ ├100    [float64: 8192]
│ │ ├101    [float64: 8192]
│ │ ├102    [float64: 8192]
│ │ ├103    [float64: 8192]
│ │ ├104    [float64: 8192]
│ │ ├105    [float64: 8192]
│ │ ├106    [float64: 8192]
│ │ ├107    [float64: 8192]
│ │ ├108    [float64: 8192]
│ │ ├109    [float64: 8192]
│ │ ├11 [float64: 8192]
│ │ ├110    [float64: 8192]
│ │ ├111    [float64: 8192]
│ │ ├112    [float64: 8192]
│ │ ├113    [float64: 8192]
│ │ ├114    [float64: 8192]
│ │ ├115    [float64: 8192]
│ │ ├116    [float64: 8192]
│ │ ├117    [float64: 8192]
│ │ ├118    [float64: 8192]
│ │ ├119    [float64: 8192]
│ │ ├12 [float64: 8192]
│ │ ├120    [float64: 8192]
│ │ ├121    [float64: 8192]
│ │ ├122    [float64: 8192]
│ │ ├123    [float64: 8192]
│ │ ├124    [float64: 8192]
│ │ ├125    [float64: 8192]
│ │ ├126    [float64: 8192]
│ │ ├127    [float64: 8192]
│ │ ├128    [float64: 8192]
│ │ ├129    [float64: 8192]
│ │ ├13 [float64: 8192]
│ │ ├130    [float64: 8192]
│ │ ├131    [float64: 8192]
│ │ ├132    [float64: 8192]
│ │ ├133    [float64: 8192]
│ │ ├134    [float64: 8192]
│ │ ├135    [float64: 8192]
│ │ ├136    [float64: 8192]
│ │ ├137    [float64: 8192]
│ │ ├138    [float64: 8192]
│ │ ├139    [float64: 8192]
│ │ ├14 [float64: 8192]
│ │ ├140    [float64: 8192]
│ │ ├141    [float64: 8192]
│ │ ├142    [float64: 8192]
│ │ ├143    [float64: 8192]
│ │ ├144    [float64: 8192]
│ │ ├145    [float64: 8192]
│ │ ├146    [float64: 8192]
│ │ ├147    [float64: 8192]
│ │ ├148    [float64: 8192]
│ │ ├149    [float64: 8192]
│ │ ├15 [float64: 8192]
│ │ ├150    [float64: 8192]
│ │ ├151    [float64: 8192]
│ │ ├152    [float64: 8192]
│ │ ├153    [float64: 8192]
│ │ ├154    [float64: 8192]
│ │ ├155    [float64: 8192]
│ │ ├156    [float64: 8192]
│ │ ├157    [float64: 8192]
│ │ ├158    [float64: 8192]
│ │ ├159    [float64: 8192]
│ │ ├16 [float64: 8192]
│ │ ├160    [float64: 8192]
│ │ ├161    [float64: 8192]
│ │ ├162    [float64: 8192]
│ │ ├163    [float64: 8192]
│ │ ├164    [float64: 8192]
│ │ ├165    [float64: 8192]
│ │ ├166    [float64: 8192]
│ │ ├167    [float64: 8192]
│ │ ├168    [float64: 8192]
│ │ ├169    [float64: 8192]
│ │ ├17 [float64: 8192]
│ │ ├170    [float64: 8192]
│ │ ├171    [float64: 8192]
│ │ ├172    [float64: 8192]
│ │ ├173    [float64: 8192]
│ │ ├18 [float64: 8192]
│ │ ├19 [float64: 8192]
│ │ ├2  [float64: 8192]
│ │ ├20 [float64: 8192]
│ │ ├21 [float64: 8192]
│ │ ├22 [float64: 8192]
│ │ ├23 [float64: 8192]
│ │ ├24 [float64: 8192]
│ │ ├25 [float64: 8192]
│ │ ├26 [float64: 8192]
│ │ ├27 [float64: 8192]
│ │ ├28 [float64: 8192]
│ │ ├29 [float64: 8192]
│ │ ├3  [float64: 8192]
│ │ ├30 [float64: 8192]
│ │ ├31 [float64: 8192]
│ │ ├32 [float64: 8192]
│ │ ├33 [float64: 8192]
│ │ ├34 [float64: 8192]
│ │ ├35 [float64: 8192]
│ │ ├36 [float64: 8192]
│ │ ├37 [float64: 8192]
│ │ ├38 [float64: 8192]
│ │ ├39 [float64: 8192]
│ │ ├4  [float64: 8192]
│ │ ├40 [float64: 8192]
│ │ ├41 [float64: 8192]
│ │ ├42 [float64: 8192]
│ │ ├43 [float64: 8192]
│ │ ├44 [float64: 8192]
│ │ ├45 [float64: 8192]
│ │ ├46 [float64: 8192]
│ │ ├47 [float64: 8192]
│ │ ├48 [float64: 8192]
│ │ ├49 [float64: 8192]
│ │ ├5  [float64: 8192]
│ │ ├50 [float64: 8192]
│ │ ├51 [float64: 8192]
│ │ ├52 [float64: 8192]
│ │ ├53 [float64: 8192]
│ │ ├54 [float64: 8192]
│ │ ├55 [float64: 8192]
│ │ ├56 [float64: 8192]
│ │ ├57 [float64: 8192]
│ │ ├58 [float64: 8192]
│ │ ├59 [float64: 8192]
│ │ ├6  [float64: 8192]
│ │ ├60 [float64: 8192]
│ │ ├61 [float64: 8192]
│ │ ├62 [float64: 8192]
│ │ ├63 [float64: 8192]
│ │ ├64 [float64: 8192]
│ │ ├65 [float64: 8192]
│ │ ├66 [float64: 8192]
│ │ ├67 [float64: 8192]
│ │ ├68 [float64: 8192]
│ │ ├69 [float64: 8192]
│ │ ├7  [float64: 8192]
│ │ ├70 [float64: 8192]
│ │ ├71 [float64: 8192]
│ │ ├72 [float64: 8192]
│ │ ├73 [float64: 8192]
│ │ ├74 [float64: 8192]
│ │ ├75 [float64: 8192]
│ │ ├76 [float64: 8192]
│ │ ├77 [float64: 8192]
│ │ ├78 [float64: 8192]
│ │ ├79 [float64: 8192]
│ │ ├8  [float64: 8192]
│ │ ├80 [float64: 8192]
│ │ ├81 [float64: 8192]
│ │ ├82 [float64: 8192]
│ │ ├83 [float64: 8192]
│ │ ├84 [float64: 8192]
│ │ ├85 [float64: 8192]
│ │ ├86 [float64: 8192]
│ │ ├87 [float64: 8192]
│ │ ├88 [float64: 8192]
│ │ ├89 [float64: 8192]
│ │ ├9  [float64: 8192]
│ │ ├90 [float64: 8192]
│ │ ├91 [float64: 8192]
│ │ ├92 [float64: 8192]
│ │ ├93 [float64: 8192]
│ │ ├94 [float64: 8192]
│ │ ├95 [float64: 8192]
│ │ ├96 [float64: 8192]
│ │ ├97 [float64: 8192]
│ │ ├98 [float64: 8192]
│ │ └99 [float64: 8192]
│ └u_frame  [float64: 174] (1 attributes)
├meta
│ └QuasiStatic_Run (6 attributes)
├param
│ ├data_version [UTF-8 string: scalar]
│ ├dt   [float64: scalar]
│ ├eta  [float64: scalar]
│ ├interactions
│ │ ├k2 [float64: scalar]
│ │ ├k4 [float64: scalar]
│ │ └type   [UTF-8 string: scalar]
│ ├k_frame  [float64: scalar]
│ ├m    [float64: scalar]
│ ├mu   [float64: scalar]
│ ├normalisation
│ │ └u  [int64: scalar]
│ ├potentials
│ │ ├du [float64: scalar]
│ │ ├type   [UTF-8 string: scalar]
│ │ ├weibull
│ │ │ ├k    [int64: scalar]
│ │ │ ├mean [int64: scalar]
│ │ │ └offset   [float64: scalar]
│ │ └xoffset    [int64: scalar]
│ └shape    [int64: 1]
└realisation
  └seed [int64: scalar]

tdegeus commented 1 year ago

Another point is that the alignment between the types of datasets at the same level could be improved. I guess that my 'complaint' in https://github.com/European-XFEL/h5glance/issues/35 would partly disappear if the alignment of the types were improved.

takluyver commented 1 year ago

Thanks - that example makes it clear that another part of the puzzle could be 'natural' sorting, where you detect numbers and sort them as numbers rather than text. The natsort package does this, although if we want it I'd be tempted to implement what we need in h5glance rather than adding a dependency (assuming a simple version is enough).
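
For reference, a simple version of natural sorting needs only the standard library. The sketch below is not taken from natsort or h5glance; `natural_key` is a hypothetical helper that splits a name into text and digit runs and compares the digit runs as integers.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as integers and the rest as text."""
    parts = re.split(r"([0-9]+)", name)
    # re.split with a capture group alternates text, digits, text, ...
    # so odd positions are always the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]


names = ["0", "1", "10", "100", "11", "2", "u_frame"]
ordered = sorted(names, key=natural_key)
```

Because text always sits at even positions and integers at odd ones, two keys never compare a string with an integer.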

That would also help with alignment (11 wouldn't be between 109 and 110, so the position would shift less). The implementation is really simple at present - it just puts a tab character before the [, which is nice because we can format one line at a time without having to figure out how later ones will look.

The obvious smarter thing to do - finding the maximum width in a group and aligning the details based on that - is also not great if, e.g., most names are 4 characters but a handful are 60 characters, because it's hard to line up names on the left with details on the right. Of course, you can get smarter about trying to balance that, but by that point the \t kludge is starting to sound pretty good.

tdegeus commented 1 year ago

> Thanks - that example makes it clear that another part of the puzzle could be 'natural' sorting, where you detect numbers and sort them as numbers rather than text. The natsort package does this, although if we want it I'd be tempted to implement what we need in h5glance rather than adding a dependency (assuming a simple version is enough).

Agreed, presenting numbered datasets sorted in their number order would make a lot of sense!!

> That would also help with alignment (11 wouldn't be between 109 and 110, so the position would shift less). The implementation is really simple at present - it just puts a tab character before the [, which is nice because we can format one line at a time without having to figure out how later ones will look.

Indeed.

> The obvious smarter thing to do - finding the maximum width in a group and aligning the details based on that - is also not great if, e.g., most names are 4 characters but a handful are 60 characters, because it's hard to line up names on the left with details on the right. Of course, you can get smarter about trying to balance that, but by that point the \t kludge is starting to sound pretty good.

Well yeah, this is a subtle point requiring a fiddle parameter. Yet I do think it would be easy to decide on: alignment should only be done when it requires adding on the order of 1-3 'tabs'. This will not always be satisfactory of course, but even allowing just 1 would already pick the very low-hanging fruit.
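
A rough sketch of such a capped alignment, with the fiddle parameter made explicit. `align_details`, `tab` and `max_tabs` are hypothetical names for illustration, not h5glance options, and the sketch needs the whole group up front rather than formatting one line at a time.

```python
def align_details(rows, tab=4, max_tabs=3):
    """Pad names so details line up, unless that needs too much padding.

    rows: list of (name, detail) pairs from one group.
    """
    if not rows:
        return []
    widths = [len(name) for name, _ in rows]
    # Column where details start: the next tab stop after the longest name.
    col = (max(widths) // tab + 1) * tab
    # Fall back to one tab per line when the shortest name would need
    # more than max_tabs tab-widths of padding.
    if col - min(widths) > max_tabs * tab:
        return [f"{name}\t{detail}" for name, detail in rows]
    return [f"{name.ljust(col)}{detail}" for name, detail in rows]


out = align_details([("du", "[float64: scalar]"),
                     ("type", "[UTF-8 string: scalar]")])
```

A group mixing 4-character and 60-character names exceeds the cap and keeps the current one-tab layout.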

A note: Already in the above file you do see some examples of where alignment would help to make the overall structure more readable. E.g.

│ │ ├du [float64: scalar]
│ │ ├type   [UTF-8 string: scalar]

which just clutters a bit. Or

│ ├interactions
│ │ ├k2 [float64: scalar]
│ │ ├k4 [float64: scalar]
│ │ └type   [UTF-8 string: scalar]
│ ├k_frame  [float64: scalar]

which at first glance could suggest that type is grouped with k_frame -- which would be less likely with some alignment.

tdegeus commented 1 year ago

I started looking at the code. Folding is relatively simple here:

https://github.com/European-XFEL/h5glance/blob/7d183f056071fd8ecf63006fa92391ced78feaaf/h5glance/terminal.py#L161-L162

What is holding me back a bit is that it could be somewhat clumsy (and overly costly) to have to re-interpret the string again. At the same time, it could also allow for some rudimentary alignment.

But maybe you had something else in mind @takluyver ?

tdegeus commented 1 year ago

@takluyver Could you give me some pointers?