Does the data block need to be binary?

LibrEars commented 2 years ago

Hi all,

I would like to save my astropy QTables in the ASDF format so that the meta dictionary and units are stored as well. The only thing holding me back is compatibility with other users who might not want to use Python to evaluate the data.

Would it be possible to store the data (relatively small) after the yaml header in a human-readable format like ecsv does? Or is there a hard reason why it needs to be a binary block?

PS: I found the ASDF option to save inline arrays inside the yaml, but I think it is not accessible via QTable.write(), and it seems neither very human-readable nor easy to extract with GUI software.

PS2: When reading ecsv with QTable.read(), the meta dictionary is not imported as a dictionary, and asdf seems more future-proof.

WilliamJamieson commented 2 years ago

@LibrEars,

Thanks for the feedback. Unfortunately, there is no elegant way to save arrays inside yaml, and I would strongly encourage you not to save arrays that are "too big" this way, as it can make reading the asdf file quite slow. However, the inline arrays have been designed so that they can be easily parsed by any yaml parser.

As for your main request, could you provide a minimal code example that produces a small (just a few rows) QTable of the kind you want to save (random data is fine)?

perrygreenfield commented 2 years ago

Could you also clarify what you mean by human readable? Just because a block can contain binary doesn't preclude the contents being simple text in principle (though the current python interface is very numpy-oriented). If your intent is that people can edit this block with a text editor, then yes, there are some binary words before the actual contents that may complicate that. On the other hand, are you asking for a way to write this human-readable content to, and read it from, the binary block? Along those lines, can you show how you would like this to be used in code (writing and reading)?
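To make the "binary words before the actual contents" concrete, here is a minimal standard-library sketch of a block header. It assumes the layout described in the ASDF standard (a 4-byte magic token, a 2-byte header size, then flags, compression code, three sizes and an MD5 checksum, all big-endian); `pack_block` and `unpack_block` are hypothetical helper names, not part of the asdf library, and the three sample values are made up.

```python
import hashlib
import struct

BLOCK_MAGIC = b"\xd3BLK"
# flags (uint32), compression (4 bytes), allocated/used/data size (uint64 each),
# MD5 checksum (16 bytes); big-endian, as assumed from the ASDF standard.
HEADER_FORMAT = ">I4sQQQ16s"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 48 bytes


def pack_block(payload: bytes) -> bytes:
    """Wrap raw bytes in an uncompressed block (hypothetical helper)."""
    size = len(payload)
    header = struct.pack(
        HEADER_FORMAT, 0, b"\x00" * 4, size, size, size,
        hashlib.md5(payload).digest(),
    )
    return BLOCK_MAGIC + struct.pack(">H", HEADER_SIZE) + header + payload


def unpack_block(raw: bytes) -> bytes:
    """Return the payload of a block written by pack_block."""
    if raw[:4] != BLOCK_MAGIC:
        raise ValueError("missing block magic token")
    (header_size,) = struct.unpack(">H", raw[4:6])
    _flags, _compression, _allocated, used, _data_size, checksum = struct.unpack(
        HEADER_FORMAT, raw[6:6 + HEADER_SIZE]
    )
    start = 6 + header_size
    payload = raw[start:start + used]
    if hashlib.md5(payload).digest() != checksum:
        raise ValueError("checksum mismatch")
    return payload


# Three little-endian float64 values, matching `byteorder: little` in the YAML.
values = (0.0, 1.0, 20.0)
raw_block = pack_block(struct.pack("<3d", *values))
decoded = struct.unpack("<3d", unpack_block(raw_block))
```

Under these assumptions, the first 54 bytes of every block are header, which is why a text editor shows `\D3BLK` followed by mostly zero bytes before any array data, as in the raw dump later in this thread.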

LibrEars commented 2 years ago

Hi all, thank you for the quick replies :). Here is some code as an explanation:

# Example question on human readable asdf: https://github.com/astropy/asdf-astropy/issues/118#issuecomment-1267339629

# %% Import modules
import time

import numpy as np
from astropy.table import QTable
import astropy.units as u

# %% Meta-data of the experiment
meta = {"Experimentalist":"LibrEars",
        "measurement_type": "flux_of_fluxgenerator",
        "nr":42, "pix":7, "voltage":-2,
        "time":time.asctime(time.localtime()),
        "temperature":37}

# %% Store data columns in an astropy QTable (from fluxgenerator measurements)
current = np.linspace(0,20, 20)
flux = np.ones(20)
data = QTable([current, flux], names=["Curren", "Fluxgenerated_flux"], units=[u.A, u.flx])

# attach meta-data to QTable
data.meta = meta

# %% Save 
data.write("Nr{}_fluxgenerator{}".format(42, ".asdf"))

# %% Later load data and meta-data again via python works fine
old_data = QTable.read("Nr42_fluxgenerator.asdf")

#%% Do some fancy matplotlib-plotting...

So my main purpose to use QTable and asdf at the moment is to store the measured data, units and experiment meta-data together to be able to do improved data-handling. This works as expected in python.

Now a non-python user finds the asdf file and would like to load it into any other GUI-based data-plotting program. But opening the file with a text-editor does not display the data block in a human readable way:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
........
........
data: !<tag:astropy.org:astropy/table/table-1.0.0>
  colnames: [Curren, Fluxgenerated_flux]
  columns:
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 A
    value: !core/ndarray-1.0.0
      source: 0
      datatype: float64
      byteorder: little
      shape: [20]
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 flx
    value: !core/ndarray-1.0.0
      source: 1
      datatype: float64
      byteorder: little
      shape: [20]
  meta: {Experimentalist: LibrEars, measurement_type: flux_of_fluxgenerator, nr: 42,
    pix: 7, temperature: 37, time: 'Wed Oct  5 09:59:08 2022', voltage: -2}
  qtable: true
...
\D3BLK\000\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\A0\00\00\00\00\00\00\00\A0\00\00\00\00\00\00\00\A0s\E5\82<5\AC@\DD\D0\C5@s9\B3d7\00\00\00\00\00\00\00\00y
\E55\94\D7\F0?y
\E55\94\D7\00@6\94\D7P^C    @y
\E55\94\D7@\D7P^Cy
@6\94\D7P^C@\94\D7P^Cy@y
\E55\94\D7 @(\AF\A1\BC\86\F2"@\D7P^Cy
%@\86\F2\CAk('@6\94\D7P^C)@\E55\94\D7P^+@\94\D7P^Cy-@Cy
\E55\94/@y
\E55\94\D70@Q^Cy
\E51@(\AF\A1\BC\86\F22@\00\00\00\00\00\004@\D3BLK\000\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\A0\00\00\00\00\00\00\00\A0\00\00\00\00\00\00\00\A0O\D3\F7\C0\ABؔ\91\AD*Ӳ\B2{\A7\EB\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?\00\00\00\00\00\00\F0?#ASDF BLOCK INDEX
%YAML 1.1
---
- 1512
- 1726
...

So they might be confused about how to read the data. If, on the other hand, the file opened as follows, it would be compatible with everything that can read text and would still contain the schema, units and meta-data for advanced usage:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
........
........
  colnames: [Curren, Fluxgenerated_flux]
  columns:
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 A
    value: !core/ndarray-1.0.0
      source: 0
      datatype: float64
      byteorder: little
      shape: [20]
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 flx
    value: !core/ndarray-1.0.0
      source: 1
      datatype: float64
      byteorder: little
      shape: [20]
  meta: {Experimentalist: LibrEars, measurement_type: flux_of_fluxgenerator, nr: 42,
    pix: 7, temperature: 37, time: 'Wed Oct  5 09:59:08 2022', voltage: -2}
  qtable: true
...
Curren Fluxgenerated_flux
0.0 1.0
1.0526315789473684 1.0
2.1052631578947367 1.0
3.1578947368421053 1.0
4.2105263157894735 1.0
5.263157894736842 1.0
6.315789473684211 1.0
7.368421052631579 1.0
8.421052631578947 1.0
9.473684210526315 1.0
10.526315789473683 1.0
11.578947368421051 1.0
12.631578947368421 1.0
13.68421052631579 1.0
14.736842105263158 1.0
15.789473684210526 1.0
16.842105263157894 1.0
17.894736842105264 1.0
18.94736842105263 1.0
20.0 1.0
#ASDF BLOCK INDEX
%YAML 1.1
---
- 1512
- 1726
...

The use-case would be small measurements. Maybe a 'compressor' saving data in text instead of binary would be a solution (and some statement in the yaml-header about how to read / 'decompress' that block by asdf)?
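As a rough illustration of how little tooling the proposed layout would need, here is a standard-library sketch that pulls the text rows out of such a file. This layout is only the proposal from this comment, not something ASDF defines; `read_text_block` is a hypothetical name, and the sample is shortened to three rows.

```python
# Shortened sample of the *proposed* layout (not a valid ASDF file today).
HYBRID_TEXT = """\
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
---
colnames: [Curren, Fluxgenerated_flux]
...
Curren Fluxgenerated_flux
0.0 1.0
1.0526315789473684 1.0
20.0 1.0
#ASDF BLOCK INDEX
%YAML 1.1
---
- 1512
...
"""


def read_text_block(text):
    """Return (column names, rows) from the text block after the YAML header."""
    lines = text.splitlines()
    # The first "..." line ends the YAML header; the text block follows it.
    start = lines.index("...") + 1
    # The block index comment marks the end of the text block.
    end = lines.index("#ASDF BLOCK INDEX", start)
    names = lines[start].split()
    rows = [tuple(float(cell) for cell in line.split())
            for line in lines[start + 1:end]]
    return names, rows


names, rows = read_text_block(HYBRID_TEXT)
```

A GUI plotting program would only need to skip everything up to the first `...` line to import the same rows.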

perrygreenfield commented 2 years ago

You can supply a keyword argument to the write method as such:

data.write("Nr{}_fluxgenerator{}".format(42, ".asdf"), all_array_storage='inline')

Which will produce this form of the ASDF file:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.11.2.dev15+g6703d8f.d20220729}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.11.2.dev15+g6703d8f.d20220729}
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://astropy.org/astropy/extensions/astropy-1.0.0
    software: !core/software-1.0.0 {name: asdf-astropy, version: 0.2.1}
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf-astropy, version: 0.2.1}
data: !<tag:astropy.org:astropy/table/table-1.0.0>
  colnames: [Curren, Fluxgenerated_flux]
  columns:
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 A
    value: !core/ndarray-1.0.0
      data: [0.0, 1.0526315789473684, 2.1052631578947367, 3.1578947368421053, 4.2105263157894735,
        5.263157894736842, 6.315789473684211, 7.368421052631579, 8.421052631578947,
        9.473684210526315, 10.526315789473683, 11.578947368421051, 12.631578947368421,
        13.68421052631579, 14.736842105263158, 15.789473684210526, 16.842105263157894,
        17.894736842105264, 18.94736842105263, 20.0]
      datatype: float64
      shape: [20]
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 flx
    value: !core/ndarray-1.0.0
      data: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,
        1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
      datatype: float64
      shape: [20]
  meta: {Experimentalist: LibrEars, measurement_type: flux_of_fluxgenerator, nr: 42,
    pix: 7, temperature: 37, time: 'Wed Oct  5 08:20:11 2022', voltage: -2}
  qtable: true
...

Would this suffice for your needs?
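For anyone who needs row-like data from the inline form, the following sketch reads it with a generic YAML parser and writes CSV. It assumes the third-party PyYAML package; `TagIgnoringLoader` and `_construct_plain` are hypothetical names, the loader simply drops the ASDF tags rather than interpreting them, and the sample is a shortened three-row version of the file above.

```python
import csv
import io

import yaml  # PyYAML (third-party), assumed to be installed

INLINE_ASDF = """\
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
data: !<tag:astropy.org:astropy/table/table-1.0.0>
  colnames: [Curren, Fluxgenerated_flux]
  columns:
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 A
    value: !core/ndarray-1.0.0
      data: [0.0, 1.0526315789473684, 20.0]
      datatype: float64
      shape: [3]
  - !unit/quantity-1.1.0
    unit: !unit/unit-1.0.0 flx
    value: !core/ndarray-1.0.0
      data: [1.0, 1.0, 1.0]
      datatype: float64
      shape: [3]
  meta: {Experimentalist: LibrEars, nr: 42}
  qtable: true
...
"""


class TagIgnoringLoader(yaml.SafeLoader):
    """SafeLoader that treats unknown (ASDF) tags as plain YAML nodes."""


def _construct_plain(loader, tag_suffix, node):
    # Build ordinary dicts, lists and strings, ignoring the custom tag.
    if isinstance(node, yaml.MappingNode):
        return loader.construct_mapping(node, deep=True)
    if isinstance(node, yaml.SequenceNode):
        return loader.construct_sequence(node, deep=True)
    return loader.construct_scalar(node)


TagIgnoringLoader.add_multi_constructor("", _construct_plain)

tree = yaml.load(INLINE_ASDF, Loader=TagIgnoringLoader)
table = tree["data"]

# One header cell per column, with the unit appended in brackets.
headers = [f"{name} [{column['unit']}]"
           for name, column in zip(table["colnames"], table["columns"])]
# Transpose the per-column lists into rows.
rows = list(zip(*(column["value"]["data"] for column in table["columns"])))

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(headers)
writer.writerows(rows)
csv_text = buffer.getvalue()
```

The same approach should carry over to YAML libraries in other languages, since an inline-only file contains no binary blocks, though each library has its own way of handling custom tags.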

LibrEars commented 2 years ago

Hi @perrygreenfield,

thank you for your suggestion. The `all_array_storage='inline'` keyword goes in the right direction. I did not find it in the astropy documentation (QTable.write.help("asdf")), so thank you for pointing it out.

For the purpose of compatibility/accessibility of the data, I still feel that an array-like structure outside of the yaml would be more suitable, as most data programs can import row-like data.

perrygreenfield commented 2 years ago

I think something we will be looking at soon is a way to support things other than arrays in binary blocks. But we are currently focussed on chunking support, so it will have to wait until after that (though it may inform some changes we need to make to support chunking with options for other kinds of content). Thanks for your feedback.
