`weldx.asdf.util.write_buffer` produces invalid yaml for empty inline ndarrays

It appears that ASDF internals are being patched to override the writing of inline arrays: https://github.com/BAMWelDX/weldx/blob/c2ee9b1f7cb8df8b2f4723884341e9e20dcd0d50/weldx/asdf/util.py#L142-L143 (introduced in https://github.com/BAMWelDX/weldx/pull/469).

This produces YAML that ASDF is unable to read (like the following, except taken from the buffer generated during test_write_buffer_dummy_inline_arrays):

large_array: !core/ndarray-1.0.0
  data: []
  datatype: float64
  shape: [50]

ASDF is unable to open this due to:

a bug in how empty inline arrays are read: https://github.com/asdf-format/asdf/issues/1538
the array, once loaded (with the above issue fixed), has a shape that does not match the declared shape (ASDF will raise a ValueError as seen here)
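The second point can be illustrated with NumPy alone. This is a minimal sketch, not ASDF's actual loading code: the variable names are made up for illustration, and the values are taken from the YAML snippet above.

```python
import numpy as np

# Values taken from the YAML snippet above; this is NOT ASDF's loading code.
declared_shape = (50,)                   # `shape: [50]`
loaded = np.array([], dtype="float64")   # what `data: []` parses to

# An empty inline array has shape (0,), which cannot satisfy shape [50].
shape_matches = loaded.shape == declared_shape
print(loaded.shape, declared_shape, shape_matches)

# Trying to force the declared shape fails because there are no elements.
try:
    loaded.reshape(declared_shape)
    reshape_failed = False
except ValueError as exc:
    reshape_failed = True
    print(f"ValueError: {exc}")
```

Any reader that validates the stored data against the declared shape runs into the same mismatch.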

Would you be able to help me understand the reasoning behind this patching so that hopefully I can figure out how to accommodate this use of ASDF? To provide some context, I am working on a rather major rewrite of the ASDF block management code to move ndarray support to a new style Converter. This has revealed numerous ASDF issues and some of the fixes are impacting weldx (including revealing this issue). Thanks!

Hi @braingram , thank you for looking into it

The code section you are referring to - replacing inline data with empty arrays - was introduced solely for displaying nicely formatted ASDF/YAML outputs without

cluttering the output with long inline data (see example below)
having blocks (and block references) that can no longer be displayed as UTF-8 (see example below)

you can run the following in IPython (or a jupyter notebook)

import numpy as np
from weldx import WeldxFile

file = WeldxFile()
data = {"data_sets": {"first": np.random.random(100)}}
file["some_data"] = data
file.header(use_widgets=False)

which should give you something like the following

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.15.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
  data_sets:
    first: !core/ndarray-1.0.0
      data: []
      datatype: float64
      shape: [100]

The idea here was to give a nice overview of an ASDF file layout without having to dive into blocks or very long outputs. The output created is only meant to be looked at and definitely not to be read or parsed as valid ASDF.
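The display-only idea can be sketched independently of ASDF. The helper below is hypothetical (`summarize_arrays` is not a weldx or ASDF function, and weldx achieves the effect by patching ASDF internals instead); it only shows how a tree can be reduced to the `data: []` / `datatype` / `shape` summary seen in the output above.

```python
import numpy as np


def summarize_arrays(node):
    """Return a copy of `node` where every ndarray is replaced by a summary.

    Hypothetical helper for illustration only; the result is meant for
    display and is not a valid serialization of the arrays.
    """
    if isinstance(node, np.ndarray):
        return {"data": [], "datatype": str(node.dtype), "shape": list(node.shape)}
    if isinstance(node, dict):
        return {key: summarize_arrays(value) for key, value in node.items()}
    if isinstance(node, (list, tuple)):
        return [summarize_arrays(value) for value in node]
    return node


tree = {"some_data": {"data_sets": {"first": np.random.random(100)}}}
summary = summarize_arrays(tree)
print(summary["some_data"]["data_sets"]["first"])
```

Because the original tree is left untouched, a summary like this never risks being written back to a file by accident.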

Note that this was done for building documentation before the rework of how ASDF files can be displayed nicely, including block info, using the .. asdf:: sphinx directive (which I think was done in https://github.com/asdf-format/asdf/pull/1142/files)

However it is also nice to see if you are working in an IPython environment 😃

Regarding the block manager reworks: Feel free to adapt the block manager system regardless of failures related to this "hack"; we are using it simply for educational and debugging purposes. I am sure there will be other ways to implement this again if needed.

For comparison with the above example: full inline file

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.15.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
  data_sets:
    first: !core/ndarray-1.0.0
      data: [0.5183324061707298, 0.5864153406728267, 0.43129883329448515, 0.2966977592020418,
        0.8527967599717196, 0.23446144037963457, 0.9867198532590175, 0.06633670396891078,
        0.8521992505792656, 0.18231458053099048, 0.12300367951872859, 0.3441692161814621,
        0.3799700189678208, 0.013940636113156657, 0.15687623429607123, 0.40754367817412973,
        0.28729585238378985, 0.5439791537481328, 0.16313199436442805, 0.4732761002950674,
        0.9614629413891884, 0.18801270955047433, 0.9479895695724538, 0.1536923113058285,
        0.07763350723046869, 0.9547515474500247, 0.40575583086458167, 0.2704415444577999,
        0.8079131559671275, 0.8847253547307972, 0.10157186976238397, 0.7015105861330978,
        0.547818421014551, 0.9254624653981415, 0.3404192348403935, 0.881353024095736,
        0.8149270355334084, 0.6166549469105991, 0.15477183335846123, 0.5715250261569093,
        0.6078691478358348, 0.06141479982329068, 0.31847574695092873, 0.4275685731151513,
        0.48950691517656775, 0.643197261528769, 0.5173158893388426, 0.11736796192582533,
        0.43069271688850064, 0.28736650208640435, 0.9168623508825732, 0.5670388920969732,
        0.16125718985260573, 0.525039660922977, 0.6476174446555271, 0.5621777067737439,
        0.42761478703587164, 0.9354381681812435, 0.638297006362459, 0.19433030672474427,
        0.8693370415795185, 0.6921508043902841, 0.35408082903686244, 0.7549882446138901,
        0.9536138354817663, 0.8308795572217292, 0.1700452644969126, 0.6110646056765298,
        0.6091297397177062, 0.8306973093576878, 0.45809384566824796, 0.03964057529888587,
        0.4900629464491799, 0.8502110168764621, 0.8457705024294935, 0.19468205940998007,
        0.7361974986521657, 0.36191957170412603, 0.04762083582742371, 0.14554574036495704,
        0.3088550924709902, 0.6866807948258878, 0.5397687189503869, 0.38912267341435913,
        0.6044646474821497, 0.47002580735808297, 0.9188407991828724, 0.41494039111231307,
        0.8112617882697444, 0.6081429598258545, 0.018017923126878665, 0.9161420392081203,
        0.186062441001709, 0.24418392697834157, 0.09763210129872057, 0.8974307635776495,
        0.6530967122831681, 0.7305082521858869, 0.5194708845648303, 0.5470488523041508]
      datatype: float64
      shape: [100]
...

or including block data

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.15.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
  data_sets:
    first: !core/ndarray-1.0.0
      source: 0
      datatype: float64
      byteorder: little
      shape: [100]
...
ÓBLK 0                             Äü=þºý›¿»¼øKÕr  æÚ_“›â?‡ÄŒÝ?—žÚŸã?OHT£ýÙí?Nàçòc?ß?À~#û,®? DàÙœ?þ@†±‡ß?ñ¡çÆjå?¬\›2î?¸z™ÀêÔ?Æ"Yw!Ü?À…ë`oP«? Êû™7Û?àØrK·?(ÇÎ!CË?ŠN9ÐÓé?Î!aë—‚ã?x‡h‡¾Ø?àZ8µJ‹å?¤êÁ¬/vÂ?·»ÐËW2à?‚yÍKÐ?xÙ-“&Ö?L’cgë?ÐáìÔÏ›¦?à
R¦&JÈ?ŸìsÄí?³FBšæ?y–FÈ¨é?ðšåcÛ¥?4>ÇnÆ?‡H§ö×â?Ê
Ç:áï?)2ÎˆAÿâ?Ä{Ožh+Â? vÞê¶êÛ?€–&jœàq?\bTç[Û?ýˆç%Üæ?bZ
Kó¡Ñ?êNápƒ@â?ìË¢Ì?ÀŽ;,mÉ?–2Wiêä?C’‘é?ÀÂ@{Ñ?‹ÛE×ESæ?"Â}cP´Ú?ˆ
–hÐÇ?†h¢á-ì?0-Ù/þÜ?ñÝ4â£á?^ý˜2Äé?ðÄ
Æª?ú„k)¼=Ò?‰²ˆýcëæ?½3E˜á?£H˜<¿Fä?†TU‹¹äÚ?,”ÏÏ|Ò?¥#½|çâ?HŒ¤Dî?³ZhúBùè?Óç¸Á}ê?.VÃj\õì? "9‘;Ö?;zà4äæ?K'³˜úÒî?Ä×v$¶¢Ó?¦}ûß?"û]€ÈÚ?Œ}²¥·Ç?¡3,Mé?Ÿ¥ /À~í?lf ýÍ°Ò?X\yôÏÅ?TTÈ¤Tªê?÷o=Süë?Õ'•’Ä+à?h}ï7¶à?B_ž^±iÚ?@ÊbaÎAž?:£"Þ“Ü?_—Çë£æ?ã€ ÖSá?„b$"u/Í?L&@¤ié?Ò6@?ä·òÀÏnÍ?Ng¦áöôÖ?ÜŸ?nÐ?f¹ÔHê?üTšXØ?W×Ah±ì?è?'ö•³?¿È
ÅñUï? 'sKæ?Æ?p¢Q¼ø·¢?`  W@¼1?#ASDF BLOCK INDEX
%YAML 1.1
---
- 553
...

Thanks for the explanation.

I will test out the example code you shared. It would be nice to have a nicer looking view of the blocks (or perhaps a way to hide them).

Looking at the code that generates the header: https://github.com/BAMWelDX/weldx/blob/c2ee9b1f7cb8df8b2f4723884341e9e20dcd0d50/weldx/asdf/file.py#L983-L994

What about setting all_array_storage to internal and sending the buff through weldx.asdf.util.get_yaml_header before passing it to the _show methods? This produces something like:

#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 3.0.0.dev327+g18ae7518.d20230504}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension._manifest.ManifestExtension
    extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
    software: !core/software-1.0.0 {name: asdf, version: 3.0.0.dev327+g18ae7518.d20230504}
some_data:
  data_sets:
    first: !core/ndarray-1.0.0
      source: 0
      datatype: float64
      byteorder: little
      shape: [100]
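The header-extraction step can be sketched without weldx. The function below is a hypothetical stand-in, not the implementation of weldx.asdf.util.get_yaml_header; it assumes the YAML part of the buffer ends at the first line consisting only of the `...` end-of-document marker, as in the full-file examples above.

```python
import io


def extract_yaml_header(buffer: io.BytesIO) -> str:
    """Return the text up to and including the YAML end marker (`...`).

    Hypothetical helper for illustration; binary blocks that follow the
    marker are never decoded.
    """
    buffer.seek(0)
    header_lines = []
    for raw_line in buffer:  # BytesIO iterates line by line, split on b"\n"
        header_lines.append(raw_line)
        if raw_line.rstrip(b"\r\n") == b"...":
            break
    return b"".join(header_lines).decode("utf-8")


# Tiny stand-in for a serialized file: YAML header followed by binary bytes.
sample = io.BytesIO(
    b"#ASDF 1.0.0\n%YAML 1.1\n--- !core/asdf-1.1.0\nfoo: 1\n...\n\xd3BLK\x00\x30\xff\xfe"
)
header = extract_yaml_header(sample)
print(header)
```

Note that this only hides the blocks from the output; the arrays still have to be written to the buffer first.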

It looks like weldx.asdf.util.write_buffer also has a dummy_arrays argument but this appears to be unused except for the test_write_buffer_dummy_inline_arrays test.

I think using this mock pattern was just a memory optimization. We are forced (are we?) to run through the serialization process to get the header. To avoid side effects during this phase, we take a deep copy of the current ASDF tree. When we serialize into the buffer we would effectively double the memory requirements, if we just copy the binary data of the ndarray buffers.
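The memory argument can be made concrete with plain NumPy. This is only a sketch under two assumptions: the deep copy duplicates the array data (as `copy.deepcopy` does for plain NumPy arrays), and the raw bytes written to the buffer stand in for an ASDF binary block.

```python
import copy
import io

import numpy as np

tree = {"some_data": {"first": np.zeros(100_000)}}
original = tree["some_data"]["first"]

# Deep copying the tree also copies the array buffer.
tree_copy = copy.deepcopy(tree)
copied = tree_copy["some_data"]["first"]
copy_is_independent = not np.shares_memory(original, copied)

# Writing the array bytes into an in-memory buffer allocates them once more.
buff = io.BytesIO()
buff.write(copied.tobytes())
buffer_bytes = buff.getbuffer().nbytes

print(copy_is_independent, original.nbytes, buffer_bytes)
```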

As this method is never (ever) used to deserialize again, we are just fine. Also the context manager of mock.patch removes this behavior after the block has been left.
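The scoping behavior of the patch can be shown with the standard library alone. In this sketch `ArraySerializer` and `dump` are hypothetical stand-ins and not the ASDF internals that weldx patches; only `unittest.mock.patch.object` is real.

```python
from unittest import mock


class ArraySerializer:
    """Hypothetical stand-in for the code that writes inline array data."""

    def to_inline(self, values):
        return list(values)


def dump(serializer, values):
    # Mimics the YAML layout: inline data plus the declared shape.
    return {"data": serializer.to_inline(values), "shape": [len(values)]}


serializer = ArraySerializer()
values = [1.0, 2.0, 3.0]

# Inside the block the patched method drops the data but the shape is kept.
with mock.patch.object(ArraySerializer, "to_inline", lambda self, values: []):
    patched = dump(serializer, values)

# After the block the original method is restored automatically.
restored = dump(serializer, values)

print(patched)
print(restored)
```

The restoration also happens if an exception is raised inside the `with` block, so the patch cannot leak into later writes.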

Your suggestion to remove the optimization and solely rely on get_yaml_header would bloat the memory again as we need to write all arrays to the buffer.

> It looks like weldx.asdf.util.write_buffer also has a dummy_arrays argument but this appears to be unused except for the test_write_buffer_dummy_inline_arrays test.

This (three-line) code duplication is due to the fact that the function "write_buffer" creates its own ASDF tree. It exists for the sake of completeness.

Thanks for the detailed response!

> I think using this mock pattern was just a memory optimization. We are forced (are we?) to run through the serialization process to get the header. To avoid side effects during this phase, we take a deep copy of the current ASDF tree. When we serialize into the buffer we would effectively double the memory requirements, if we just copy the binary data of the ndarray buffers.

You're correct that generating the header requires serializing the objects in the tree (more on this below).

> As this method is never (ever) used to deserialize again, we are just fine. Also the context manager of mock.patch removes this behavior after the block has been left.

There is one test where the result is deserialized: https://github.com/BAMWelDX/weldx/blob/7b5c0c1fe5b770852caf3aa3d9f0e8aebf2934f8/weldx/tests/asdf_tests/test_asdf_util.py#L168-L171 This test is failing with some ASDF updates (which check the shape when the array is loaded). The test is removed in https://github.com/BAMWelDX/weldx/pull/875, so merging that PR should fix this issue.

> Your suggestion to remove the optimization and solely rely on get_yaml_header would bloat the memory again as we need to write all arrays to the buffer.

That's a good point about the memory usage, which does seem problematic for large arrays. One option that doesn't involve writing arrays to the buffer (as long as they're not inline) would be to use asdf.yamlutil.dump_tree. This will not produce YAML that exactly matches what would be produced by AsdfFile.write_to (the asdf_library, history and likely some other metadata are not updated during dump_tree, and the ASDF and ASDF_STANDARD header comments will not be added).

Extending the above example:

import asdf
import io
import numpy as np
from weldx import WeldxFile

file = WeldxFile()
data = {"data_sets": {"first": np.random.random(100)}}
file["some_data"] = data
buff = io.BytesIO()
af = file._asdf_handle
asdf.yamlutil.dump_tree(af.tree, buff, af)
buff.seek(0)
print(buff.read().decode('ascii'))

Produces:

%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
  name: asdf, version: 2.15.0}
history:
  extensions:
  - !core/extension_metadata-1.0.0
    extension_class: asdf.extension.BuiltinExtension
    software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
  data_sets:
    first: !core/ndarray-1.0.0
      source: 0
      datatype: float64
      byteorder: little
      shape: [100]
...

I should note that asdf.yamlutil.dump_tree isn't currently in the asdf docs. We are planning to clean up the API and clarify what is public (the current working definition is that anything documented is part of the public API). I expect that asdf.yamlutil.dump_tree will be documented (and made public); however, there is a chance we will make some changes to the function signature (such as changing the ctx to a serialization context). If you do end up using it, it shouldn't be an issue to make these changes in a backwards compatible way.

thank you for pointing out asdf.yamlutil.dump_tree, I think this would be a good substitution for our use case 👍 @braingram

BAMWelDX / weldx

`weldx.asdf.util.write_buffer` produces invalid yaml for empty inline ndarrays #874