Open braingram opened 1 year ago
Hi @braingram, thank you for looking into it.
The code section you are referring to - replacing inline data with empty arrays - was introduced solely for displaying nicely formatted ASDF/YAML outputs without
You can run the following in IPython (or a Jupyter notebook):
import numpy as np
from weldx import WeldxFile
file = WeldxFile()
data = {"data_sets": {"first": np.random.random(100)}}
file["some_data"] = data
file.header(use_widgets=False)
This should give you something like the following:
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
name: asdf, version: 2.15.0}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension.BuiltinExtension
software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
data_sets:
first: !core/ndarray-1.0.0
data: []
datatype: float64
shape: [100]
The idea here was to give a nice overview of an asdf-file layout without having to dive into blocks or very long outputs.
The output created is only meant to be looked at and definitely not to be read or parsed as valid ASDF.
Note that this was done for building documentation before the rework of how ASDF files can be displayed nicely - including block info - using the .. asdf:: sphinx directive (which I think was done in https://github.com/asdf-format/asdf/pull/1142/files).
However it is also nice to see if you are working in an IPython environment 😃
Regarding the block manager rework: feel free to adapt the block manager system regardless of failures related to this "hack"; we are using it simply for educational and debugging purposes. I am sure there will be other ways to implement this again if needed.
For comparison with the above example: full inline file
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
name: asdf, version: 2.15.0}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension.BuiltinExtension
software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
data_sets:
first: !core/ndarray-1.0.0
data: [0.5183324061707298, 0.5864153406728267, 0.43129883329448515, 0.2966977592020418,
0.8527967599717196, 0.23446144037963457, 0.9867198532590175, 0.06633670396891078,
0.8521992505792656, 0.18231458053099048, 0.12300367951872859, 0.3441692161814621,
0.3799700189678208, 0.013940636113156657, 0.15687623429607123, 0.40754367817412973,
0.28729585238378985, 0.5439791537481328, 0.16313199436442805, 0.4732761002950674,
0.9614629413891884, 0.18801270955047433, 0.9479895695724538, 0.1536923113058285,
0.07763350723046869, 0.9547515474500247, 0.40575583086458167, 0.2704415444577999,
0.8079131559671275, 0.8847253547307972, 0.10157186976238397, 0.7015105861330978,
0.547818421014551, 0.9254624653981415, 0.3404192348403935, 0.881353024095736,
0.8149270355334084, 0.6166549469105991, 0.15477183335846123, 0.5715250261569093,
0.6078691478358348, 0.06141479982329068, 0.31847574695092873, 0.4275685731151513,
0.48950691517656775, 0.643197261528769, 0.5173158893388426, 0.11736796192582533,
0.43069271688850064, 0.28736650208640435, 0.9168623508825732, 0.5670388920969732,
0.16125718985260573, 0.525039660922977, 0.6476174446555271, 0.5621777067737439,
0.42761478703587164, 0.9354381681812435, 0.638297006362459, 0.19433030672474427,
0.8693370415795185, 0.6921508043902841, 0.35408082903686244, 0.7549882446138901,
0.9536138354817663, 0.8308795572217292, 0.1700452644969126, 0.6110646056765298,
0.6091297397177062, 0.8306973093576878, 0.45809384566824796, 0.03964057529888587,
0.4900629464491799, 0.8502110168764621, 0.8457705024294935, 0.19468205940998007,
0.7361974986521657, 0.36191957170412603, 0.04762083582742371, 0.14554574036495704,
0.3088550924709902, 0.6866807948258878, 0.5397687189503869, 0.38912267341435913,
0.6044646474821497, 0.47002580735808297, 0.9188407991828724, 0.41494039111231307,
0.8112617882697444, 0.6081429598258545, 0.018017923126878665, 0.9161420392081203,
0.186062441001709, 0.24418392697834157, 0.09763210129872057, 0.8974307635776495,
0.6530967122831681, 0.7305082521858869, 0.5194708845648303, 0.5470488523041508]
datatype: float64
shape: [100]
...
or including block data
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
name: asdf, version: 2.15.0}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension.BuiltinExtension
software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
data_sets:
first: !core/ndarray-1.0.0
source: 0
datatype: float64
byteorder: little
shape: [100]
...
ÓBLK [binary block data: 100 float64 values, displayed as unreadable bytes]
#ASDF BLOCK INDEX
%YAML 1.1
---
- 553
...
Thanks for the explanation.
I will test out the example code you shared. It would be nice to have a nicer looking view of the blocks (or perhaps a way to hide them).
Looking at the code that generates the header: https://github.com/BAMWelDX/weldx/blob/c2ee9b1f7cb8df8b2f4723884341e9e20dcd0d50/weldx/asdf/file.py#L983-L994
What about setting all_array_storage to internal and sending the buff through weldx.asdf.util.get_yaml_header before passing it to the _show methods? This produces something like:
#ASDF 1.0.0
#ASDF_STANDARD 1.5.0
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
name: asdf, version: 3.0.0.dev327+g18ae7518.d20230504}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension._manifest.ManifestExtension
extension_uri: asdf://asdf-format.org/core/extensions/core-1.5.0
software: !core/software-1.0.0 {name: asdf, version: 3.0.0.dev327+g18ae7518.d20230504}
some_data:
data_sets:
first: !core/ndarray-1.0.0
source: 0
datatype: float64
byteorder: little
shape: [100]
It looks like weldx.asdf.util.write_buffer also has a dummy_arrays argument but this appears to be unused except for the test_write_buffer_dummy_inline_arrays test.
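To make the suggestion above concrete, here is a minimal sketch of trimming a serialized buffer down to its YAML header by cutting at the YAML end-of-document marker. The function name extract_yaml_header and the fake file contents are assumptions made for illustration; this is not the actual implementation of weldx.asdf.util.get_yaml_header.

```python
import io

YAML_END_MARKER = b"\n...\n"  # YAML end-of-document marker on its own line


def extract_yaml_header(buff: io.BytesIO) -> str:
    """Return the buffer content up to and including the '...' marker.

    Hypothetical stand-in for a helper like get_yaml_header (not weldx code).
    """
    raw = buff.getvalue()
    end = raw.find(YAML_END_MARKER)
    if end == -1:
        raise ValueError("no YAML end-of-document marker found")
    header = raw[: end + len(YAML_END_MARKER)]
    return header.decode("utf-8")


# A tiny fake file: a YAML header followed by bytes standing in for a block.
fake_file = io.BytesIO(
    b"#ASDF 1.0.0\n%YAML 1.1\n--- !core/asdf-1.1.0\nsome_data: {shape: [100]}\n...\n"
    b"\xd3BLK\x00\x30 binary payload"
)
header_text = extract_yaml_header(fake_file)
print(header_text)
```

Note that the trimming happens after serialization, so the whole buffer, including any binary blocks, has already been written by the time the header is cut out.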
I think using this mock pattern was just a memory optimization. We are forced (are we?) to run through the serialization process to get the header. To avoid side effects during this phase, we take a deep copy of the current ASDF tree. When we serialize into the buffer we would effectively double the memory requirements, if we just copy the binary data of the ndarray buffers.
As this method is never (ever) used to deserialize again, we are just fine. Also the context manager of mock.patch removes this behavior after the block has been left.
Your suggestion to remove the optimization and solely rely on get_yaml_header would bloat the memory again as we need to write all arrays to the buffer.
It looks like weldx.asdf.util.write_buffer also has a dummy_arrays argument but this appears to be unused except for the test_write_buffer_dummy_inline_arrays test.
This duplication of code (3 lines) exists because the function "write_buffer" creates its own ASDF tree. It is there for the sake of completeness.
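The point about the mock.patch context manager can be illustrated with a small standard-library sketch. The ArrayWriter class and its to_inline method are hypothetical stand-ins for the patched ASDF internals, not real ASDF or weldx names; only the patching pattern itself is what is being shown.

```python
from unittest import mock


class ArrayWriter:
    """Hypothetical serializer standing in for the patched ASDF internals."""

    def to_inline(self, values):
        return list(values)


writer = ArrayWriter()
values = [0.1, 0.2, 0.3]

# Inside the block, every call to to_inline returns an empty list.
with mock.patch.object(ArrayWriter, "to_inline", lambda self, values: []):
    patched_result = writer.to_inline(values)

# After the block, the original method is back in place.
restored_result = writer.to_inline(values)

print(patched_result)   # []
print(restored_result)  # [0.1, 0.2, 0.3]
```

The patch is scoped to the with block, so code running afterwards sees the original behavior again.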
Thanks for the detailed response!
I think using this mock pattern was just a memory optimization. We are forced (are we?) to run through the serialization process to get the header. To avoid side effects during this phase, we take a deep copy of the current ASDF tree. When we serialize into the buffer we would effectively double the memory requirements, if we just copy the binary data of the ndarray buffers.
You're correct that generating the header requires serializing the objects in the tree (more on this below).
As this method is never (ever) used to deserialize again, we are just fine. Also the context manager of mock.patch removes this behavior after the block has been left.
There is one test where the result is deserialized: https://github.com/BAMWelDX/weldx/blob/7b5c0c1fe5b770852caf3aa3d9f0e8aebf2934f8/weldx/tests/asdf_tests/test_asdf_util.py#L168-L171 This test is failing with some ASDF updates (which check the shape when the array is loaded). The above test is removed in https://github.com/BAMWelDX/weldx/pull/875 so merging that PR should fix this issue.
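As a rough illustration of why a header with data: [] and shape: [100] cannot be loaded once the shape is checked, here is a minimal sketch of such a check. The function check_inline_shape is hypothetical and written only for this example; it is not ASDF's actual validation code.

```python
import math


def check_inline_shape(data, shape):
    """Raise ValueError if a flat list of inline values cannot fill `shape`.

    Hypothetical illustration only; this is not ASDF's validation code.
    """
    expected = math.prod(shape)
    if len(data) != expected:
        raise ValueError(
            f"inline data has {len(data)} elements, but shape {shape} needs {expected}"
        )


# Consistent entry: three values for shape [3] passes silently.
check_inline_shape([0.1, 0.2, 0.3], [3])

# Entry as written by the dummy-array patch: empty data, original shape kept.
try:
    check_inline_shape([], [100])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```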
Your suggestion to remove the optimization and solely rely on get_yaml_header would bloat the memory again as we need to write all arrays to the buffer.
That's a good point about the memory usage, and for large arrays it seems problematic. One option that doesn't involve writing arrays to the buffer (as long as they're not inline) would be to use asdf.yamlutil.dump_tree. This will not produce yaml that exactly matches what would be produced by AsdfFile.write_to (the asdf_library, history and likely some other metadata are not updated during dump_tree, and the ASDF and ASDF_STANDARD header comments will not be added).
Extending the above example:
import asdf
import io
import numpy as np
from weldx import WeldxFile
file = WeldxFile()
data = {"data_sets": {"first": np.random.random(100)}}
file["some_data"] = data
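buff = io.BytesIO()
af = file._asdf_handle  # underlying asdf file object (private weldx attribute)
# Serialize only the YAML tree; array data is not written to the buffer
asdf.yamlutil.dump_tree(af.tree, buff, af)
buff.seek(0)
print(buff.read().decode('ascii'))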
Produces:
%YAML 1.1
%TAG ! tag:stsci.edu:asdf/
--- !core/asdf-1.1.0
asdf_library: !core/software-1.0.0 {author: The ASDF Developers, homepage: 'http://github.com/asdf-format/asdf',
name: asdf, version: 2.15.0}
history:
extensions:
- !core/extension_metadata-1.0.0
extension_class: asdf.extension.BuiltinExtension
software: !core/software-1.0.0 {name: asdf, version: 2.15.0}
some_data:
data_sets:
first: !core/ndarray-1.0.0
source: 0
datatype: float64
byteorder: little
shape: [100]
...
I should note that asdf.yamlutil.dump_tree isn't currently in the asdf docs. We are planning to clean up the API and clarify what is public (the current working definition is that anything documented is part of the public API). I expect that asdf.yamlutil.dump_tree will be documented (and made public); however, there is a chance we will make some changes to the function signature (such as changing the ctx to a serialization context). If you do end up using it, it shouldn't be an issue to make these changes in a backwards compatible way.
Thank you for pointing out asdf.yamlutil.dump_tree, I think this would be a good substitution for our use case 👍 @braingram
It appears that ASDF internals are being patched to override writing of inline arrays: https://github.com/BAMWelDX/weldx/blob/c2ee9b1f7cb8df8b2f4723884341e9e20dcd0d50/weldx/asdf/util.py#L142-L143 which was introduced in https://github.com/BAMWelDX/weldx/pull/469
This is producing yaml (like the following, except taken from the buff generated during test_write_buffer_dummy_inline_arrays) that ASDF is unable to read:
ASDF is unable to open this due to:
shape (ASDF will raise a ValueError as seen here)
Would you be able to help me understand the reasoning behind this patching so that hopefully I can figure out how to accommodate this use of ASDF? To provide some context, I am working on a rather major rewrite of the ASDF block management code to move ndarray support to a new style Converter. This has revealed numerous ASDF issues and some of the fixes are impacting weldx (including revealing this issue). Thanks!