Closed jamesrhester closed 2 years ago
Here are some fleshed-out definitions based on an email to the imgCIF mailing list at iucr.org. Summary: An image is located through location_uri + external_path + external_frame. It is processed externally according to the specifications in external_format + external_version, and then can be further interpreted using information in _array_structure and _array_structure_list to result in an array of integers or reals (this array not defined here).
save_array_data.external_format
_definition.id '_array_data.external_format'
_description.text
;
The format in which raw array data referenced by _array_data.location_uri and
_array_data.external_path can be accessed. The format specification can include
conversions performed within external libraries. Items in _array_structure and
_array_structure_list refer to the data after retrieval from the external libraries.
In combination with _array_data.external_version this should allow external tools
to correctly access external data.
;
loop_
_enumeration_set.state
_enumeration_set.details
CBF
;
The contents of the _array_data.data in a single-frame imgCBF file. Other
datanames in the external file are ignored.
;
SMV
;
An unprocessed sequence of bytes contained in a file conforming to the SMV
format as used by ADSC and other CCD manufacturers.
;
HDF5
;
A decompressed, 2-dimensional array of numbers corresponding to a single frame contained
in an HDF5 file as returned by HDF5 library functions. _array_data.external_path includes
both the directory path and internal HDF5 path.
;
MAR
;
An array of numbers corresponding to a decompressed single frame contained
within a MARCCD file (TIFF format).
;
Bruker
;
An array of numbers corresponding to a decompressed single frame from a
data file generated by Bruker equipment.
;
_name.category_id array_data
_name.object_id external_format
save_
#####
save_array_data.external_version
_definition.id '_array_data.external_version'
_description.text
;
An identifier for the version of the file format described by _array_data.external_format.
;
save_
#####
save_array_data.location_uri
_definition.id '_array_data.location_uri'
_description.text
;
A URI describing the location of an image external to the current data block.
;
save_
#####
save_array_data.external_path
_definition.id '_array_data.external_path'
_description.text
;
A path that is used to locate the external image relative to _array_data.location_uri. This may include
both a directory structure and an internal format-dependent path.
;
save_
#####
save_array_data.external_frame
_definition.id '_array_data.external_frame'
_description.text
;
When the combination of _array_data.location_uri and _array_data.external_path
refers to a list of images, _array_data.external_frame is used to identify the image
position within that list, numbering from 1.
;
save_
And here are some rough examples of how this might look:
HDF5 file:
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.location_uri
_array_data.external_path
_array_data.external_frame
1 1 HDF5 doi://123.456/jxr run1/tartaric.h5/entry1/detector1/data 1
1 2 HDF5 doi://123.456/jxr run1/tartaric.h5/entry1/detector1/data 2
...
Single-frame Bruker file:
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.external_version
_array_data.location_uri
_array_data.external_path
1 1 Bruker Smart6000 https://uni_repo.edu/5341 run1/tartaric.001
1 2 Bruker Smart6000 https://uni_repo.edu/5341 run1/tartaric.002
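To make the intent of the loops above concrete, here is a minimal Python sketch of how a reading tool might hold each row. It is not a CIF parser: `ExternalImageRef` and `rows_to_refs` are hypothetical names, rows are split naively on whitespace, and the default frame of 1 is an assumption made for illustration.

```python
import shlex
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExternalImageRef:
    """One row of an _array_data loop that points at an external image."""
    array_id: str
    binary_id: str
    external_format: str
    location_uri: str
    external_path: Optional[str] = None
    external_version: Optional[str] = None
    external_frame: int = 1  # assumed default; frames are numbered from 1


def rows_to_refs(tags: List[str], rows: List[str]) -> List[ExternalImageRef]:
    """Pair each whitespace-separated row with the tag names of its loop."""
    refs = []
    for row in rows:
        values = shlex.split(row)
        if len(values) != len(tags):
            raise ValueError(f"expected {len(tags)} values, got {len(values)}")
        # '_array_data.external_path' -> 'external_path'
        fields = {tag.split(".", 1)[1]: value for tag, value in zip(tags, values)}
        if "external_frame" in fields:
            fields["external_frame"] = int(fields["external_frame"])
        refs.append(ExternalImageRef(**fields))
    return refs
```

A tool would then pass `location_uri`, `external_path` and `external_frame` to whatever library handles `external_format`.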
I've now prepared text for inclusion in the DDL2 version of imgCIF.dic. Presumably some of the explanatory text in imgCIF.dic could also be expanded to include these new data names. Note that I have tweaked the above definitions to also include provision for compressed archived data. Please see the examples in the definition below for further information.
# Definitions for linking to external images from within an imgCIF
# file (DDL2)
# A file-like object is located using location_uri. If this object is a
# compressed archive (e.g. zip or .tar.gz) the archive format and
# location within the archive are given by
# _array_data.external_archive_format and _array_data.external_archive_path.
# The metadata in the imgCIF file (for
# example the information in _array_structure and
# _array_structure_list) refers to the data at that location
# interpreted according to external_format + external_version with
# frame chosen according to _array_data.external_frame.
save_array_data.external_format
_item_description.description
;
The format in which raw array data referenced by
_array_data.location_uri and following archive extraction can be
accessed. Items in array_structure and array_structure_list refer
to the data after any decompressions and other transformations
performed by standard libraries associated with the format as
described in the description of each format below.
;
_item.name '_array_data.external_format'
_item.category_id array_data
_item.mandatory_code no
_item_type.code code
loop_
_item_enumeration.value
_item_enumeration.detail
CBF
;
The contents of _array_data.data in a single-frame imgCBF file. Other
datanames in the external file are ignored.
;
SMV
;
An unprocessed sequence of bytes contained in a file conforming to the SMV
format as used by ADSC and other CCD manufacturers.
;
HDF5
;
A decompressed, 2-dimensional array of numbers corresponding to a single frame contained
in an HDF5 file as returned by HDF5 library functions. _array_data.external_path is the
internal HDF5 path.
;
MAR
;
An array of numbers corresponding to a decompressed single frame contained
within a MARCCD file (TIFF format).
;
Bruker
;
An array of numbers corresponding to a decompressed single frame from a
data file generated by Bruker equipment.
;
loop_
_item_examples.case
_item_examples.detail
;
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.location_uri
_array_data.external_path
_array_data.external_frame
1 1 HDF5 https://zenodo.org/record/12345/files/tartaric.h5 /entry1/detector1/data 1
1 2 HDF5 https://zenodo.org/record/12345/files/tartaric.h5 /entry1/detector1/data 2
...
;
;
The frames are contained in a single HDF5-format file accessible
at https://zenodo.org/record/12345/files/tartaric.h5. An array of 2D
images is found at HDF5 location entry1/detector1/data
;
;
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.external_version
_array_data.location_uri
1 1 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.001
1 2 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.002
...
;
;
Frames are contained in individual Smart6000 Bruker-format files
accessible using https://uni_repo.edu/5341 in subdirectory run1.
;
;
loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.location_uri
_array_data.external_archive_format
_array_data.external_archive_path
1 1 SMV
https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
TBZ
MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0001.img
1 2 SMV
https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
TBZ
MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0002.img
;
;
Frames with SMV format are contained at data.proteindiffraction.org in a tarred
archive compressed with bzip2.
;
save_
save_array_data.external_version
_item_description.description
;
An identifier for the version of the file format described by _array_data.external_format.
;
_item.name '_array_data.external_version'
_item.category_id array_data
_item_type.code code
save_
save_array_data.location_uri
_item_description.description
;
A URI describing the location of an image external to the current data block.
;
_item.name '_array_data.location_uri'
_item.category_id array_data
_item_type.code code
save_
save_array_data.external_path
_item_description.description
;
An optional format-dependent path that is used to locate the
external image within the object referenced by the combination of
_array_data.location_uri and _array_data.external_archive_path.
;
_item.name '_array_data.external_path'
_item.category_id array_data
_item_type.code text
save_
save_array_data.external_frame
_item_description.description
;
The position of the raw frame in the list of images referenced by
_array_data.location_uri and _array_data.external_path, counting from
1.
;
_item.name '_array_data.external_frame'
_item.category_id array_data
_item_type.code int
_item_default.value 1
save_
save_array_data.external_archive_path
_item_description.description
;
The location of the image within an archive.
;
_item.name '_array_data.external_archive_path'
_item.category_id array_data
_item_type.code text
save_
save_array_data.external_archive_format
_item_description.description
;
The type of single-file archive in which image data have been encapsulated,
if any. The archive is located in the position referenced by
_array_data.location_uri.
;
_item.name '_array_data.external_archive_format'
_item.category_id array_data
_item.mandatory_code no
_item_type.code code
_item_default.value .
loop_
_item_enumeration.value
_item_enumeration.detail
ZIP 'A ZIP archive'
TGZ 'A Gzipped tar archive'
TBZ 'A Bzip2 tar archive'
. 'No compressed archive is present'
save_
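As a rough illustration of how a tool might act on _array_data.external_archive_format and _array_data.external_archive_path, here is a minimal Python sketch using only the standard library. It assumes the archive has already been downloaded from _array_data.location_uri into memory; the function name `read_archive_member` and the mapping of the ZIP/TGZ/TBZ codes onto `zipfile`/`tarfile` modes are illustrative assumptions, not part of the definitions.

```python
import io
import tarfile
import zipfile
from typing import Optional

# Assumed mapping from the enumeration codes to tarfile open modes.
_TAR_MODES = {"TGZ": "r:gz", "TBZ": "r:bz2"}


def read_archive_member(archive_bytes: bytes,
                        archive_format: Optional[str],
                        archive_path: Optional[str]) -> bytes:
    """Return the bytes of the image file stored at archive_path.

    A format of '.' (or None) means no archive is present, so the
    downloaded bytes are returned unchanged.
    """
    if archive_format in (None, "."):
        return archive_bytes
    if archive_format == "ZIP":
        with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
            return zf.read(archive_path)
    if archive_format in _TAR_MODES:
        mode = _TAR_MODES[archive_format]
        with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode=mode) as tf:
            member = tf.extractfile(archive_path)
            if member is None:  # e.g. a directory entry
                raise ValueError(f"{archive_path} is not a regular file")
            return member.read()
    raise ValueError(f"unknown archive format: {archive_format}")
```

The returned bytes would then be interpreted according to _array_data.external_format, exactly as for an unarchived file.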
Dear James,
This is fine for a single frame, but needs an extension to handle the now common case with HDF5 that the archive is a raster or scan containing blocks of multiple images in one or more arrays, i.e. more detail than _array_data.external_archive_path to specify a particular image is needed.
I think this might best be handled by using the HDF5 Virtual Dataset concepts and providing equivalent tags, especially the ones for handling hyperslabs. See
https://support.hdfgroup.org/HDF5/docNewFeatures/VDS/HDF5-VDS-requirements-use-cases-2014-12-10.pdf
If this would take us too far into the HDF5 weeds, for this particular purpose, I believe we need to be able to apply the array_section semantics already in imgCIF to specify a particular section of an external array to be treated as an imgCIF image.
One slightly mind-bending aspect we need to be sure to include is when in a raster-scan the order of images in an array of images reverses for alternate rows, which is actually a very common case. Failure to include the necessary information for that with the images can result in very wrong heat maps.
Regards, Herbert
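As an illustration of the alternate-row reversal described above, here is a minimal Python sketch. It assumes a rectangular raster with a fixed number of images per row, collected row by row, with every second row traversed in the opposite direction. The function name and its parameters are hypothetical and not part of imgCIF or HDF5.

```python
from typing import Tuple


def raster_position(frame: int, n_cols: int,
                    serpentine: bool = True) -> Tuple[int, int]:
    """Map a 1-based frame number to a 0-based (row, col) grid position.

    With serpentine=True, odd-numbered rows (counting from 0) are
    traversed right-to-left, so their column index is mirrored.
    """
    if frame < 1:
        raise ValueError("frame numbers count from 1")
    row, offset = divmod(frame - 1, n_cols)
    if serpentine and row % 2 == 1:
        col = n_cols - 1 - offset
    else:
        col = offset
    return row, col
```

For a raster with three images per row, frame 4 lands at (1, 2) when rows alternate but at (1, 0) when they do not, which is the kind of misplacement that distorts a heat map if the scan order is not recorded.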
- [...] _array_data.location, and _array_data.external_path would be an HDF5 path in this virtual dataset. The HDF5 libraries would be responsible for taking this path and retrieving the appropriate data. Is this understanding incorrect?
- [...] _array_data.external_frame, which is an integer. Is this insufficient, given an HDF5 path recognised by HDF5 libraries, to specify a particular 2D image?
- [...] array_section. The idea is that whatever data is referenced by the new tags would then be subject to any and all transformations specified in the remainder of the imgCIF file, that is, the data referenced by the new tags fulfills the same role as _array_data.data. So in particular, any information in array_section is relevant.
- [...] _array_data.external_frame then the images can be reordered according to whatever the imgCIF metadata specifies. Hopefully I've understood the issue correctly.
We need a meeting on this that includes at least Aaron Brewster, Graeme Winter, you and me to flesh out your proposal and give examples of the use cases.
I think we are talking about different things. I am purely interested in a way to identify a particular frame in an HDF5 file. If that HDF5 file is a VDS, then that frame is monolithic and the details of its assembly are irrelevant. If the details of the frame's assembly are actually relevant, then the path to the constituent HDF5 files is provided instead and the metadata for assembly into the final frame is described using the usual imgCIF tags.
- Yes the HDF5 libraries are responsible for most of the work, but for this purpose you would need to specify what frame, where, is intended to be the one being picked up. The path alone is not sufficient.
As I said, the path plus frame number should be sufficient. Sections 5.3 and 5.5 on page 22 of the document you linked state that access to a virtual dataset looks just like access to any other HDF5 dataset. The only new options are those given in 5.5.1, 5.5.2 and 5.5.3, for which reasonable defaults can be debated and added to the description of HDF5 format above.
- The frame number itself is not meaningful without the specifics of what hyperslab within the given array you intend. All you know from the path is the overall array.
But the path + frame number is enough as per pg 22 referenced above. If you did for some reason want to recreate the HDF5 reconstruction, then you would instead reference the component HDF5 files (not the VDS) and you are once again in good shape with filename + HDF5 path + frame number together with imgCIF tags in array_section et al.
- That is just what array_section is about, but in this case we have two levels of sections to deal with: the sections within the NeXus array specified by your path in the NeXus file, which is in turn the array within which we need to specify sections on the CBF side. With something as complex as CSPAD data, waving our hands and telling people to deal with it without clearly worked out examples will lead to major confusion and dialects.
No. Either the image delivered by the VDS is considered monolithic, and therefore HDF5 path + frame number for the VDS is sufficient, or else it is seen as being assembled out of sections, in which case you use the constituent HDF5 files + internal paths + frame numbers and duplicate the HDF5 VDS assembly using imgCIF descriptors, your choice. No handwaving is going on, but we could add more text to the HDF5 description making explicit the point that HDF5 VDS frames are considered monolithic, and if you want to reference constituent sections then you should reference the particular HDF5 files.
- The raster layout is major metadata. We need to nail it down.
Not in this proposal, as explained above, but if there is a gap in imgCIF metadata in e.g. array_section a separate issue can be opened.
We need a meeting on this that includes at least Aaron Brewster, Graeme Winter, you and me to flesh out your proposal and give examples of the use cases.
I note the HDF5 VDS document you linked has specific examples from DLS. Happy to meet, but before taking up everyone's time with a meeting, I suggest pointing Graeme and Aaron to this comment so that they can identify any errors, and if I am indeed being too simplistic then let's meet. As always, having these technical conversations in writing here or on the imgCIF list leaves a more permanent record and gives more time to think than a meeting, which is why I prefer them.
I agree we are talking about different things. You are talking about an abstract concept with which I agree, while I am talking about the practical realities of using that abstract concept to actually process data, which needs additional supporting tags to actually work. You and I could go back and forth for a very long time and come to something that might work, or we could bring in some of our colleagues, Aaron and Graeme, with practical experience at working with these issues and come to a really good set of definitions to resolve this much faster. I am glad you are open to bringing them in. I will send them an email.
Closing as the relevant tags are now in the dictionary.
As a serial format, imgCIF is not suitable for storing large quantities of image data. However, it should be possible to refer to externally-stored data in a way that would allow tools to access that data. Definitions should be developed for doing this.