COMCIFS / imgCIF

Development of the imgCIF dictionary

Provide a way to link to external image data #7

Closed jamesrhester closed 2 years ago

jamesrhester commented 5 years ago

As a serial format, imgCIF is not suitable for storing large quantities of image data. However, it should be possible to refer to externally-stored data in a way that would allow tools to access that data. Definitions should be developed for doing this.

jamesrhester commented 5 years ago

Here are some fleshed-out definitions based on an email to the imgCIF mailing list at iucr.org. Summary: An image is located through location_uri + external_path + external_frame. It is processed externally according to the specifications in external_format + external_version, and can then be further interpreted using information in _array_structure and _array_structure_list to result in an array of integers or reals (this array is not defined here).

save_array_data.external_format
_definition.id    '_array_data.external_format'
_description.text
;
    The format in which raw array data referenced by _array_data.location_uri and
 _array_data.external_path can be accessed. The format specification can include 
conversions performed within external libraries. Items in _array_structure and 
_array_structure_list refer to the data after retrieval from the external libraries.
 In combination with _array_data.external_version this should allow external tools
 to correctly access external data.
;
loop_
_enumeration_set.state
_enumeration_set.details
CBF
;
The contents of the _array_data.data  in a single-frame imgCBF file. Other
datanames in the external file are ignored.
;
SMV
;
An unprocessed sequence of bytes contained in a file conforming to the SMV 
format as used by ADSC and other CCD manufacturers.
;
HDF5
;
A decompressed, 2-dimensional array of numbers corresponding to a single frame contained 
in an HDF5 file as returned by HDF5 library functions. _array_data.external_path includes 
both the directory path and internal HDF5 path. 
;
MAR
;
An array of numbers corresponding to a decompressed single frame contained
 within a MARCCD file (TIFF format).
;
Bruker
;
An array of numbers corresponding to a decompressed single frame from a 
data file generated by Bruker equipment.
;
_name.category_id         array_data
_name.object_id             external_format
save_

#####

save_array_data.external_version
_definition.id                 '_array_data.external_version'
_description.text
;
   An identifier for the version of the file format described by _array_data.external_format. 
;
save_

#####

save_array_data.location_uri
_definition.id        '_array_data.location_uri'
_description.text
;
A URI describing the location of an image external to the current data block
;
save_

#####

save_array_data.external_path
_definition.id       '_array_data.external_path'
_description.text
;
A path that is used to locate the external image relative to _array_data.location_uri. This may include
both a directory structure and an internal format-dependent path.
;
save_

#####

save_array_data.external_frame
_definition.id      '_array_data.external_frame'
_description.text
;
When the combination of _array_data.location_uri and _array_data.external_path 
refers to a list of images, _array_data.external_frame is used to identify the image 
position within that list, numbering from 1.
;
save_
jamesrhester commented 5 years ago

And here are some rough examples of how this might look:

HDF5 file:

loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.location_uri
_array_data.external_path
_array_data.external_frame
1 1 HDF5 doi://123.456/jxr  run1/tartaric.h5/entry1/detector1/data 1
1 2 HDF5 doi://123.456/jxr  run1/tartaric.h5/entry1/detector1/data 2
...

Single-frame Bruker file


loop_
_array_data.array_id
_array_data.binary_id
_array_data.external_format
_array_data.external_version
_array_data.location_uri
_array_data.external_path
1 1 Bruker Smart6000 https://uni_repo.edu/5341  run1/tartaric.001 
1 2 Bruker Smart6000 https://uni_repo.edu/5341  run1/tartaric.002
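As a rough illustration of how a tool might read rows like these, the Python sketch below splits each loop row into named fields and joins location_uri with external_path. The `ExternalFrame` class, the `parse_rows` helper and the plain "/" join are assumptions made for illustration only; they are not part of the proposal, and a real reader would use a proper CIF parser rather than whitespace splitting.

```python
from dataclasses import dataclass
from typing import List, Optional

# Column order copied from the single-frame Bruker example above.
COLUMNS = [
    "array_id", "binary_id", "external_format",
    "external_version", "location_uri", "external_path",
]


@dataclass
class ExternalFrame:
    """One row of the _array_data loop (hypothetical helper type)."""
    array_id: str
    binary_id: str
    external_format: str
    external_version: Optional[str]
    location_uri: str
    external_path: Optional[str]

    def target(self) -> str:
        # Assumption: external_path is resolved relative to location_uri
        # with a plain "/" join, as the definition text suggests.
        if self.external_path is None:
            return self.location_uri
        return self.location_uri.rstrip("/") + "/" + self.external_path.lstrip("/")


def parse_rows(text: str) -> List[ExternalFrame]:
    """Split whitespace-delimited loop rows into ExternalFrame records."""
    frames = []
    for line in text.strip().splitlines():
        values = line.split()
        if len(values) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} values, got {len(values)}")
        frames.append(ExternalFrame(**dict(zip(COLUMNS, values))))
    return frames


rows = """
1 1 Bruker Smart6000 https://uni_repo.edu/5341  run1/tartaric.001
1 2 Bruker Smart6000 https://uni_repo.edu/5341  run1/tartaric.002
"""
frames = parse_rows(rows)
for frame in frames:
    print(frame.binary_id, frame.target())
```

Keeping the URI and the path as separate fields until the last moment mirrors the definitions, which treat the repository location and the position within it as distinct pieces of information.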
jamesrhester commented 3 years ago

I've now prepared text for inclusion in the DDL2 version of imgCIF.dic. Presumably some of the explanatory text in imgCIF.dic could also be expanded to include these new data names. Note that I have tweaked the above definitions to also include provision for compressed archived data. Please see the examples in the definition below for further information.

# Definitions for linking to external images from within an imgCIF
# file (DDL2)

# A file-like object is located using location_uri.  If this object is a
# compressed archive (e.g. zip or .tar.gz) the archive format and
# location within the archive are given by
# _array_data.external_archive_format and _array_data.external_archive_path.
# The metadata in the imgCIF file (for
# example the information in _array_structure and
# _array_structure_list) refers to the data at that location
# interpreted according to external_format + external_version with
# frame chosen according to _array_data.external_frame.

save_array_data.external_format
    _item_description.description
;

    The format in which raw array data referenced by
    _array_data.location_uri and following archive extraction can be
    accessed. Items in array_structure and array_structure_list refer
    to the data after any decompressions and other transformations
    performed by standard libraries associated with the format as
    described in the description of each format below.

;

    _item.name                   '_array_data.external_format'
    _item.category_id            array_data
    _item.mandatory_code         no
    _item_type.code              code

    loop_
    _item_enumeration.value
    _item_enumeration.detail
    CBF
;
    The contents of _array_data.data in a single-frame imgCBF file. Other
    datanames in the external file are ignored.
;
    SMV
;
    An unprocessed sequence of bytes contained in a file conforming to the SMV 
    format as used by ADSC and other CCD manufacturers.
;
    HDF5
;
    A decompressed, 2-dimensional array of numbers corresponding to a single frame contained 
    in an HDF5 file as returned by HDF5 library functions. _array_data.external_path is the
    internal HDF5 path. 
;
    MAR
;
    An array of numbers corresponding to a decompressed single frame contained
    within a MARCCD file (TIFF format).
;
    Bruker
;
    An array of numbers corresponding to a decompressed single frame from a 
    data file generated by Bruker equipment.
;

    loop_
      _item_examples.case
      _item_examples.detail
;
    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.external_format
    _array_data.location_uri
    _array_data.external_path
    _array_data.external_frame
    1 1 HDF5 https://zenodo.org/record/12345/files/tartaric.h5 /entry1/detector1/data 1
    1 2 HDF5 https://zenodo.org/record/12345/files/tartaric.h5 /entry1/detector1/data 2
    ...
;
;
    The frames are contained in a single HDF5-format file accessible
    at https://zenodo.org/record/12345/files/tartaric.h5. An array of 2D
    images is found at HDF5 location entry1/detector1/data
;

;
    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.external_format
    _array_data.external_version
    _array_data.location_uri
    1 1 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.001 
    1 2 Bruker Smart6000 https://uni_repo.edu/5341/run1/tartaric.002
    ...
;

;
    Frames are contained in individual Smart6000 Bruker-format files
    accessible using https://uni_repo.edu/5341 in subdirectory run1.
;

;
    loop_
    _array_data.array_id
    _array_data.binary_id
    _array_data.external_format
    _array_data.location_uri
    _array_data.external_archive_format
    _array_data.external_archive_path
    1 1 SMV
        https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
        TBZ
        MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0001.img
    1 2 SMV
        https://data.proteindiffraction.org/ssgcid/MyulA_01062_a_B12-sddc0001574_7k69.tar.bz2
        TBZ
        MyulA_01062_a_B12-sddc0001574_7k69/data/317895h4_y_0002.img
;

;

    Frames with SMV format are contained at data.proteindiffraction.org in a tarred
    archive compressed with bzip2.

;
    save_

save_array_data.external_version
    _item_description.description
;
    An identifier for the version of the file format described by _array_data.external_format. 
;

    _item.name                '_array_data.external_version'
    _item.category_id         array_data
    _item_type.code           code

save_

save_array_data.location_uri
    _item_description.description
;
    A URI describing the location of an image external to the current data block
;
    _item.name                  '_array_data.location_uri'
    _item.category_id           array_data
    _item_type.code             code

    save_

save_array_data.external_path
   _item_description.description

;

    An optional format-dependent path that is used to locate the
    external image within the object referenced by the combination of
    _array_data.location_uri and _array_data.external_archive_path.

;

    _item.name                   '_array_data.external_path'
    _item.category_id            array_data
    _item_type.code              text

    save_

save_array_data.external_frame
    _item_description.description
;

    The position of the raw frame in the list of images referenced by
    _array_data.location_uri and _array_data.external_path, counting from
    1.

;
    _item.name                   '_array_data.external_frame'
    _item.category_id            array_data
    _item_type.code              int
    _item_default.value          1

    save_

save_array_data.external_archive_path
   _item_description.description

;

    The location of the image within an archive.

;

    _item.name                   '_array_data.external_archive_path'
    _item.category_id            array_data
    _item_type.code              text

    save_

save_array_data.external_archive_format
    _item_description.description
;

    The type of single-file archive in which image data have been encapsulated,
    if any. The archive is located in the position referenced by
    _array_data.location_uri.

;

    _item.name                   '_array_data.external_archive_format'
    _item.category_id            array_data
    _item.mandatory_code         no
    _item_type.code              code
    _item_default.value          .

    loop_
    _item_enumeration.value
    _item_enumeration.detail
    ZIP           'A ZIP archive'
    TGZ           'A Gzipped tar archive'
    TBZ           'A Bzip2 tar archive'
    .             'No compressed archive is present'

    save_
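To make the archive step concrete, here is a minimal Python sketch of how a reader might pull one frame out of an archive that has already been downloaded from _array_data.location_uri. The function name `read_archive_member` and the mapping from ZIP, TGZ and TBZ to standard-library calls are assumptions, not part of the definitions. The demonstration builds a tiny in-memory archive with a hypothetical member name instead of fetching real data.

```python
import io
import tarfile
import zipfile

# Assumed mapping from _array_data.external_archive_format to tarfile modes.
TAR_MODES = {"TGZ": "r:gz", "TBZ": "r:bz2"}


def read_archive_member(archive_bytes: bytes, archive_format: str,
                        archive_path: str) -> bytes:
    """Return the raw bytes stored at archive_path inside the archive."""
    if archive_format == ".":
        # No archive present: the retrieved object is the image file itself.
        return archive_bytes
    buffer = io.BytesIO(archive_bytes)
    if archive_format == "ZIP":
        with zipfile.ZipFile(buffer) as archive:
            return archive.read(archive_path)
    if archive_format in TAR_MODES:
        with tarfile.open(fileobj=buffer, mode=TAR_MODES[archive_format]) as archive:
            member = archive.extractfile(archive_path)
            if member is None:
                raise ValueError(f"{archive_path} is not a regular file")
            return member.read()
    raise ValueError(f"unsupported archive format: {archive_format}")


def make_demo_tbz(member_name: str, payload: bytes) -> bytes:
    """Build a small bzip2-compressed tar archive in memory (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:bz2") as archive:
        info = tarfile.TarInfo(name=member_name)
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


demo_name = "demo/data/frame_0001.img"  # hypothetical member name
demo_archive = make_demo_tbz(demo_name, b"SMV header and pixels")
frame_bytes = read_archive_member(demo_archive, "TBZ", demo_name)
print(len(frame_bytes))
```

The bytes returned here would then be handed to whatever reader matches _array_data.external_format, so archive handling and image decoding stay separate steps.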
yayahjb commented 3 years ago

Dear James,

This is fine for a single frame, but needs an extension to handle the now common case with HDF5 that the archive is a raster or scan containing blocks of multiple images in one or more arrays, i.e. more detail than _array_data.external_archive_path to specify a particular image is needed.

I think this might best be handled by using the HDF5 Virtual Dataset concepts and providing equivalent tags, especially the ones for handling hyperslabs. See

https://support.hdfgroup.org/HDF5/docNewFeatures/VDS/HDF5-VDS-requirements-use-cases-2014-12-10.pdf

If this would take us too far into the HDF5 weeds, for this particular purpose, I believe we need to be able to apply the array_section semantics already in imgCIF to specify a particular section of an external array to be treated as an imgCIF image.

One slightly mind-bending aspect we need to be sure to include is when in a raster-scan the order of images in an array of images reverses for alternate rows, which is actually a very common case. Failure to include the necessary information for that with the images can result in very wrong heat maps.

Regards, Herbert
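To illustrate the raster ordering concern, the following Python sketch maps a frame number to a grid position for a scan whose direction reverses on alternate rows. The function name, its parameters, and the convention that the first row runs left to right are assumptions for illustration; imgCIF does not define them here.

```python
from typing import Tuple


def raster_position(frame: int, n_cols: int, serpentine: bool = True) -> Tuple[int, int]:
    """Map a 1-based frame number to a 0-based (row, column) grid position."""
    if frame < 1:
        raise ValueError("frame numbers count from 1")
    row, offset = divmod(frame - 1, n_cols)
    # On odd-numbered rows (0-based) a serpentine scan runs right to left.
    if serpentine and row % 2 == 1:
        col = n_cols - 1 - offset
    else:
        col = offset
    return row, col


# A 3-column raster: frames 4 to 6 fill the second row in reverse order.
positions = [raster_position(f, n_cols=3) for f in range(1, 7)]
print(positions)
```

Passing `serpentine=False` gives the naive row-by-row layout, which places frames 4 to 6 in mirrored columns. That is the kind of error that would distort a heat map.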

On Mon, Apr 12, 2021 at 2:25 AM jamesrhester @.***> wrote:

I've now prepared text for inclusion in the DDL2 version of imgCIF.dic. Presumably some of the explanatory text in imgCIF.dic could also be expanded to include these new data names. Note that I have tweaked the above definitions to also include provision for compressed archived data. Please see the examples in the definition below for further information.

jamesrhester commented 3 years ago
  1. HDF5 virtual datasets. My understanding is that a virtual dataset is intended to look from the outside like a single file. This single file would be the object that is referenced by _array_data.location_uri, and _array_data.external_path would be an HDF5 path in this virtual dataset. The HDF5 libraries would be responsible for taking this path and retrieving the appropriate data. Is this understanding incorrect?
  2. Specifying individual frames. Note _array_data.external_frame which is an integer. Is this insufficient, given an HDF5 path recognised by HDF5 libraries, to specify a particular 2D image?
  3. Using array_section. The idea is that whatever data is referenced by the new tags would then be subject to any and all transformations specified in the remainder of the imgCIF file, that is, the data referenced by the new tags fulfills the same role as _array_data.data . So in particular, any information in array_section is relevant.
  4. Image rasters. If a particular frame can be specified using _array_data.external_frame then the images can be reordered according to whatever the imgCIF metadata specifies. Hopefully I've understood the issue correctly.
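The frame-selection idea in point 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: a nested list stands in for the array that an HDF5 library would return for the given path, the first axis is assumed to index frames, and `select_frame` is a hypothetical helper name.

```python
from typing import List, Sequence

Frame = List[List[int]]


def select_frame(stack: Sequence[Frame], external_frame: int = 1) -> Frame:
    """Return the frame at a 1-based position in a stack of 2D images."""
    if not 1 <= external_frame <= len(stack):
        raise IndexError(f"external_frame {external_frame} outside 1..{len(stack)}")
    return stack[external_frame - 1]


# Stand-in for the array found at an HDF5 path such as /entry1/detector1/data:
# three 2x2 frames, with the first axis assumed to index frames.
stack = [
    [[0, 1], [2, 3]],
    [[10, 11], [12, 13]],
    [[20, 21], [22, 23]],
]
second = select_frame(stack, 2)
print(second)
```

The default of 1 follows the _item_default.value given for _array_data.external_frame, so a path that refers to a single image needs no explicit frame number.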
yayahjb commented 3 years ago
  1. Yes the HDF5 libraries are responsible for most of the work, but for this purpose you would need to specify what frame, where, is intended to be the one being picked up. The path alone is not sufficient.
  2. The frame number itself is not meaningful without the specifics of what hyperslab within the given array you intend. All you know from the path is the overall array.
  3. That is just what array_section is about, but in this case we have two levels of sections to deal with: the sections within the NeXus array specified by your path in the NeXus file, which is then the array within which we need to specify sections on the CBF side. With something as complex as CSPAD data, waving our hands and telling people to deal with it without clearly worked out examples will lead to major confusion and dialects.
  4. The raster layout is major metadata. We need to nail it down.

We need a meeting on this that includes at least Aaron Brewster, Graeme Winter, you and me to flesh out your proposal and give examples of the use cases.

jamesrhester commented 3 years ago

I think we are talking about different things. I am purely interested in a way to identify a particular frame in an HDF5 file. If that HDF5 file is a VDS, then that frame is monolithic and the details of its assembly are irrelevant. If the details of the frame's assembly are actually relevant, then the path to the constituent HDF5 files is provided instead and the metadata for assembly into the final frame is described using the usual imgCIF tags.

  1. Yes the HDF5 libraries are responsible for most of the work, but for this purpose you would need to specify what frame, where, is intended to be the one being picked up. The path alone is not sufficient.

As I said, the path plus frame number should be sufficient. Sections 5.3 and 5.5 on page 22 of the document you linked state that access to a virtual dataset looks just like access to any other HDF5 dataset. The only new options are those given in 5.5.1, 5.5.2 and 5.5.3, for which reasonable defaults can be debated and added to the description of HDF5 format above.

  2. The frame number itself is not meaningful without the specifics of what hyperslab within the given array you intend. All you know from the path is the overall array.

But the path + frame number is enough, as per page 22 referenced above. If you did for some reason want to recreate the HDF5 reconstruction, then you would instead reference the component HDF5 files (not the VDS) and you are once again in good shape with filename + HDF5 path + frame number together with imgCIF tags in array_section et al.

  3. That is just what array_section is about, but in this case we have two levels of sections to deal with: the sections within the NeXus array specified by your path in the NeXus file, which is then the array within which we need to specify sections on the CBF side. With something as complex as CSPAD data, waving our hands and telling people to deal with it without clearly worked out examples will lead to major confusion and dialects.

No. Either the image delivered by the VDS is considered monolithic, and therefore HDF5 path + frame number for the VDS is sufficient, or else it is seen as being assembled out of sections, in which case you use the constituent HDF5 files + internal paths + frame numbers and duplicate the HDF5 VDS assembly using imgCIF descriptors, your choice. No handwaving is going on, but we could add more text to the HDF5 description making explicit the point that HDF5 VDS frames are considered monolithic, and if you want to reference constituent sections then you should reference the particular HDF5 files.

  4. The raster layout is major metadata. We need to nail it down.

Not in this proposal, as explained above, but if there is a gap in imgCIF metadata in e.g. array_section a separate issue can be opened.

We need a meeting on this that includes at least Aaron Brewster, Graeme Winter, you and me to flesh out your proposal and give examples of the use cases.

I note the HDF5 VDS document you linked has specific examples from DLS. Happy to meet, but before taking up everyone's time with a meeting, I suggest pointing Graeme and Aaron to this comment so that they can identify any errors, and if I am indeed being too simplistic then let's meet. As always, having these technical conversations in writing here or on the imgCIF list leaves a more permanent record and gives more time to think than a meeting, which is why I prefer them.

yayahjb commented 3 years ago

I agree we are talking about different things. You are talking about an abstract concept with which I agree, while I am talking about the practical realities of using that abstract concept to actually process data, which needs additional supporting tags to actually work. You and I could go back and forth for a very long time and come to something that might work, or we could bring in some of our colleagues, Aaron and Graeme, who have practical experience of working with these issues, and come to a really good set of definitions to resolve this much faster. I am glad you are open to bringing them in. I will send them an email.

jamesrhester commented 2 years ago

Closing as the relevant tags are now in the dictionary.