geospace-code / h5fortran

Lightweight HDF5 polymorphic Fortran: h5write() h5read()
https://geospace-code.github.io/h5fortran
BSD 3-Clause "New" or "Revised" License

Can h5fortran read UTF-8 string attributes? #29

Closed: milancurcic closed this 2 years ago

milancurcic commented 2 years ago

HDF5 n00b here. I'm able to write a simple HDF5 file with a global attribute and read the attribute back from it, e.g.:

  use h5fortran, only : hdf5_file
  implicit none

  type(hdf5_file) :: h5f
  character(100) :: attrval

  call h5f % open('test_file.h5', action='w')
  call h5f % writeattr('/', 'greeting', 'hello')
  call h5f % close()

  call h5f % open('test_file.h5', action='r')
  call h5f % readattr('/', 'greeting', attrval)
  call h5f % close()

  print *, attrval

I get the output that I expect.

Then, I'm trying to read a global attribute from a file output by Keras (attached). I use the same approach:

  character(100) :: attrval

  call h5f % open('mnist_dense.h5', action='r')
  call h5f % readattr('/', 'model_config', attrval)
  call h5f % close()

  print *, attrval

However, the output is not what I expect:

 ��Y                                                                             

(and similar; it varies between runs).

In an attempt to understand why, I used ncdump and h5dump to inspect the files. From the simple test_file.h5 I created, I have:

$ ncdump -h test_file.h5 
netcdf test_file {

// global attributes:
        :greeting = "hello" ;
}
$ h5dump test_file.h5 
HDF5 "test_file.h5" {
GROUP "/" {
   ATTRIBUTE "greeting" {
      DATATYPE  H5T_STRING {
         STRSIZE 6;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_ASCII;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "hello"
      }
   }
}
}

And for the Keras-generated HDF5 file:

$ ncdump -h test/data/mnist_dense.h5
netcdf mnist_dense {

// global attributes:
        string :keras_version = "2.9.0" ;
        string :backend = "tensorflow" ;
        string :model_config = "..."
trimmed for brevity

$ h5dump mnist_dense.h5 
HDF5 "mnist_dense.h5" {
GROUP "/" {
   ATTRIBUTE "backend" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "tensorflow"
      }
   }
trimmed for brevity

Comparing the two h5dump outputs, I can see that the attribute types are different in terms of STRSIZE (6 vs. H5T_VARIABLE) and CSET (H5T_CSET_ASCII vs H5T_CSET_UTF8).

What do you think about this? It seems to me that the different encoding (ASCII vs. UTF8) could be the culprit for my failed reading of the Keras file. Does h5fortran support this, and if so, how should I do the reading?

Thanks!

Attachment (gzipped so GitHub lets me upload it): mnist_dense.tar.gz

milancurcic commented 2 years ago

Could it be possible that the issue is with STRSIZE being variable rather than fixed?
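
As a quick way to check exactly that, here is a minimal sketch (my own illustration, not part of the h5fortran API) that queries the attribute's on-file string datatype with the low-level HDF5 Fortran interface, assuming the attached mnist_dense.h5 is in the working directory:

program inspect_attr

  use hdf5
  implicit none

  integer(HID_T) :: fid, aid, atype
  integer :: hdferr, cset
  logical :: is_vlen

  call h5open_f(hdferr)
  call h5fopen_f('mnist_dense.h5', H5F_ACC_RDONLY_F, fid, hdferr)
  call h5aopen_by_name_f(fid, '.', 'model_config', aid, hdferr)
  call h5aget_type_f(aid, atype, hdferr)

  ! query the on-file string datatype: is it variable-length, and which character set?
  call h5tis_variable_str_f(atype, is_vlen, hdferr)
  call h5tget_cset_f(atype, cset, hdferr)

  print *, 'variable-length string:', is_vlen
  print *, 'UTF-8 character set:   ', cset == H5T_CSET_UTF8_F

  call h5tclose_f(atype, hdferr)
  call h5aclose_f(aid, hdferr)
  call h5fclose_f(fid, hdferr)
  call h5close_f(hdferr)

end program inspect_attr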

scivision commented 2 years ago

I need to add tests for this and fix it. Until recent updates, h5fortran character support was rather narrow in scope. h5py discusses UTF8 vs. ASCII strings in HDF5 files, and h5py defaults to UTF8. In Fortran, some compilers, including Intel oneAPI 2022, do not yet support a UTF8 (ISO 10646) character kind, so I can't just make h5fortran default to UTF8.
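
As an aside (my own sketch, not h5fortran code), the standard selected_char_kind intrinsic is a quick way to probe whether a given compiler offers the ISO 10646 (UCS-4) character kind at all; it returns -1 when that kind is unsupported:

program check_ucs4

  implicit none

  ! selected_char_kind returns -1 if the requested character kind is unsupported
  integer, parameter :: ucs4 = selected_char_kind('ISO_10646')

  if (ucs4 < 0) then
    print *, 'this compiler does not support the ISO 10646 (UCS-4) character kind'
  else
    print *, 'ISO 10646 (UCS-4) character kind available, kind =', ucs4
  end if

end program check_ucs4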

Several days ago I added the ability to read variable-length string datasets, but this might be missing for attributes; that should be an easy fix.

So in short there are two possible issues:

  1. attributes may need variable-length string support added, just as datasets already have in h5fortran
  2. may need to add the capability to write/read UTF8 with HDF5, provided the compiler supports it (a low-level sketch follows this list)
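
For point 2, here is a minimal sketch of what that could look like with the low-level HDF5 Fortran API (my illustration, not current h5fortran code): copy a fixed-length string datatype and flag its character set as UTF-8 before attaching the attribute. The file name, attribute name, and value below are placeholders.

program write_utf8_attr

  use hdf5
  implicit none

  character(len=*), parameter :: val = 'hello'
  integer(HID_T) :: fid, sid, tid, aid
  integer(HSIZE_T) :: dims(1) = [1_HSIZE_T]
  integer :: hdferr

  call h5open_f(hdferr)
  call h5fcreate_f('utf8_attr.h5', H5F_ACC_TRUNC_F, fid, hdferr)

  ! fixed-length string datatype, flagged as UTF-8 instead of the ASCII default
  call h5tcopy_f(H5T_FORTRAN_S1, tid, hdferr)
  call h5tset_size_f(tid, int(len(val), SIZE_T), hdferr)
  call h5tset_cset_f(tid, H5T_CSET_UTF8_F, hdferr)

  ! scalar dataspace for a single string attribute on the root group
  call h5screate_f(H5S_SCALAR_F, sid, hdferr)
  call h5acreate_f(fid, 'greeting', tid, sid, aid, hdferr)
  call h5awrite_f(aid, tid, val, dims, hdferr)

  call h5aclose_f(aid, hdferr)
  call h5sclose_f(sid, hdferr)
  call h5tclose_f(tid, hdferr)
  call h5fclose_f(fid, hdferr)
  call h5close_f(hdferr)

end program write_utf8_attr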

scivision commented 2 years ago

There have been recent updates to character handling in h5fortran (over the last several days through today).

I didn't make any changes to attributes in those feature updates, and character data is even more commonly used for attributes than for datasets, so this is worthwhile.

milancurcic commented 2 years ago

Thanks, Michael. In case it's helpful, I found this thread and example: https://forum.hdfgroup.org/t/how-to-read-a-utf-8-string/6125/6

milancurcic commented 2 years ago

I made a bit of headway by adapting the example above. I couldn't find a subroutine in the API to query the length of the string (let me know if you know of one), so for now it hardcodes a buffer length that is hopefully large enough.

program p

  implicit none

  print *, get_h5_attribute_string('mnist_dense.h5', '.', 'model_config') 

contains

  function get_h5_attribute_string(filename, object_name, attribute_name) result(res)
    use hdf5, only: H5F_ACC_RDONLY_F, HID_T, &
                    h5aclose_f, h5aget_type_f, h5aopen_by_name_f, h5aread_f, &
                    h5close_f, h5fclose_f, h5fopen_f, h5open_f, h5tclose_f
    use iso_c_binding, only: c_char, c_f_pointer, c_loc, c_null_char, c_ptr

    character(*), intent(in) :: filename
    character(*), intent(in) :: object_name
    character(*), intent(in) :: attribute_name
    character(:), allocatable :: res

    ! Make sufficiently large to hold most attributes
    integer, parameter :: BUFLEN = 10000

    type(c_ptr) :: f_ptr
    type(c_ptr), target :: buffer
    character(len=BUFLEN, kind=c_char), pointer :: string => null()
    integer(HID_T) :: fid, aid, atype
    integer :: hdferr

    ! Initialize the HDF5 library, open the file and the attribute,
    ! and get the attribute's on-file datatype
    call h5open_f(hdferr)
    call h5fopen_f(filename, H5F_ACC_RDONLY_F, fid, hdferr)
    call h5aopen_by_name_f(fid, object_name, attribute_name, aid, hdferr)
    call h5aget_type_f(aid, atype, hdferr)

    ! Read the variable-length string: HDF5 returns a C pointer to the
    ! string data in buffer, which we then map onto a Fortran character pointer
    f_ptr = c_loc(buffer)
    call h5aread_f(aid, atype, f_ptr, hdferr)
    call c_f_pointer(buffer, string)

    ! Close the datatype, attribute, file, and library
    call h5tclose_f(atype, hdferr)
    call h5aclose_f(aid, hdferr)
    call h5fclose_f(fid, hdferr)
    call h5close_f(hdferr)

    ! Copy up to, but not including, the C null terminator
    res = string(:index(string, c_null_char) - 1)

  end function get_h5_attribute_string

end program p

Building and running the program on the h5 file attached above in this thread returns the expected output:

 {"class_name": "Sequential", "config": {"name": "sequential", "layers": [{"class_name": "InputLayer", "config": {"batch_input_shape": [null, 784], "dtype": "float32", "sparse": false, "ragged": false, "name": "input_1"}}, {"class_name": "Dense", "config": {"name": "dense", "trainable": true, "dtype": "float32", "units": 30, "activation": "sigmoid", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}, {"class_name": "Dense", "config": {"name": "dense_1", "trainable": true, "dtype": "float32", "units": 10, "activation": "softmax", "use_bias": true, "kernel_initializer": {"class_name": "GlorotUniform", "config": {"seed": null}}, "bias_initializer": {"class_name": "Zeros", "config": {}}, "kernel_regularizer": null, "bias_regularizer": null, "activity_regularizer": null, "kernel_constraint": null, "bias_constraint": null}}]}}

scivision commented 2 years ago

Note: this UTF8 read is done using the default Fortran character kind. I haven't examined the consequences of this vs. the UCS4 Fortran kind; I note that in both cases len=4 character is required.

My test of this is trivial, so please reopen this issue if it doesn't work for you.