HenrikBengtsson / R.matlab

R package: R.matlab
https://cran.r-project.org/package=R.matlab

WISH: Give informative error message when trying to read MAT v7.3 files #23

Closed HenrikBengtsson closed 8 years ago

HenrikBengtsson commented 9 years ago

(Extracted from Issue #20)

Feature

Before even thinking about having readMat() support reading MAT v7.3 files, which are based on the HDF5 format, the function should at least recognize that those files are indeed in the MAT v7.3 file format and give an informative error message that the format is not yet supported.

Idea

The above requires updating the code that parses the beginning of MAT files to decide what MAT file format version the file has. See https://www.hdfgroup.org/HDF5/doc/H5.format.html#Superblock for HDF5 file signatures.
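As a rough illustration of the idea above, here is a minimal pure-R sketch that looks for the 8-byte HDF5 format signature. The HDF5 specification allows the superblock to start at byte offset 0, 512, 1024, 2048, and so on, so the sketch checks each of those offsets. The function name `hasHdf5Signature()` and the `maxOffset` argument are hypothetical and not part of R.matlab.

```r
# The 8-byte HDF5 format signature: \211 H D F \r \n \032 \n
HDF5_SIGNATURE <- as.raw(c(0x89, 0x48, 0x44, 0x46, 0x0d, 0x0a, 0x1a, 0x0a))

# Hypothetical helper (not part of R.matlab): returns TRUE if the HDF5
# signature is found at offset 0, 512, 1024, 2048, ... up to 'maxOffset'.
hasHdf5Signature <- function(pathname, maxOffset = 8192) {
  con <- file(pathname, open = "rb")
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = maxOffset + length(HDF5_SIGNATURE))

  # Candidate offsets: 0, then 512 doubled repeatedly
  offsets <- 0
  step <- 512
  while (step <= maxOffset) {
    offsets <- c(offsets, step)
    step <- 2 * step
  }

  for (offset in offsets) {
    idx <- offset + seq_along(HDF5_SIGNATURE)
    if (max(idx) > length(bytes)) break   # file is too short for this offset
    if (identical(bytes[idx], HDF5_SIGNATURE)) return(TRUE)
  }
  FALSE
}
```

This only tells us that a file looks like HDF5, not that it is a valid MAT v7.3 file, so it is a "good-enough" check at best.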

Requirement

Neither the R.matlab package nor any of its required dependencies (under Depends and Imports) requires compilation of native (C, Fortran, ...) code. This makes it particularly easy for everyone to install and use R.matlab, e.g. there's no need to have the proper native libraries or a complete toolchain installed when installing from source. Since I've seen so many people struggle with missing libraries etc., I'd like to keep it this way as far as possible. (Note that packages under Suggests are optional and therefore do not have to be easy to install.)

Design / implementation

Thus, in order to meet this requirement, the ability to detect MAT v7.3 files cannot rely on a package that needs native-code compilation. Because of this, it is unfortunately not possible to use h5::is.h5file() to test whether the file is in the MAT v7.3/HDF5 file format, because the h5 package builds from C++ code (on top of Rcpp).

Instead, an idea is to write an isMat7_3() function in pure R that uses some "good-enough" heuristics for deciding whether a file is in the MAT v7.3 file format or not. Only after it has identified a file as being in that format should it call require(h5) or similar (by using use("h5") of R.utils, the package is installed automagically). This way, R.matlab works for everyone even if their setup does not allow native-code compilation.
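A minimal sketch of that pattern, under stated assumptions: the detection step is pure R, and the optional HDF5 package is only touched once a file has been identified as MAT v7.3. The heuristic here ASSUMES that MAT v7.3 files begin with the text "MATLAB 7.3 MAT-file", which I have not verified against any official documentation. The names `isMat7_3Sketch()` and `chooseReader()` are hypothetical, and `requireNamespace()` stands in for `require(h5)`.

```r
# Hypothetical pure-R heuristic. ASSUMPTION: a MAT v7.3 file starts with the
# header text "MATLAB 7.3 MAT-file".
isMat7_3Sketch <- function(pathname) {
  prefix <- charToRaw("MATLAB 7.3 MAT-file")
  con <- file(pathname, open = "rb")
  on.exit(close(con))
  bytes <- readBin(con, what = "raw", n = length(prefix))
  identical(bytes, prefix)
}

# Decide which reader to use. The optional package is only checked for
# after the file has been identified as MAT v7.3.
chooseReader <- function(pathname, hdf5Package = "h5") {
  if (!isMat7_3Sketch(pathname)) {
    return("pure-R reader")
  }
  if (!requireNamespace(hdf5Package, quietly = TRUE)) {
    stop("File is in the MAT v7.3 (HDF5) format; reading it requires the ",
         "optional '", hdf5Package, "' package, which is not installed")
  }
  "HDF5-based reader"
}
```

Users who only read MAT v4/v5 files never reach the `requireNamespace()` call, so they never need the optional package.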

An alternative strategy is to write an isMat5or6() function for testing if a file is a MAT v5 or v6 file, in addition to the already existing isMat4() function that tests for MAT v4 files. With this, we could postpone/lower the need for HDF5 readers by doing something like:

   if (isMat4(...)) {
     readMat4(...)
   } else if (isMat5or6(...)) {
     readMat5(...)
   } else {
     readMat7_3(...)
   }

Compare this to today's:

   if (isMat4(...)) {
     readMat4(...)
   } else {
     readMat5(...)
   }

HenrikBengtsson commented 8 years ago

Information on how to distinguish MAT v5 and MAT v7.3 files can be found at http://www.digitalpreservation.gov/formats/fdd/fdd000440.shtml. It seems that we need to parse the first eight bytes in order to tell the difference.

However, I'm not sure if this is the official way of distinguishing them.

The current approach we use is to read the first four bytes as a "magic" in order to distinguish MAT v4 and MAT v5 files. Thus, we need to extend this to at least eight bytes.

tbeu commented 8 years ago

You might want to check out the C library matio for how to distinguish between v5 and v7.3 MAT files.

HenrikBengtsson commented 8 years ago

@tbeu, thanks for this. I'm curious, did you find that by reverse engineering or do you know if it is documented anywhere?

tbeu commented 8 years ago

Only v5 is documented. Search for version or 0x0100 in http://www.mathworks.com/help/pdf_doc/matlab/matfile_format.pdf.
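For illustration, here is a hedged pure-R sketch of reading that version field. It relies on the documented MAT v5 header layout: a 128-byte header with the 16-bit version in bytes 125-126 and the endian indicator ("M" and "I") in bytes 127-128. The value 0x0100 for MAT v5 is the documented one. Treating 0x0200 as MAT v7.3 is an ASSUMPTION based on how matio appears to classify files, not on official documentation. Both function names are hypothetical.

```r
# Hypothetical helper, not part of R.matlab or matio.
# Returns the 16-bit version field of a 128-byte MAT header, or NA if the
# file is too short or the endian indicator is not recognized.
matHeaderVersion <- function(pathname) {
  con <- file(pathname, open = "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 128L)
  if (length(header) < 128L) return(NA_integer_)

  # Bytes 127-128 hold the characters "M" and "I"; they appear on disk as
  # "IM" when the file was written in little-endian byte order.
  indicator <- header[127:128]
  if (identical(indicator, charToRaw("IM"))) {
    endian <- "little"
  } else if (identical(indicator, charToRaw("MI"))) {
    endian <- "big"
  } else {
    return(NA_integer_)
  }

  # Bytes 125-126 hold the version as an unsigned 16-bit integer
  readBin(header[125:126], what = "integer", size = 2L, n = 1L,
          signed = FALSE, endian = endian)
}

# 0x0100 is the documented MAT v5 value; 0x0200 as MAT v7.3 is an ASSUMPTION.
describeMatVersion <- function(version) {
  if (is.na(version)) {
    "unknown"
  } else if (version == 0x0100) {
    "MAT v5"
  } else if (version == 0x0200) {
    "MAT v7.3"
  } else {
    "unknown"
  }
}
```

Usage would be something like `describeMatVersion(matHeaderVersion("data.mat"))`, where `data.mat` is a placeholder file name.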

HenrikBengtsson commented 8 years ago

I've updated readMat() to detect MAT v7.3 files according to @tbeu's suggestion (thanks). When reading such files, we now get an informative error message:

> library("R.matlab")
> readMat(system.file("mat-files", "Matrix-v7.3.mat", package="R.matlab"))
Error in readMat5(con, firstFourBytes = firstFourBytes, maxLength = maxLength) : 
  Reading of MAT v7.3 files is not supported. If possible, save the data in MATLAB using 'save -V6'.

Until it is released on CRAN, this development version can be installed using:

> source("http://callr.org/install#HenrikBengtsson/R.matlab@develop")

HenrikBengtsson commented 8 years ago

R.matlab 3.6.1, which implements this (the assertion that the MAT file is not a MAT v7.3 file), is now on CRAN.

Sherakbar commented 5 years ago

Hi, I am also receiving this type of error message: Error in readMat5(con, firstFourBytes = firstFourBytes, maxLength = maxLength) : Reading of MAT v7.3 files is not supported. If possible, save the data in MATLAB using 'save -V6'. I am using R.matlab version 3.6.2, but it's not working for me...