HARPgroup / HARParchive

This repo houses HARP code development items, resources, and intermediate work products.

rhdf5 #207

Open rburghol opened 2 years ago

rburghol commented 2 years ago

Overview

Installation

install.packages("BiocManager")
BiocManager::install("rhdf5")
library("rhdf5")
dsn411 = h5read("forA51800.h5","/TIMESERIES/TS411/table")
glenncampagna commented 2 years ago

Exploring rhdf5 as a possible solution to finding timestamps in .h5 files

glenncampagna commented 2 years ago

Comparing h5read and h5ls commands:

h5ls("file", recursive = TRUE, all = FALSE, datasetinfo = TRUE, index_type, native = FALSE)

h5read("file", name, index = NULL, start = NULL, stride = NULL, block = NULL, count = NULL, compoundAsDataFrame = TRUE, callGeneric = TRUE, read.attributes = FALSE, drop = FALSE, native = FALSE, s3 = FALSE, s3credentials = NULL)

h5ls:

h5read:
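A minimal sketch of the difference, assuming the OR1_7700_7980.h5 file used elsewhere in this thread: h5ls() returns a data frame describing the file's hierarchy, while h5read() returns the contents stored at one named path.

```
library(rhdf5)

# h5ls() lists the hierarchy: one row per group/dataset, with type and dims
contents <- h5ls("OR1_7700_7980.h5")
head(contents[, c("group", "name", "otype", "dim")])

# h5read() pulls the actual data stored at one specific path
hydr <- h5read("OR1_7700_7980.h5", "/RESULTS/RCHRES_R001/HYDR/table")
str(hydr)
```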

juliabruneau commented 2 years ago

h5read passes its arguments through to H5Dread, which can read a partial dataset from an HDF5 file. Could this help with the error above?

H5Dread:

@juliabruneau - good information/hunting. Perhaps this is due to it being a partial dataset? Though I think the memory error comes from an unspecific query: if I keep knocking segments off the end of the h5read path, I get more and more warnings, presumably because the data retrieved gets larger. Maybe this error just means there is too much data?
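If the warnings really do scale with how much data is pulled, one hedge is to read only a slice: the h5read signature above includes an index argument, so a few rows of a table can be requested instead of a whole subtree. A sketch, using the TS1001 path from this thread:

```
library(rhdf5)

# read only the first 10 rows of the table instead of the full dataset;
# index is a list with one element per dimension (this table is 1-D)
ts_head <- h5read("OR1_7700_7980.h5", "/TIMESERIES/TS1001/table",
                  index = list(1:10))
str(ts_head)
```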

Terminal Code 1: Output of `h5dump -n OR1_7700_7980.h5 | grep TS1001`

 group      /TIMESERIES/TS1001
 group      /TIMESERIES/TS1001/_i_table
 group      /TIMESERIES/TS1001/_i_table/index
 dataset    /TIMESERIES/TS1001/_i_table/index/abounds
 dataset    /TIMESERIES/TS1001/_i_table/index/bounds
 dataset    /TIMESERIES/TS1001/_i_table/index/indices
 dataset    /TIMESERIES/TS1001/_i_table/index/indicesLR
 dataset    /TIMESERIES/TS1001/_i_table/index/mbounds
 dataset    /TIMESERIES/TS1001/_i_table/index/mranges
 dataset    /TIMESERIES/TS1001/_i_table/index/ranges
 dataset    /TIMESERIES/TS1001/_i_table/index/sorted
 dataset    /TIMESERIES/TS1001/_i_table/index/sortedLR
 dataset    /TIMESERIES/TS1001/_i_table/index/zbounds
 group      /TIMESERIES/TS1001/_i_table/values
 dataset    /TIMESERIES/TS1001/_i_table/values/abounds
 dataset    /TIMESERIES/TS1001/_i_table/values/bounds
 dataset    /TIMESERIES/TS1001/_i_table/values/indices
 dataset    /TIMESERIES/TS1001/_i_table/values/indicesLR
 dataset    /TIMESERIES/TS1001/_i_table/values/mbounds
 dataset    /TIMESERIES/TS1001/_i_table/values/mranges
 dataset    /TIMESERIES/TS1001/_i_table/values/ranges
 dataset    /TIMESERIES/TS1001/_i_table/values/sorted
 dataset    /TIMESERIES/TS1001/_i_table/values/sortedLR
 dataset    /TIMESERIES/TS1001/_i_table/values/zbounds
 dataset    /TIMESERIES/TS1001/table
rburghol commented 2 years ago

Hey all - see below, which is excerpted from the test cases we worked on yesterday (see also #211). This one gets us the data we want and gives clues about where to look for other data (hint: maybe not TIMESERIES):

rchres_data = h5read("OR1_7700_7980.h5", "/RESULTS/RCHRES_R001/HYDR/table")
names(rchres_data)
quantile(rchres_data$ROVOL)
juliabruneau commented 2 years ago

HDFView 3.1.4

We can explore the .h5 files with an application called HDFView. It is designed specifically to open .hdf5/.h5 files, and it provides a directory tree for looking into the different groups and attributes within the .h5 file. The only "limitation" is that you have to register on the HDF Group website in order to download the application, but it only asks for your email and what organization you're a part of (academic research).

This is the process to access the files in HDFView:

  1. Download the application: https://www.hdfgroup.org/downloads/hdfview/?1656346198

    • Download the .zip file: 'HDFView-3.1.4-win10_64-vs16.zip'
    • Extract the .zip file with something like 7-zip
  2. Download the .h5 file: http://deq1.bse.vt.edu:81/files/cbp/OR1_7700_7980.h5

    • Do this by right-clicking on the link, and then choosing: 'Save link as...'
    • If the browser discards the download, click the up arrow and hit 'Keep'
  3. Click on the downloaded .h5 file to open it (this will open it automatically in the HDFViewer)

  4. Now you are able to see all the groups and different "layers" in our .h5 file

image

This viewer gives a better understanding of the contents of an HDF5 file, and it can hopefully help us understand how to extract the timestamp using R. Maybe we can utilize the viewer's export function to extract .txt files?

Update: Can save table as a .txt file to computer. Working on putting into R.
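As a possible next step for putting the exported table into R, base R's read.table could pick it up. A sketch, where 'hydr_table.txt' is a hypothetical name for the file saved from HDFView (the actual delimiter and header row may differ from what HDFView writes):

```
# hypothetical file name; adjust sep/header to match the HDFView export
hydr <- read.table("hydr_table.txt", header = TRUE, sep = "\t",
                   stringsAsFactors = FALSE)
head(hydr)
```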

glenncampagna commented 2 years ago

Using H5Dread to get 64-bit timestamps:

fid = H5Fopen("OR1_7700_7980.h5")
did = H5Dopen(fid, "RESULTS/RCHRES_R001/HYDR/table")
rchres1 <- H5Dread(did, bit64conversion = "double")
head(rchres1, 10)  # first 10 rows (columns truncated here)
 index       DEP      IVOL O1 O2         O3 OVOL1 OVOL2      OVOL3
1    4.417668e+17 0.2415072  8.847264  0  0   2.055183     0     0  0.1229398
2    4.417704e+17 0.3002159  8.828485  0  0   3.175832     0     0  0.2161576
3    4.417740e+17 0.3486096  8.810900  0  0   4.282218     0     0  0.3081839
4    4.417776e+17 0.3905506  8.793962  0  0   5.374578     0     0  0.3990412
5    4.417812e+17 0.4279465  8.777404  0  0   6.453111     0     0  0.4887475
6    4.417848e+17 0.4619085  8.761090  0  0   7.517995     0     0  0.5773184
7    4.417884e+17 0.4931512  8.744946  0  0   8.569401     0     0  0.6647684
8    4.417920e+17 0.5221504  8.728931  0  0   9.606858     0     0  0.7510851
9    4.417956e+17 0.5492472  8.713009  0  0  10.629819     0     0  0.8362263
10   4.417992e+17 0.5747209  8.697166  0  0  11.638691     0     0  0.9201863

Don't forget to close the open data objects, both the dataset and the file, when finished:

H5Dclose(did)
H5Fclose(fid)

origin <- "1970-01-01"
rchres1$index <- as.POSIXct(rchres1$index / 1000000000, origin = origin, tz = "UTC")

head(rchres1)
                index       DEP     IVOL O1 O2       O3 OVOL1 OVOL2     OVOL3
1 1984-01-01 01:00:00 0.2415072 8.847264  0  0 2.055183     0     0 0.1229398
2 1984-01-01 02:00:00 0.3002159 8.828485  0  0 3.175832     0     0 0.2161576
3 1984-01-01 03:00:00 0.3486096 8.810900  0  0 4.282218     0     0 0.3081839
4 1984-01-01 04:00:00 0.3905506 8.793962  0  0 5.374578     0     0 0.3990412
5 1984-01-01 05:00:00 0.4279465 8.777404  0  0 6.453111     0     0 0.4887475
6 1984-01-01 06:00:00 0.4619085 8.761090  0  0 7.517995     0     0 0.5773184
  PRSUPY       RO     ROVOL     SAREA        TAU     USTAR      VOL VOLEV
1      0 2.055183 0.1229398  67.47366 0.02344255 0.1099862 15.79433     0
2      0 3.175832 0.2161576  83.87602 0.02914127 0.1226281 24.40666     0
3      0 4.282218 0.3081839  97.39652 0.03383873 0.1321425 32.90938     0
4      0 5.374578 0.3990412 109.11423 0.03790982 0.1398658 41.30430     0
5      0 6.453111 0.4887475 119.56212 0.04153978 0.1464090 49.59296     0
6      0 7.517995 0.5773184 129.05061 0.04483640 0.1521076 57.77673     0


Note: We found that this table's last timestamp is 1984-09-02 02:00:00
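As a quick sanity check on the conversion, the first raw index value from the H5Dread output (4.417668e+17, nanoseconds since the Unix epoch) should map to the first timestamp shown above:

```
origin <- "1970-01-01"
# first raw index value from the H5Dread output (nanoseconds since epoch)
ns <- 4.417668e+17
stamp <- as.POSIXct(ns / 1e9, origin = origin, tz = "UTC")
format(stamp)  # "1984-01-01 01:00:00"
```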
megpritch commented 2 years ago

HDFView 3.1.4

  • Ultimately, you're able to click on 'Show Data with Options', which will provide another window with a table, and you can extract the table as a text file (shown below)

image

Update: Can save table as a .txt file to computer. Working on putting into R.

Attempting to View Output .h5 Files in the HDFView Program:

  • After running the land test case (https://github.com/HARPgroup/HARParchive/issues/211), I thought it would be easier to explore and compare the river and land model outputs if we were to open them in the viewer.
  • It would allow us to view the groups, subgroups, and data tables by clicking on them as folders rather than repeatedly running commands.
  • However, to open them in the viewer they need to be downloaded to your local computer. Since we do not have a link on Github as we did for the first .h5 file, this became an issue.
  • Command to copy files from the server to a local computer (found online): `scp user@server:/path/to/remotefile.zip /Local/Target/Destination`
  • For multiple files at once: `scp user@host:/remote/path/\{file1.zip,file2.zip\} /Local/Path/`
  • With this, I struggled with how to reference my local computer, because calling the local disk "C:" means nothing to the server
  • These are the main things I tried:
    - `scp megpritch@deq2:~/OR1_7700_7980.h5 ~/Desktop/`
      - received: _/home/megpritch/Desktop/: Is a directory_
      - Not sure what to do with this information, because it doesn't seem to be an error
    - `scp megpritch@deq2:~/\{OR1_7700_7980.h5,forA51800.h5\} ip_address/Desktop/folder_name`
      - received: _No such file or directory_
    - `/home/megpritch/OR1_7700_7980.h5 ~\folder_name\OR1_7700_7980.h5`
      - This said it downloaded successfully, but then I couldn't find it on my computer. It turns out it made a copy of the file inside my deq account's home directory, renamed "folder_nameOR1_7700_7980.h5"

What Now?

rburghol commented 2 years ago

Still be useful since if we simply want to use it to help us understand structure better, we only really need to have one H5 file for Rivers, and another H5 file for land as a template. That is because each land H5 file will share an identical structures to each other land H5 file. Similarly the same will apply for river h5s. I like that you tried scp, and seems like you got close. We can find a more efficient solution tomorrow.