ComputationalRadiationPhysics / libSplash

libSplash - Simple Parallel file output Library for Accumulating Simulation data using Hdf5
GNU Lesser General Public License v3.0

Attribute _size is of type uint64 #266

Closed PrometheusPi closed 6 years ago

PrometheusPi commented 6 years ago

Various attributes are given as uint64 type. This might be slightly more memory efficient, but it causes a lot of trouble when using the common openPMD data analysis tools. For example, loading _size to slice the data via integer division, as in

Ny = f[...].attrs['_size'][1]
data_slice = f[...][:, Ny//2, :]

fails, because integer division of a uint by an int yields a floating-point result in Python, which cannot be used as an index.

I would vote for changing these types to ints. Any other suggestions on how to solve this?
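
A minimal sketch that reproduces the effect without an HDF5 file. A plain NumPy array stands in for the data set, and all variable names are illustrative. The divisor is written as np.int64(2) so that the uint64/int64 promotion is explicit; how a bare Python 2 is treated depends on the NumPy version.

```python
import numpy as np

# Stand-in for the data set; a real file would be read via h5py instead.
data = np.zeros((4, 6, 8))

# Stand-in for the '_size' attribute, stored as uint64.
size = np.array(data.shape, dtype=np.uint64)
Ny = size[1]

# Mixing uint64 with a signed 64-bit integer promotes to float64.
half = Ny // np.int64(2)
print(type(half).__name__, half)

# A float is not a valid index, so the slice raises an IndexError.
try:
    data[:, half, :]
    failed = False
except IndexError:
    failed = True
print("slicing failed:", failed)
```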

PrometheusPi commented 6 years ago

cc @steindev and @alex-koe

ax3l commented 6 years ago

@PrometheusPi the attributes with a _ prefix (such as _size) are not openPMD and are only in libSplash files for legacy reasons. Please use the corresponding openPMD attributes instead.

For example, _size does not exist in openPMD - just use .shape of the data set in python (it is an int!).

Decisions for unsigned vs. signed are usually not made because of memory constraints but because of the value range. A size can never be negative, for example.

Indeed, uint-int arithmetic is often weird in Python, since it automatically tries to cast up to a "more precise" type in mixed-type math. In any case, you can always cast the value you use as an array index:

data_slice = f[...][:, int(Ny)//2, :]

ax3l commented 6 years ago

Python 2.7 and 3.4 possibilities for the numpy cast handling:

import numpy as np

np.uint(3)//2
# 1.0
# correct int division, just upcasted to float (which numpy indexes do not like)

np.uint(3)/2
# 1.5
# proper float division (which numpy indexes do not like)

np.floor_divide(np.uint(3), 2, dtype=np.int64)
# 1
# proper numpy mixed int math

N = f[...].attrs['_size'].astype(np.int64)
Ny = N[1]
data_slice = f[...][:, Ny//2, :]
# convert on read

I usually prefer int(Ny)//2 or, if necessary, the last method. This is still "just" a NumPy-specific thing and not really a question of the data type of the stored attribute.

Anyway, for your specific question: use .shape of the numpy ndarray you read:

Ny = f[...].shape[1]
data_slice = f[...][:, Ny//2, :]
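
A small sketch of why the .shape route sidesteps the problem, again using an in-memory NumPy array as a stand-in for the data set (the array and its dimensions are made up for illustration). The entries of .shape are plain Python ints, so floor division stays an integer.

```python
import numpy as np

# Illustrative stand-in for the data set read from the file.
data = np.zeros((4, 6, 8))

# .shape is a tuple of plain Python ints, not NumPy unsigned scalars.
Ny = data.shape[1]
print(type(Ny).__name__)

# Integer floor division yields an int, which is a valid index.
data_slice = data[:, Ny // 2, :]
print(data_slice.shape)

# The explicit cast gives the same index when starting from a uint64.
Ny_attr = np.uint64(6)
same_slice = data[:, int(Ny_attr) // 2, :]
```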

PrometheusPi commented 6 years ago

Further investigation showed that only the np.uint64 data type is affected. I opened an issue at the numpy repo.

This seems to be a python issue - I will thus close this issue here.

PrometheusPi commented 6 years ago

Update: In numpy, casting to float is intentional at this point. Since uintX has a slightly higher upper limit than intX, a cast to intX might lead to errors. Thus uintX // intX returns the next larger signed type, int(2X). However, since uint64 is already the largest integer type available, a cast to float is favored there to avoid possible errors.

Thus, numpy will not change this behavior.
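
The promotion rule described above can be inspected directly with np.promote_types. This is only a sketch to illustrate the pattern; the dictionary name is made up here.

```python
import numpy as np

# Pairs of unsigned and signed integer types of the same width.
pairs = [
    (np.uint8, np.int8),
    (np.uint16, np.int16),
    (np.uint32, np.int32),
    (np.uint64, np.int64),
]

# Map each unsigned type to the type NumPy promotes the mixed pair to.
promoted = {}
for unsigned, signed in pairs:
    result = np.promote_types(unsigned, signed)
    promoted[np.dtype(unsigned).name] = result.name
    print(np.dtype(unsigned).name, "+", np.dtype(signed).name, "->", result.name)
```

Up to 32 bit the result is the next larger signed integer; only the 64-bit pair has no larger integer to fall back on and ends up as float64.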

PrometheusPi commented 6 years ago

The optimal solution is thus:

Ny // np.uint(2)
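
A short sketch of why this works: with both operands unsigned, no promotion to float takes place. np.uint64 is used here instead of np.uint to fix the width explicitly, and the array is an illustrative stand-in for the data set.

```python
import numpy as np

# Illustrative stand-in for the data set read from the file.
data = np.zeros((4, 6, 8))

# Stand-in for the uint64 value read from the '_size' attribute.
Ny = np.uint64(data.shape[1])

# uint64 // uint64 stays uint64, so the result is a valid index.
half = Ny // np.uint64(2)
print(half.dtype, half)

data_slice = data[:, half, :]
print(data_slice.shape)
```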

ax3l commented 6 years ago

Thx for asking upstream!

Actually, the decision to go to float for a slightly larger range (2x) comes at the cost of precision, because of the limited mantissa of floats. This can only be counteracted by going to really large floats, which is costly in memory and speed when starting from a (u)int64...
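
The precision loss can be seen in plain Python, whose float is an IEEE 754 double with a 53-bit significand. The values below are picked only for illustration.

```python
# Smallest positive integer that a 64-bit float cannot represent exactly.
n = 2**53 + 1

# The conversion rounds to the nearest representable value.
as_float = float(n)
print(int(as_float) == n)       # the round trip does not return n
print(int(as_float) == 2**53)   # it lands on the neighbouring value

# The largest uint64 value is rounded as well.
uint64_max = 2**64 - 1
print(int(float(uint64_max)) - uint64_max)
```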