ess-dmsc / h5cpp

C++ wrapper for the HDF5 C-library
https://ess-dmsc.github.io/h5cpp/
GNU Lesser General Public License v2.1

Improvements in resizing of datasets #284

Closed martukas closed 6 years ago

martukas commented 6 years ago

I see now that my previous wrapper had already integrated automatic resizing of datasets when writing appropriate hyperslabs. Basically, if the requested hyperslab is within the maximum dimensions but not within current dimensions, it would extend the dataset to fit the newly written data. In our current implementation I see that this must be done manually.

Furthermore, while the Dataset::extent(Dimensions) function makes sense, the Dataset::extent(size_t dim, ssize_t delta_elements) one is a bit perplexing. Why delta and not the total extent of the dimension? It is not clear that delta will always be convenient. And the implementation of this function looks a bit bulky, so it makes me think there is more going on here and is a good candidate for refactoring. Perhaps we should think about a few different ways clients would want to use this feature.

So, my suggestions are as follows:

  1. Provide a few different functions for extending datasets, perhaps using delta and absolute values. Maybe bring some more people into this discussion to see what needs are out there?

  2. It probably doesn't make sense to resize automatically every time, but again, some added convenience wouldn't hurt. For example, I could imagine a function called accommodate, which, given a dataspace selection, would check if any of its bounds go beyond the current size (but still within maximum size), and resize the dataset accordingly.

  3. As an adjacent issue, I might also suggest changing the name of the extent function to resize. I am again looking at h5py and I think they have the right idea. We already have "dataspace" and "dimensions", and I think "rank" is somewhere in there as well. I would aim for less terminology unless it's totally needed. Most people think of this as resizing and honestly, so do we, so I think the h5py folks have it right here.
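The accommodate idea from point 2 can be sketched in plain C++ without touching the library. This is purely illustrative: the function name and signature are hypothetical, operating on plain dimension vectors rather than the h5cpp API. Given the current and maximum dimensions and the exclusive upper bounds of a selection (offset plus block extent per dimension), it reports the grown dimensions if a resize is needed, or nothing if the selection already fits.

```cpp
#include <optional>
#include <stdexcept>
#include <vector>

using Dimensions = std::vector<size_t>;

// Hypothetical sketch of "accommodate": return the new dimensions the
// dataset would need to fit the selection, std::nullopt if it already
// fits, and throw if the selection exceeds the maximum dimensions.
std::optional<Dimensions> accommodate(const Dimensions &current,
                                      const Dimensions &maximum,
                                      const Dimensions &selection_end) {
  if (current.size() != maximum.size() ||
      current.size() != selection_end.size())
    throw std::invalid_argument("rank mismatch");

  Dimensions grown = current;
  bool needs_resize = false;
  for (size_t i = 0; i < current.size(); ++i) {
    if (selection_end[i] > maximum[i])
      throw std::out_of_range("selection exceeds maximum dimensions");
    if (selection_end[i] > current[i]) {
      grown[i] = selection_end[i];  // grow just enough to fit the write
      needs_resize = true;
    }
  }
  return needs_resize ? std::optional<Dimensions>{grown} : std::nullopt;
}
```

A write wrapper could call this before the actual write and, only when it returns a value, apply the resize; keeping the check side-effect-free makes the "explicit opt-in" variant discussed below easy to bolt on.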

martukas commented 6 years ago

I need a broader sampling of opinions on this @SkyToGround @matthew-d-jones @dominikwerder

SkyToGround commented 6 years ago

My opinions on the questions you raise are as follows:

  1. Functionality for changing the size of a dataset by a delta is not something that should exist in a library such as this: I think it is a rare use case in the grand scheme of things, and there might be corner cases in how this is done. Although we need this functionality for what we are doing, it is easy for us to implement it in a way that is tailored specifically to our needs, e.g. extending a dataset by passing a std::array or similar to the relevant function.

  2. Similar to what is mentioned above, I do not think this should be in the h5cpp library. Although it can be a nice feature to be able to automatically change the size of a dataset based on where you write the data, my experience is that this is a very dangerous feature that has caused a number of serious bugs (see Matlab). If you really want this feature anyway, it is relatively easy to implement yourself.

  3. This is where I agree with you, the extent() function has a confusing name. Changing this to resize() would be an improvement.

martukas commented 6 years ago

I'll take them in reverse order.

Glad we agree on #3.

I'm willing to concede on #2 for now. However, if you do use extendable datasets, you will be growing them every time you write data. And if you also like clean code, you will soon write helper functions for this. So, in the end, anyone who uses extendable datasets will end up writing another wrapper to do the same thing, which is what we set out to avoid with this library...

It could be implemented in the library, but only as an explicit option, so that it doesn't go about resizing datasets by default when you don't expect it. This could be an additional flag or parameter, in a similar vein to the property you can supply at group and dataset creation to create intermediate groups in a complex path. I guess this could wait until we have a better interface for property lists. But in the long run I think it would make sense to implement this auto-extending in one place, rather than make everyone else do it as if there were any purpose to such "homework".

As for #1, I'm not sure we are talking about the same thing here... We need an interface: one that makes sense, and one that is useful. Do you resize the dataset by providing the new total size, or by providing some delta number of elements it should be grown by? Which is the default way to do it? Right now it is the latter. I argue that it is neither obvious nor as useful. Maybe it is useful to someone? Maybe it is useful enough that we should have both ways of doing it. There is nothing wrong with that, and implementing it will take marginally more time than discussing it. But I'd like to survey the rest of you to see what you find useful. From your answer I didn't understand which way of doing it made more sense, only that I should write as little wrapper as possible so that we can all happily do more boilerplate in our individual projects ;) Look at the Hyperslab class. There are a number of common ways it can be used. Many ways. And there is nothing wrong with that. You have IDE auto-completion to tell you what signatures you can use. No confusion :)
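The two conventions under discussion can be written side by side as stand-in helpers on a plain dimensions vector; these are illustrative sketches, not h5cpp API, and the names `resize_to`/`resize_by` are hypothetical:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

using Dimensions = std::vector<size_t>;

// Absolute convention: the caller states the new total extent of one dimension.
void resize_to(Dimensions &dims, size_t dim, size_t new_extent) {
  if (dim >= dims.size()) throw std::out_of_range("bad dimension index");
  dims[dim] = new_extent;
}

// Delta convention (what Dataset::extent(size_t, ssize_t) currently does):
// the caller states how many elements to grow (or shrink) the dimension by.
void resize_by(Dimensions &dims, size_t dim, std::ptrdiff_t delta) {
  if (dim >= dims.size()) throw std::out_of_range("bad dimension index");
  const auto result = static_cast<std::ptrdiff_t>(dims[dim]) + delta;
  if (result < 0) throw std::underflow_error("dimension would be negative");
  dims[dim] = static_cast<size_t>(result);
}
```

The delta form needs signed arithmetic and an underflow check the absolute form avoids, which may be part of why the current implementation looks bulky; offering both, with the absolute form as the primary `resize`, would match the h5py precedent mentioned above.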

yuelongyu commented 6 years ago

Just some comments, maybe not fully related to the title. I am mainly a user of the library. From my point of view, when writing data, having to use the extent() and offset() methods to locate the position before the write is a bit confusing. I would prefer them to be hidden behind the write() method if possible. If extent() is needed, the library should do the job, not the user.

Another thing I would like to mention is passing the dataspace when writing the data. In fact, I don't quite understand why a dataspace is needed before writing the data, because the dataspace is already decided when creating the dataset. If the data to be written doesn't match, then it is simply an error.

eugenwintersberger commented 6 years ago

I think introducing a new method resize and declaring extent as deprecated is a good idea. I also agree that the delta semantics are rather strange. However, the functionality is useful, so I would suggest adding a free-standing function, something like

void resize_by(Dataset dataset, size_t dimension_index, ssize_t delta)
eugenwintersberger commented 6 years ago

@yuelongyu the problem is not that simple. During IO operations, the dataspace passed to the IO function/method describes the particular portion of the dataset where the data is to be read or written. This is why you have to pass the dataspace to the method/function. Concerning your entirely understandable criticism of extent and offset: I hope we can fix this with the resize methods and the new (not entirely implemented) view classes.