MannLabs / alphabase

Infrastructure of AlphaX ecosystem
https://alphabase.readthedocs.io
Apache License 2.0

Io #4

Closed · swillems closed this 3 years ago

swillems commented 3 years ago

HDF Python code that is fully functional. It was developed in Python directly; the notebook is a stub with minimal testing.

swillems commented 3 years ago

Overall extremely cool stuff. Tested code and notebook on Mac with no problems. Can now understand why this was merged asap :D

Potential Bugs:

  • A single string (hdf_file.a = '123') will not show up in hdf_file.components

The components attribute should probably not be used too much and is more of a convenience/internal function. The example you show here is part of hdf.metadata, where it will show up as a key-value pair in a dict.
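A tiny sketch of what that looks like in practice (attribute names as described above; the exact output is an assumption):

import alphabase.io.hdf

hdf_file = alphabase.io.hdf.HDF_File('test.hdf', read_only=False)
hdf_file.a = '123'          # stored as metadata, not as a component
print(hdf_file.metadata)    # expected to contain {'a': '123'}
print(hdf_file.components)  # '123' does not appear here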

  • Numpy string arrays are only supported in object mode: hdf_file.b = np.array(['ABC', 'DEF'], dtype='object') works, while hdf_file.b = np.array(['ABC', 'DEF']) raises TypeError: No conversion path for dtype: dtype('<U3'). Maybe the error message could mention that the array should be in object mode, so people can fix it themselves?

This can indeed be somewhat annoying. There is some commented-out (pirate) code on lines 230-238 that is related to this issue. However, this is somewhat difficult to control, because the original dtype is lost (note that HDF5 simply cannot store all string dtypes that numpy supports, and even fewer of those that pandas supports). While this sacrifices some genericity, it is not necessarily bad to enforce some uniformity when multiple people are working with multiple codebases. That said, the error message is indeed somewhat confusing; I'll update it and make it clearer for the user.
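Until the message is improved, a minimal workaround sketch on the user side (the helper name is made up; the behaviour it works around is the TypeError described above):

import numpy as np

def as_hdf_compatible(array):
    # HDF5 has no conversion path for numpy's fixed-width unicode dtypes ('<U...'),
    # so cast those to object (variable-length strings) before writing.
    if array.dtype.kind == 'U':
        return array.astype(object)
    return array

# hdf_file.b = as_hdf_compatible(np.array(['ABC', 'DEF']))  # no longer raises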

Minor suggestion for usability: when using an unsupported type like list, one gets a NotImplementedError: Type '<class 'list'>' is invalid for attribute a_list error. Maybe also state which types are supported?

Done. At the moment it overwrites the old message; do you have a suggestion for how to pass multiple messages more easily?
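One option, purely as a sketch (the writer function below is a made-up stand-in, not the actual internals): chain the exceptions with raise ... from, so both messages stay visible in the traceback instead of one overwriting the other.

def write_value(value):
    # Hypothetical stand-in for the internal write call that raises the TypeError.
    raise TypeError(f"No conversion path for type {type(value)}")

value = ['not', 'supported']
try:
    write_value(value)
except TypeError as original_error:
    # Running this raises the chained error, which shows both messages.
    raise NotImplementedError(
        f"Type '{type(value)}' is invalid for attribute 'a_list'."
    ) from original_error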

Truncate: are there any use cases where we wouldn't want to truncate? (i.e. set the default to True)

A simple use case is one where an analysis was done partially: e.g. array1 is created, the run crashed on array2, and array3 is not made yet. While you can of course check for this explicitly, Python's "easier to ask forgiveness than permission" paradigm can now be applied to "auto-continue" partial workflows, as sketched below.
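A minimal sketch of that pattern (the truncate keyword and the exception types caught are assumptions; the exact names may differ):

import numpy as np
import alphabase.io.hdf

def compute_array2():
    # Placeholder for an expensive computation.
    return np.arange(3) * 2

hdf_file = alphabase.io.hdf.HDF_File('partial.hdf', read_only=False, truncate=False)
try:
    hdf_file.array2  # already written in a previous, partial run
except (AttributeError, KeyError):
    hdf_file.array2 = compute_array2()  # only (re)compute what is missing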

Minor suggestion for documentation: a minimal usage example - there is the test, but maybe this is too much for starters.

Suggestion:

import numpy as np
import alphabase.io.hdf

# Create a new hdf file and write data to it
hdf_file = alphabase.io.hdf.HDF_File('test.hdf', read_only=False)
hdf_file.a = np.array([1, 2, 3])
hdf_file.b = np.array([4, 5, 6])

# Read the file again, show its components and access the data
hdf_file = alphabase.io.hdf.HDF_File('test.hdf')
print(hdf_file.components)
print(hdf_file.a.values)

I briefly copy-pasted it to the hdf notebook, which should be greatly expanded for good documentation and testing.

Above the line from the other PR:

Some thoughts on the HDF - just for brainstorming / discussion?

Do you think we will ever need something like SWMR? For example for a case where multiple processes work on the same hdf, or do you think we will have a workaround?

I considered this, but it is fairly tricky, mostly because if thread 1 writes something new, the Python object wrapper in thread 2 is not automatically updated. Multiple readers would be a good idea, but I am not sure how well this is supported. If you have more experience with it, I'd be happy to get a gentle introduction.
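For reference, a minimal h5py-level sketch of what SWMR looks like (not integrated with the HDF_File wrapper; note that the reader only sees new data after refresh(), which is exactly the wrapper-update problem mentioned above):

import h5py
import numpy as np

# Writer: libver='latest' is required, and all datasets must exist
# before SWMR mode is switched on.
with h5py.File('swmr_test.hdf', 'w', libver='latest') as writer:
    dataset = writer.create_dataset('a', data=np.arange(3), maxshape=(None,))
    writer.swmr_mode = True
    dataset.resize((6,))
    dataset[3:] = [3, 4, 5]
    dataset.flush()  # make the new data visible to SWMR readers

# Reader (typically a separate process; shown here after the writer finished):
with h5py.File('swmr_test.hdf', 'r', swmr=True) as reader:
    reader['a'].refresh()  # pick up any changes flushed by the writer
    print(reader['a'][:])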

Dataset storage layout: contiguous / chunked. We use the default contiguous, and I think this is the way to go (especially if we should have some live applications at some point where we stream the data). Are there any arguments for a chunked storage layout that we might have missed?

In the past we used contiguous; now everything is chunked (and compressed) by default. This is done primarily because we need to be able to append to arrays/dataframes, and this can only be done with chunked storage. I have yet to assess the performance penalty, but this is likely to be very dependent on what kind of array/dataframe is passed. It would be possible to make some context managers that allow us to control how we store the data, but I am not sure if this is overkill.
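At the h5py level, the append pattern that forces chunked storage looks roughly like this (a sketch, not the wrapper's actual code; the compression filter is just an example):

import h5py
import numpy as np

with h5py.File('chunked_test.hdf', 'w') as hdf:
    dataset = hdf.create_dataset(
        'a',
        data=np.arange(3),
        maxshape=(None,),    # resizable along the first axis
        chunks=True,         # let h5py pick a chunk shape
        compression='lzf',   # lightweight compression; 'gzip' is smaller but slower
    )
    # Appending = resize, then write into the new slots.
    dataset.resize((6,))
    dataset[3:] = [3, 4, 5]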

Memory mapping: I once saw a gist for memory mapping directly on an HDF file - this could be quite cool for low-memory-footprint applications. Do you think it would make sense to include a memory-map reading mode? Or is this too specific?

HDF files are low-memory by default. The gist you saw is mostly useful to make sure there are fewer dangling files (not properly opened/closed/flushed). It also allows bypassing the HDF module completely, but I think this might be too low level. I have no idea about performance comparisons; I could do a test to see if the memmap is faster. Do note that the mmap approach only works on contiguous, uncompressed data, which is no longer our default...
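For completeness, a sketch of the memmap trick on a contiguous, uncompressed dataset (the general pattern, not necessarily the exact gist):

import h5py
import numpy as np

# Write a contiguous, uncompressed dataset (the h5py default when neither
# chunks nor compression are requested).
with h5py.File('contiguous_test.hdf', 'w') as hdf:
    hdf.create_dataset('a', data=np.arange(10, dtype=np.int64))

# Ask HDF5 where the raw data starts in the file, then map it directly,
# bypassing h5py for the actual reads.
with h5py.File('contiguous_test.hdf', 'r') as hdf:
    dataset = hdf['a']
    offset = dataset.id.get_offset()
    shape = dataset.shape
    dtype = dataset.dtype

mapped = np.memmap('contiguous_test.hdf', mode='r', dtype=dtype, shape=shape, offset=offset)
print(mapped[:5])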

jalew188 commented 3 years ago
  • Numpy string arrays are only supported in object mode: hdf_file.b = np.array(['ABC', 'DEF'], dtype='object') works, while hdf_file.b = np.array(['ABC', 'DEF']) raises TypeError: No conversion path for dtype: dtype('<U3'). Maybe the error message could mention that the array should be in object mode, so people can fix it themselves?

It is weird that in pd.DataFrame we cannot save a df with pandas object dtypes, and pd.NA is a pandas object. For future extensions, does it make sense to try...except when saving the pd.Series? For example:

try:
    save(series)  # placeholder for the actual pd.Series writing code
except TypeError:
    try:
        save(series.fillna(np.nan))  # string series must not contain pd.NA
    except TypeError:
        save(series.astype('U'))  # or do something else