Quasars / orange-spectroscopy

What file format for Hyperspectral image is the best to use for export to Orange? #656

Closed pisarik closed 1 year ago

pisarik commented 1 year ago

I would like to export hyperspectral images from a database to Orange. The images are stored in an internal format, so I first need to assemble a hyperspectral image, then save it and open it in Orange. From the io module, I saw that an ASCII format is supported, but those are text files and they are not well suited to storing images, since they will double the size of the images, and a laptop's memory is precious :)
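To put a rough number on the size concern, here is a minimal standard-library sketch that compares the same values written as tab-separated text and as raw 32-bit floats. The function name and the `%.6e` text formatting are assumptions chosen for illustration; the real ratio depends on how many digits the text export writes.

```python
import random
from array import array


def text_vs_binary_size(n_values=10_000, seed=0):
    """Return (text_bytes, binary_bytes) for the same list of values."""
    rng = random.Random(seed)
    values = [rng.random() for _ in range(n_values)]

    # Text: one value per field, scientific notation, tab separated.
    text = "\t".join(f"{v:.6e}" for v in values).encode("ascii")

    # Binary: single-precision floats, 4 bytes per value.
    binary = array("f", values).tobytes()

    return len(text), len(binary)


if __name__ == "__main__":
    text_bytes, binary_bytes = text_vs_binary_size()
    print(f"text: {text_bytes} B, binary: {binary_bytes} B, "
          f"ratio: {text_bytes / binary_bytes:.1f}x")
```

With this formatting the text version comes out more than twice as large; fewer printed digits would narrow the gap, at the cost of precision.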

Do you know of an open, simple, binary format for hyperspectral images? Or should we implement one that is ASCII-like but binary? It would also be nice to add a description of the format for export to Orange to the documentation.

borondics commented 1 year ago

I think you could try the PTIR Studio format. The files are HDF5 inside; they just have the .ptir extension...

pisarik commented 1 year ago

Thanks, @borondics! HDF5 is perfect. Do you know whether they have an I/O library or a specification for the format? I opened PTIRFileReader for a minute and the file structure does not appear to be so straightforward, I mean all the keys and requirements.

borondics commented 1 year ago

I don't think there is a description. However, if you check the test files here you will see how to structure it. If you make a writer function it could be nice to push that code to Orange Spectroscopy too...

pisarik commented 1 year ago

OK, thanks! I will share the writer for .ptir if I end up writing it :)

Would it be nice to have a simple binary format for passing any table to Orange? For example, a flat HDF5 file in which each (key, value) pair holds a column's name and its values (a 1D array)? Or maybe a more specific one for the add-on, with the keys spectra (2D array) and wavenumbers (1D array), and the remaining keys holding column values.
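A rough sketch of what the second layout could look like, assuming h5py and NumPy are installed. The key names `spectra` and `wavenumbers` come from the proposal above; the function names and the `map_x`/`map_y` columns in the usage note are hypothetical and are not an existing Orange Spectroscopy format or reader.

```python
import h5py
import numpy as np


def write_hyperspectral(path, spectra, wavenumbers, **columns):
    """Write a flat HDF5 file with one dataset per key (hypothetical layout).

    spectra:     2D array, shape (n_pixels, n_wavenumbers)
    wavenumbers: 1D array, shape (n_wavenumbers,)
    columns:     extra numeric 1D arrays, one value per pixel
    """
    spectra = np.asarray(spectra, dtype=np.float32)
    wavenumbers = np.asarray(wavenumbers, dtype=np.float64)
    if spectra.ndim != 2 or wavenumbers.ndim != 1:
        raise ValueError("spectra must be 2D and wavenumbers 1D")
    if spectra.shape[1] != wavenumbers.shape[0]:
        raise ValueError("spectra columns must match wavenumbers")

    with h5py.File(path, "w") as f:
        f.create_dataset("spectra", data=spectra)
        f.create_dataset("wavenumbers", data=wavenumbers)
        for name, values in columns.items():
            values = np.asarray(values)
            if values.shape != (spectra.shape[0],):
                raise ValueError(f"column {name!r} needs one value per pixel")
            f.create_dataset(name, data=values)


def read_hyperspectral(path):
    """Read every top-level dataset back into a dict of NumPy arrays."""
    with h5py.File(path, "r") as f:
        return {name: f[name][()] for name in f}
```

For example, `write_hyperspectral("image.h5", spectra, wavenumbers, map_x=xs, map_y=ys)` would store the pixel coordinates next to the spectra. Text-valued columns would need an explicit string dtype, which this sketch leaves out.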

stuart-cls commented 1 year ago

Thanks for raising this, I think about this problem often! Let's keep this issue open until a satisfactory solution for hyperspectral data is found.

Regarding a more general binary representation of Orange data Tables, I suggest looking at @markotoplak's work on an "Orange On-disk Format" in HDF5, although I think it's not stabilized yet.

AlexHenderson commented 1 year ago

Hi, HDF5 is a good choice, as is Zarr, if you plan to stay on Python. Both of these are chunked and have a range of compression features built in. Therefore they are great for out-of-core processing, since you only read the piece of the data you require. They are Dask compatible too, for parallel reads/writes.
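As an illustration of the chunking point, here is a minimal h5py sketch that stores an image cube in compressed row blocks and later reads back only a few rows. The dataset name `cube`, the function names and the chunk shape are assumptions made for this example.

```python
import h5py
import numpy as np


def write_chunked_cube(path, cube, rows_per_chunk=16):
    """Store a (rows, cols, bands) cube in gzip-compressed row blocks."""
    cube = np.asarray(cube, dtype=np.float32)
    n_rows, n_cols, n_bands = cube.shape
    # One chunk covers a block of full image rows with all bands.
    chunks = (min(rows_per_chunk, n_rows), n_cols, n_bands)
    with h5py.File(path, "w") as f:
        f.create_dataset(
            "cube",
            data=cube,
            chunks=chunks,
            compression="gzip",
            compression_opts=4,
        )


def read_rows(path, start, stop):
    """Read only image rows [start, stop); other chunks stay on disk."""
    with h5py.File(path, "r") as f:
        return f["cube"][start:stop]
```

Slicing the open dataset makes HDF5 read and decompress only the chunks that overlap the requested rows, so the whole cube never has to fit in memory. A chunk shape aligned with the usual access pattern (whole rows here, or single spectra) tends to matter more than the compression level.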

The Photothermal PTIR format is (I believe) designed to be compatible with USID, so https://pycroscopy.github.io/pyUSID/ would be a good place to start.

I suggest not trying to export to another vendor’s format, since if they change anything you need to keep in step with them, and it may break your structure.

I'm also interested in TileDB Embedded, which looks proprietary, but the file format is open source. I've been told that the TileDB format can stream Apache Arrow format, which is very interesting. TileDB is cross platform like HDF5, but Zarr is really only available for Python right now.

https://zarr.readthedocs.io/en/stable/ https://tiledb.com/products/tiledb-embedded

borondics commented 1 year ago

We could also think about NeXus. It is based on HDF5 and can hold lots of metadata. https://github.com/nexusformat

pisarik commented 1 year ago

Thank you @borondics and @AlexHenderson for the suggestions! I will then try out the pure HDF5, TileDB Embedded and NeXus formats for storing hyperspectral images and report my findings here.

It would also be interesting to see @markotoplak's work on the "Orange On-disk Format" in HDF5, but I cannot find it. @stuart-cls, could you please send a link to it?

markotoplak commented 1 year ago

@pisarik, check the dask branch on the main orange3 repository.