bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License
Wrong extension grabbed from file name #174

Closed prjemian closed 2 years ago

prjemian commented 2 years ago

For the first two of these files, the file extension was not split correctly from the full file name (the third file's extension was simply not recognized):

/.conda/envs/db2.0/lib/python3.9/site-packages/tiled/adapters/files.py:664: UserWarning: The file at /projects/punx/punx/data/33837rear_1D_1.75_16.5_NXcanSAS_v3.h5 has a file extension .75_16.5_NXcanSAS_v3.h5 this is not recognized. The file will be skipped, pass in a mimetype for this file extension via the parameter DirectoryAdapter.from_directory(..., mimetypes_by_file_ext={...}) and pass in a Reader than handles this mimetype via the parameter DirectoryAdapter.from_directory(..., readers_by_mimetype={...}).
/.conda/envs/db2.0/lib/python3.9/site-packages/tiled/adapters/files.py:664: UserWarning: The file at /projects/punx/punx/data/prj_test.nexus.hdf5 has a file extension .nexus.hdf5 this is not recognized. The file will be skipped, pass in a mimetype for this file extension via the parameter DirectoryAdapter.from_directory(..., mimetypes_by_file_ext={...}) and pass in a Reader than handles this mimetype via the parameter DirectoryAdapter.from_directory(..., readers_by_mimetype={...}).
/.conda/envs/db2.0/lib/python3.9/site-packages/tiled/adapters/files.py:664: UserWarning: The file at /projects/punx/punx/data/verysimple.nx5 has a file extension .nx5 this is not recognized. The file will be skipped, pass in a mimetype for this file extension via the parameter DirectoryAdapter.from_directory(..., mimetypes_by_file_ext={...}) and pass in a Reader than handles this mimetype via the parameter DirectoryAdapter.from_directory(..., readers_by_mimetype={...}).

Two of the affected files have several . characters in the file name:

33837rear_1D_1.75_16.5_NXcanSAS_v3.h5
prj_test.nexus.hdf5

The third one has an extension that is not recognized:

verysimple.nx5

Warning is reported by this code: https://github.com/bluesky/tiled/blob/3d269d444311b217af9925b935fdd7ab3773a56c/tiled/adapters/files.py#L664

This line defines ext: https://github.com/bluesky/tiled/blob/3d269d444311b217af9925b935fdd7ab3773a56c/tiled/adapters/files.py#L647

prjemian commented 2 years ago

Given the first example:

In [7]: path = pathlib.Path("/projects/punx/punx/data/33837rear_1D_1.75_16.5_NXcanSAS_v3.h5")

In [8]: path.suffixes
Out[8]: ['.75_16', '.5_NXcanSAS_v3', '.h5']

then this line is the point of failure in the algorithm: https://github.com/bluesky/tiled/blob/3d269d444311b217af9925b935fdd7ab3773a56c/tiled/adapters/files.py#L647
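The extension strings quoted in the warnings are what you get by joining every entry of `path.suffixes`. The sketch below reproduces them for the three file names. It assumes, without confirming against the linked source line, that `ext` is computed roughly this way; `joined_suffixes` is a hypothetical helper name used only for illustration.

```python
import pathlib


def joined_suffixes(filename):
    """Hypothetical helper: treat everything after the first "." as the extension."""
    # PurePosixPath only parses the string; it never touches the file system.
    return "".join(pathlib.PurePosixPath(filename).suffixes)


names = [
    "33837rear_1D_1.75_16.5_NXcanSAS_v3.h5",
    "prj_test.nexus.hdf5",
    "verysimple.nx5",
]

for name in names:
    # Show the extension that would be looked up for each file name.
    print(f"{name} -> {joined_suffixes(name)}")
```

For the first two names, the result matches the unrecognized extensions in the warnings (`.75_16.5_NXcanSAS_v3.h5` and `.nexus.hdf5`), even though the last suffix alone (`.h5`, `.hdf5`) would have been enough to identify the format.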

The assumption here is that an ext is everything after the first . (to support multi-. extensions such as .tar.gz). We can't control how files are named when we serve file directories.
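One way to keep support for multi-part extensions such as .tar.gz while tolerating extra dots in the file name is to try the candidate extensions from longest to shortest and accept the first one that is known. This is a minimal sketch of that idea, not necessarily how tiled handles it: `KNOWN_EXTENSIONS` and `resolve_extension` are hypothetical names, and the mimetype strings are illustrative placeholders.

```python
import pathlib

# Hypothetical lookup table for illustration only; not tiled's actual mapping.
KNOWN_EXTENSIONS = {
    ".h5": "application/x-hdf5",
    ".hdf5": "application/x-hdf5",
    ".tar.gz": "application/gzip",
}


def resolve_extension(filename, known=KNOWN_EXTENSIONS):
    """Return the longest trailing run of suffixes found in `known`, else None."""
    suffixes = pathlib.PurePosixPath(filename).suffixes
    # Drop leading suffixes one at a time, so ".tar.gz" is tried before ".gz"
    # and ".75_16.5_NXcanSAS_v3.h5" is tried before ".h5".
    for start in range(len(suffixes)):
        candidate = "".join(suffixes[start:]).lower()
        if candidate in known:
            return candidate
    return None


for name in [
    "33837rear_1D_1.75_16.5_NXcanSAS_v3.h5",
    "prj_test.nexus.hdf5",
    "archive.tar.gz",
    "verysimple.nx5",
]:
    print(f"{name} -> {resolve_extension(name)}")
```

With this table, the first two files resolve to `.h5` and `.hdf5`, `archive.tar.gz` still resolves to `.tar.gz`, and `verysimple.nx5` resolves to `None` because `.nx5` is not in the table. Such a file would still need its extension registered explicitly.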

danielballan commented 2 years ago

The .h5 file extension is now handled correctly out of the box.

The other, more exotic suffixes mentioned above can be handled by a new feature implemented in v0.1.0a67 and documented at https://blueskyproject.io/tiled/how-to/read-custom-formats.html.