DiamondLightSource / hdfmap

Map objects within an HDF file and create a dataset namespace
Apache License 2.0
2 stars 0 forks source link

In-depth review #3

Closed PeterC-DLS closed 3 months ago

PeterC-DLS commented 4 months ago

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L18 better name is round_string_floats

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L30 naughty is too subjective - unsafe is better. I'm not sure why you want to implement your own expression evaluator when numexpr exists!

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L53 use getattr with a default

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L104 doesn't seem to be worth it as Python supports expressions in f-strings

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L15 not sure why f = load_hdf('myfile'') is better than f = File('myfile')

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L20 IPYTHON magic may be sutiable

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L31 could be folded into HdfMap's constructor

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L31 ditto for NexusMap

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L73 should not invent a new term address for a path (similar to a filesystem path)

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L84 strange usage - perhaps more clearly expressed using a helper function that ensures you get a list

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L20 method name should suggest that this gets name from path and attempts to make it safe as an attribute identifier

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L22 use h5py.Dataset.asstr

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L24 is not really sufficient! See https://docs.python.org/3/reference/lexical_analysis.html#identifiers

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L44 use types.SimpleNamespace or argparse.Namespace

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L133 you should use the logging module then the logging level will determine any output

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L208 should not be necessary with the last item

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L219 simplify with get(..., 'Group')

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L222 better to check first and not depend on an exception

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L225 maybe use a namedtuple

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L229 use .items() here and elsewhere

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L232 use a defaultdict to simplify code

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L235 this implies you store a lot of NoClass paths unnecessarily

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L246 could GDA write datasets with shapes (M,N,1) for sums and monitors?

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L248 and other clauses will not prevent overwrites where there's many paths ending in the same name

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L286 is it worth omitting cases where local_name is missing or duplicates name?

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L314 you could use an assignment expression := here and elsewhere

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L346 iter(hdf_group) should work

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L369 needs a better name and if self.datasets was a namedtuple then could simplify

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L391 buggy for any top-level name

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L396 inaccurate here and in next method as this will match any addresses where the given argument is a substring of their names

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L439 return type is incorrect

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L457 could return None here and next method

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L527 get_scannables_table or make/create...

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L22 should also check it's an NXentry (or even a group)

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L25 find_nexus_data and return signal path and its axes (signal should be first in tuple)

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L41 reuse first method

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L44 should also search for first NXdata

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L47 assign to variable to use later

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L52 also look for data

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L56 the first method works just as quickly so you could add a strict flag to fail and make this method get the datasets

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L104 is there necessary? Could add a note to the base class instead.

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L152 this is a little unusual to override this method and _load_defaults - I suggest the latter method does not need to exist in base class.

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L168 again check the candidates are groups and entries

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L185 same as _scannables_from_scan_fields_or_nxdata and also _image_data_from_nxdetector?

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L224 mispelling in comment

https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/reloader_class.py#L53 better signature would be self, *name_or_address, index: slice = ()

DanPorter commented 4 months ago

Thanks very much Peter for this very in-depth and thorough code review!

It's taken me a while but I've carefully gone through all of your comments and made many changes. I've just pushed the new branch (see above) and will merge these into the main branch shortly.

Here are my replies to your commetns:

hdfmap/src/hdfmap/eval_functions.py

  1. _better name is round_stringfloats Done

  2. naughty is too subjective - unsafe is better. I'm not sure why you want to implement your own expression evaluator when numexpr exists! Changed name. numexpr is a different package but looks very useful, need to evaluate this. I've explored numexpr and it looks very useful and potentially beneficial due to potential speed and safety enchancements it could provide. However, from what I've seen so far, what it can calculate is very restrictive and things like array operations are not allowed. Also, many numpy operations are not included (such as mean) or restricted (such as sum). I will continue investigating this and may consider adding support for numexpr in the future.

  3. use getattr with a default Done.

  4. doesn't seem to be worth it as Python supports expressions in f-strings This is required because I pass through the evaluation step twice - I need to load data for any dataset names in the string into the locals namespace before the final evaluation. Whist I could still use the f-string syntax, this would become cumbersome for many argmuments.

hdfmap/src/hdfmap/file_functions.py

  1. _not sure why f = loadhdf('myfile'') is better than f = File('myfile') Future proofing, e.g. incase I want to replace all calls with swmr

  2. IPYTHON magic may be sutiable I normally consider IPYTHON magic suitable for changes to the environment, such as installing packages or changing the directory, so I'm not sure an IPYTHON magic command to load files would be preferred.

  3. could be folded into HdfMap's constructor Done, documentation and tests updated.

  4. ditto for NexusMap as above

  5. should not invent a new term address for a path (similar to a filesystem path) Changed 'address' to 'path' everywhere.

  6. strange usage - perhaps more clearly expressed using a helper function that ensures you get a list Created file_functions.as_str_list to replace this behaviour.

hdfmap/src/hdfmap/hdfmap_class.py

  1. method name should suggest that this gets name from path and attempts to make it safe as an attribute identifier Renamed address2name to 'generate_identifier'

  2. use h5py.Dataset.asstr Dataset.asstr is very useful and I hadn't seen this before, however it is not appropiate here as the path is not generated from the dataset.

  3. _is not really sufficient! See https://docs.python.org/3/reference/lexical_analysis.html#identifiers_ Replacing '.' with '' fixes all problems I've seen in hdf files so far, but I agree it isn't sufficient. However, I am not aware of any way to fully sanitise an identifier - regex doesn't capture non-ascii characters and 'str'.isidentifier() only returns true/false. For now, I've gone half way and used a regular expression to remove special characters: re.sub('\W', '', path)

  4. use types.SimpleNamespace or argparse.Namespace I've looked at types.SimpleNamespace, argparse.Namespace and namedtuple but none of these quite offer the desired behaviour of the old scisoftpy.dictutils.DataHolder that I have tried to replicate. I have amended the names and documentation to make this clearer.

  5. you should use the logging module then the logging level will determine any output Logging replaced with logging module.

  6. should not be necessary with the last item Done.

  7. simplify with get(..., 'Group') Done.

  8. better to check first and not depend on an exception This check was not required anyway for a h5py.Group

  9. maybe use a namedtuple This is a nice solution, thanks! I've replaced the tuples in HdfMap.groups and HdfMap.datasets with namedtuple classes.

  10. use .items() here and elsewhere Done.

  11. use a defaultdict to simplify code Done.

  12. this implies you store a lot of NoClass paths unnecessarily NoClass should never have happened anyway and has been removed.

  13. could GDA write datasets with shapes (M,N,1) for sums and monitors? Yes it could with 2D scans etc but I'm not sure of another non-nexus way to identify detectors. Note that in nexus files the self.image_data dict is overwritten by self._image_data_from_nxdetector()

  14. _self.imagedata[name] = address, and other clauses will not prevent overwrites where there's many paths ending in the same name Paths ending in the same name will overwrite previous names, this is unavoidable but I've made this clear in the documentation that the name points to the last path. I could also add something like 'group_name' but I'm not sure how useful this would be.

  15. _is it worth omitting cases where localname is missing or duplicates name? This is now omitted when local_name is not there.

  16. you could use an assignment expression := here and elsewhere This is useful, I'd not seen assignment expressions before. I have used them in a few places to reduce calls.

  17. _iter(hdfgroup) should work Removed .keys().

  18. needs a better name and if self.datasets was a namedtuple then could simplify Because I have replaced self.datasets with nametuples, I have removed this and calling functions as not required.

  19. buggy for any top-level name empty hdf_paths from top-level ('/') now return '/'

  20. inaccurate here and in next method as this will match any addresses where the given argument is a substring of their names Corrected the docstr to specify that this will find any path containing the argument. Also changed the name to find_paths and added find_names to make it clear which objects are being returned.

  21. return type is incorrect Fixed to decode the bytes string.

  22. could return None here and next method None is returned by default, have updated.

  23. _get_scannablestable or make/create... Done.

hdfmap/src/hdfmap/nexus.py

  1. should also check it's an NXentry (or even a group) I have added checks for NXentry but this is now pretty horrible. I wonder if there is a simpler way to do this?

  2. _find_nexusdata and return signal path and its axes (signal should be first in tuple) Renamed, but I'm not clear why signal should be first. For me, it is more logical to return x, y

  3. reuse first method Done.

  4. should also search for first NXdata Added search for NXdata if default or 'measurement' not found.

  5. assign to variable to use later Done. Suddenly I'm using assignment expressions everywhere!

  6. also look for data I'm assuming that if 'signal' is an attribute then it will exist in the NXdata group.

  7. the first method works just as quickly so you could add a strict flag to fail and make this method get the datasets This function isn't used in the rest of the code and is included for testing pruposes. Rather than adding a flag to other functions, it is easy enough to import this function as required.

  8. is there necessary? Could add a note to the base class instead. I have removed _load_defaults from the base class and renamed the function to make it nexus specific.

  9. _this is a little unusual to override this method and _load_defaults - I suggest the latter method does not need to exist in base class._ Fixed above.

  10. again check the candidates are groups and entries Done.

  11. _same as _scannables_from_scan_fields_or_nxdata and also _image_data_fromnxdetector? These functions now replace this code (I had meant to do this already).

  12. mispelling in comment Done, but comment has now been removed anyway.

hdfmap/src/hdfmap/reloader_class.py

  1. _better signature would be self, *name_oraddress, index: slice = () Done.
DanPorter commented 3 months ago

Merged pull request #4. addressing all suggestions above and those in the pr. Thanks again @PeterC-DLS!