Closed PeterC-DLS closed 3 months ago
Thanks very much Peter for this very in-depth and thorough code review!
It's taken me a while but I've carefully gone through all of your comments and made many changes. I've just pushed the new branch (see above) and will merge these into the main branch shortly.
Here are my replies to your commetns:
hdfmap/src/hdfmap/eval_functions.py
_better name is round_stringfloats Done
naughty is too subjective - unsafe is better. I'm not sure why you want to implement your own expression evaluator when numexpr exists! Changed name. numexpr is a different package but looks very useful, need to evaluate this. I've explored numexpr and it looks very useful and potentially beneficial due to potential speed and safety enchancements it could provide. However, from what I've seen so far, what it can calculate is very restrictive and things like array operations are not allowed. Also, many numpy operations are not included (such as mean) or restricted (such as sum). I will continue investigating this and may consider adding support for numexpr in the future.
use getattr with a default Done.
doesn't seem to be worth it as Python supports expressions in f-strings This is required because I pass through the evaluation step twice - I need to load data for any dataset names in the string into the locals namespace before the final evaluation. Whist I could still use the f-string syntax, this would become cumbersome for many argmuments.
hdfmap/src/hdfmap/file_functions.py
_not sure why f = loadhdf('myfile'') is better than f = File('myfile') Future proofing, e.g. incase I want to replace all calls with swmr
IPYTHON magic may be sutiable I normally consider IPYTHON magic suitable for changes to the environment, such as installing packages or changing the directory, so I'm not sure an IPYTHON magic command to load files would be preferred.
could be folded into HdfMap's constructor Done, documentation and tests updated.
ditto for NexusMap as above
should not invent a new term address for a path (similar to a filesystem path) Changed 'address' to 'path' everywhere.
strange usage - perhaps more clearly expressed using a helper function that ensures you get a list Created file_functions.as_str_list to replace this behaviour.
hdfmap/src/hdfmap/hdfmap_class.py
method name should suggest that this gets name from path and attempts to make it safe as an attribute identifier Renamed address2name to 'generate_identifier'
use h5py.Dataset.asstr Dataset.asstr is very useful and I hadn't seen this before, however it is not appropiate here as the path is not generated from the dataset.
_is not really sufficient! See https://docs.python.org/3/reference/lexical_analysis.html#identifiers_ Replacing '.' with '' fixes all problems I've seen in hdf files so far, but I agree it isn't sufficient. However, I am not aware of any way to fully sanitise an identifier - regex doesn't capture non-ascii characters and 'str'.isidentifier() only returns true/false. For now, I've gone half way and used a regular expression to remove special characters: re.sub('\W', '', path)
use types.SimpleNamespace or argparse.Namespace I've looked at types.SimpleNamespace, argparse.Namespace and namedtuple but none of these quite offer the desired behaviour of the old scisoftpy.dictutils.DataHolder that I have tried to replicate. I have amended the names and documentation to make this clearer.
you should use the logging module then the logging level will determine any output Logging replaced with logging module.
should not be necessary with the last item Done.
simplify with get(..., 'Group') Done.
better to check first and not depend on an exception This check was not required anyway for a h5py.Group
maybe use a namedtuple This is a nice solution, thanks! I've replaced the tuples in HdfMap.groups and HdfMap.datasets with namedtuple classes.
use .items() here and elsewhere Done.
use a defaultdict to simplify code Done.
this implies you store a lot of NoClass paths unnecessarily NoClass should never have happened anyway and has been removed.
could GDA write datasets with shapes (M,N,1) for sums and monitors?
Yes it could with 2D scans etc but I'm not sure of another non-nexus way to identify detectors. Note that in nexus files the self.image_data dict is overwritten by self._image_data_from_nxdetector()
_self.imagedata[name] = address, and other clauses will not prevent overwrites where there's many paths ending in the same name Paths ending in the same name will overwrite previous names, this is unavoidable but I've made this clear in the documentation that the name points to the last path. I could also add something like 'group_name' but I'm not sure how useful this would be.
_is it worth omitting cases where localname is missing or duplicates name? This is now omitted when local_name is not there.
you could use an assignment expression := here and elsewhere This is useful, I'd not seen assignment expressions before. I have used them in a few places to reduce calls.
_iter(hdfgroup) should work Removed .keys().
needs a better name and if self.datasets was a namedtuple then could simplify Because I have replaced self.datasets with nametuples, I have removed this and calling functions as not required.
buggy for any top-level name empty hdf_paths from top-level ('/') now return '/'
inaccurate here and in next method as this will match any addresses where the given argument is a substring of their names Corrected the docstr to specify that this will find any path containing the argument. Also changed the name to find_paths and added find_names to make it clear which objects are being returned.
return type is incorrect Fixed to decode the bytes string.
could return None here and next method None is returned by default, have updated.
_get_scannablestable or make/create... Done.
hdfmap/src/hdfmap/nexus.py
should also check it's an NXentry (or even a group) I have added checks for NXentry but this is now pretty horrible. I wonder if there is a simpler way to do this?
_find_nexusdata and return signal path and its axes (signal should be first in tuple) Renamed, but I'm not clear why signal should be first. For me, it is more logical to return x, y
reuse first method Done.
should also search for first NXdata Added search for NXdata if default or 'measurement' not found.
assign to variable to use later Done. Suddenly I'm using assignment expressions everywhere!
also look for data I'm assuming that if 'signal' is an attribute then it will exist in the NXdata group.
the first method works just as quickly so you could add a strict flag to fail and make this method get the datasets This function isn't used in the rest of the code and is included for testing pruposes. Rather than adding a flag to other functions, it is easy enough to import this function as required.
is there necessary? Could add a note to the base class instead. I have removed _load_defaults from the base class and renamed the function to make it nexus specific.
_this is a little unusual to override this method and _load_defaults
- I suggest the latter method does not need to exist in base class._
Fixed above.
again check the candidates are groups and entries Done.
_same as _scannables_from_scan_fields_or_nxdata and also _image_data_fromnxdetector? These functions now replace this code (I had meant to do this already).
mispelling in comment Done, but comment has now been removed anyway.
hdfmap/src/hdfmap/reloader_class.py
Merged pull request #4. addressing all suggestions above and those in the pr. Thanks again @PeterC-DLS!
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L18 better name is
round_string_floats
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L30 naughty is too subjective - unsafe is better. I'm not sure why you want to implement your own expression evaluator when numexpr exists!
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L53 use
getattr
with a defaulthttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/eval_functions.py#L104 doesn't seem to be worth it as Python supports expressions in f-strings
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L15 not sure why
f = load_hdf('myfile'')
is better thanf = File('myfile')
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L20 IPYTHON magic may be sutiable
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L31 could be folded into
HdfMap
's constructorhttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L31 ditto for
NexusMap
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L73 should not invent a new term address for a path (similar to a filesystem path)
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/file_functions.py#L84 strange usage - perhaps more clearly expressed using a helper function that ensures you get a list
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L20 method name should suggest that this gets name from path and attempts to make it safe as an attribute identifier
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L22 use
h5py.Dataset.asstr
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L24 is not really sufficient! See https://docs.python.org/3/reference/lexical_analysis.html#identifiers
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L44 use types.SimpleNamespace or
argparse.Namespace
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L133 you should use the logging module then the logging level will determine any output
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L208 should not be necessary with the last item
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L219 simplify with
get(..., 'Group')
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L222 better to check first and not depend on an exception
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L225 maybe use a namedtuple
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L229 use
.items()
here and elsewherehttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L232 use a
defaultdict
to simplify codehttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L235 this implies you store a lot of
NoClass
paths unnecessarilyhttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L246 could GDA write datasets with shapes (M,N,1) for sums and monitors?
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L248 and other clauses will not prevent overwrites where there's many paths ending in the same name
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L286 is it worth omitting cases where
local_name
is missing or duplicatesname
?https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L314 you could use an assignment expression
:=
here and elsewherehttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L346
iter(hdf_group)
should workhttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L369 needs a better name and if
self.datasets
was a namedtuple then could simplifyhttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L391 buggy for any top-level name
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L396 inaccurate here and in next method as this will match any addresses where the given argument is a substring of their names
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L439 return type is incorrect
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L457 could return
None
here and next methodhttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/hdfmap_class.py#L527
get_scannables_table
or make/create...https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L22 should also check it's an
NXentry
(or even a group)https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L25
find_nexus_data
and return signal path and its axes (signal should be first in tuple)https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L41 reuse first method
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L44 should also search for first
NXdata
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L47 assign to variable to use later
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L52 also look for
data
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L56 the first method works just as quickly so you could add a
strict
flag to fail and make this method get the datasetshttps://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L104 is there necessary? Could add a note to the base class instead.
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L152 this is a little unusual to override this method and
_load_defaults
- I suggest the latter method does not need to exist in base class.https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L168 again check the candidates are groups and entries
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L185 same as
_scannables_from_scan_fields_or_nxdata
and also_image_data_from_nxdetector
?https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/nexus.py#L224 mispelling in comment
https://github.com/DanPorter/hdfmap/blob/4175cb29cd3dedee9a5e520ed8f30a80a980b326/src/hdfmap/reloader_class.py#L53 better signature would be
self, *name_or_address, index: slice = ()