HERMES-SOC / hermes_core

A central Python Package for common functionality across all hermes instruments
Other
2 stars 6 forks source link

HermesData.save functionality #119

Open kbromund opened 5 months ago

kbromund commented 5 months ago

the HermesData.save method has two issues: output_path is a misnomer. It is actually specifies the output_directory, not the output filename. Should there be a mechanism for explicitly setting the output file name, rather than generating one automatically? When running in the mode where a pathname is generated automatically,
overwrite=False should increment the Z-version number. e.g. if version 1.0.0 exists, it should create a file with version 1.0.1.
the documentation should make it clear that overwrite=True should only be used in development, not in production. In production, new files should always have new version numbers.

Alrobbertz commented 4 months ago

Hi Ken - I'm going to treat this as two mini-issues and will likely generate 2 different PRs for each of the aspects.

kbromund commented 4 months ago

Hi Aaron, I am ok with splitting them up, but I would still like to discuss each of your suggested implementations.

mini-issue 1)

I'm not sure we really need this 'not recommended' option for output_path.

I was confused, because there is ambiguity in the existing code. The l0_sci_data_to_cdf function contains code to build a path name, even though the path name is generated by the HermesData save function. I'm now thinking we should discard the former, and keep the latter.

Given the design of the hermes_core, I think it makes sense for the save function to always determine the file name based on the content of the HermesData structure. In terms of ISTP metadata, the filename is linked to the Logical_file_id , which is built up from Logical_source, plus a timestamp, plus the Data_version. Because ISTP requirements are enforced in the HermesData structure before allowing a call to save , the filename is uniquely determined. It shouldn't be up to the user to build their own.

By the way, I do see a use case for allowing the instrument software to specify the time stamp (rather than determining it automatically based on the content of the TimeSeries). But that's yet another discussion, and in any case, this should be explicitly set within the HermesData structure before calling save.

For testing, the developer doesn't need to specify a filename, but only a test directory.

As I said originally, I recommend renaming output_path to output_directory, and enforce that it points to a valid directory.

mini-issue 2) I don't think ['major', 'minor', 'bugfix'] are the right categories, because we are talking about data files, not code. I suggest that we follow the MMS SOC's paradigm for data file versioning, where the categories are more like: 'algorithm version', 'calibration version', and 'file version'. On MMS, they are simply referred to as ['X', 'Y', 'Z'], because the exact meaning can be instrument-specific. The guidelines are: X-version is the 'software version' or 'algorithm version' and would generally be hardcoded in the calibration software that creates the data file, and generally implies a significant re-processing. In particular, we should strongly recommend that version X software generates version X files. Y-version is the 'calibration version', and would normally be set to correspond to the version of the calibration file (or however the team feels it is appropriate to specify the calibration version).
Z-version is the 'file version', which should be incremented any time a file with the same Logical_source, time stamp, X and Y version gets re-created. For example, Z will be incremented when re-creating a day-long file when a more complete set of L0 data becomes available.

With these guidelines in mind, I don't see a use case for the increment_version switch. One would not increment X or Y based on what is in the file system. It should be up to the instrument software to set X and Y as appropriate within HermesData before calling save. It's only the Z version that depends on what files have already been ingested by the SOC, which makes it natural to leave it undetermined until calling save. I think we should discuss this in person...

Alrobbertz commented 4 months ago

@kbromund for mini-issue (1) I created PR #121 which adds some more documentation on what that path parameter means. Do you think this is sufficient, or do you feel strongly that this parameter should be renamed to output_directory?

kbromund commented 4 months ago

Our messages crossed. I was writing my comments while you were already working on PR#121. I think we should hold off on PR#121 until we have had a chance to discuss my comments above.

Alrobbertz commented 4 months ago

Hi @kbromund - I just want to go over what we talked about today with this issue so I don't forget or loose track.

  1. Rename the output_path parameter to output_directory for the HermesData.save() function and check that it is a valid directory.

    • This will require making a number of changes in the instrument-specific packages as well as within hermes_core. Would you like me to make these changes before you submit a PR for the nemisis processing code or shall I wait for that PR to complete so we have a better idea of how we can make this more clear?
    • I'll close #121 since this is not what we were aiming for.
  2. The Data_version parameter, that is used for file name generation, is set as a string "X.Y.Z" that can be set through the global metadata attributes. Do you think this interface provides enough control, or would you still like functionality in the HermesData.save() function that will automatically increment the Data_version "Z" value if a file with the given name already exists?

  3. I'd like to move the discussion of the time formatting to a new issue since that may be more complex. But, in general, we might want to create a new parameter for the HermesData.global_attribute_template() function that is something like time_format where the instrument teams can override the level of granularity/specificity in the file name's start time depending on the true start time of the 24hour data interval.

Please let me know if I'm missing any details on the first two issues. Thank you again for the great feedback!