E3SM-Project / datasm

A data state machine for automating complex nested workflows for handling E3SM outputs.
MIT License

Move status file to publication directories when data files are moved #48

Closed. sterlingbaldwin closed this issue 3 years ago.

TonyB9000 commented 3 years ago

My initial plan is to update "move_to_publication" to move the status file as well, and to consolidate status files if two exist. "Consolidation" is simply the sorted union of the lines of the two files.

In addition, for added robustness, I want to implement an "atomic_statusfile" function that returns the full path to a status file, wherever it may be located (warehouse or publication), and throws an error if more or fewer than exactly ONE status file is found. This could be used for either access or existence-testing.

These functions assume that default "warehouse" and "publication" root locations exist, typically provided by a warehouse config at startup.
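
To illustrate the intent, here is a rough sketch. The "root/dataset_id/.status" layout, the signature, and the names are assumptions for illustration, not the actual warehouse code:

```python
from pathlib import Path

def atomic_statusfile(dataset_id, warehouse_root, publication_root):
    """Return the unique status file path for dataset_id, wherever it lives.

    Raises if more or fewer than exactly ONE status file is found, so the
    caller can rely on the result for either access or existence-testing.
    """
    candidates = [Path(r) / dataset_id / '.status'   # hypothetical layout
                  for r in (warehouse_root, publication_root)]
    found = [p for p in candidates if p.exists()]
    if len(found) != 1:
        raise RuntimeError(
            f'expected exactly 1 status file for {dataset_id}, found {len(found)}')
    return found[0]
```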

I want to write a "consolidate_statusfile_to" function (a sketch follows the list below):

if no status file is found, it does nothing, but could return an error (that the calling code could choose to ignore).
if only one status file is found and it is already in the desired location, it returns successfully.
if only one status file is found and it is NOT in the desired location, it is moved there.
if two status files are found, they are sort/merged into the desired location and removed from any other location.
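
For concreteness, here is a minimal sketch of that logic. As above, the "root/dataset_id/.status" layout and helper names are illustrative assumptions, not the actual warehouse code:

```python
import shutil
from pathlib import Path

def consolidate_statusfile_to(dest_root, warehouse_root, publication_root, dataset_id):
    """Leave exactly one status file for dataset_id, under dest_root.

    Returns the path to the consolidated file, or None if no status file
    was found (callers may treat None as an error, or ignore it).
    """
    candidates = [Path(r) / dataset_id / '.status'   # hypothetical layout
                  for r in (warehouse_root, publication_root)]
    found = [p for p in candidates if p.exists()]
    if not found:
        return None                                  # case 1: nothing found

    target = Path(dest_root) / dataset_id / '.status'
    target.parent.mkdir(parents=True, exist_ok=True)

    if len(found) == 1:
        if found[0] != target:                       # case 3: move into place
            shutil.move(str(found[0]), str(target))
        return target                                # case 2: already there

    # case 4: sorted union of both files' lines, written to the target
    lines = set()
    for p in found:
        lines.update(p.read_text().splitlines())
    target.write_text('\n'.join(sorted(lines)) + '\n')
    for p in found:
        if p != target:
            p.unlink()
    return target
```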

But as convenient as this would be, it might lead to clobbering something. You mentioned the following scenario:

  1. A given dataset has been fully and successfully published.
  2. For some reason, the raw archive dataset is (re)extracted to the warehouse, where (perhaps new) validation procedures are being applied. The results of this new testing might indicate a corrected (re)publication is warranted.

You expressed reluctance to add these "extraction" and "validation" records to the publication status file, so I am trying to work through what "bad things" might happen as a result.

Could you provide an extended scenario where adding these new status values describing the dataset would cause a problem to develop? If not, I will code up "consolidate_statusfile_to" as described above. Otherwise, suggest a parameter that mitigates the behavior (and what mitigation is desired).

sterlingbaldwin commented 3 years ago

I think your plan sounds reasonable. I was a little worried about conflicts yesterday, but having thought more about it, I don't think there should be issues.

TonyB9000 commented 3 years ago

Ok, great. A question: since the warehouse and publication "roots" are supplied by a startup config file, these values will not be available to code in "util.py" or related common locations (assuming that is where code like "atomic_statusfile()" would live). This means such utilities will need to be passed these values as parameters, yes?

sterlingbaldwin commented 3 years ago

Passing them via parameters is probably the best way to do it.
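
For instance, assuming a YAML config and the "atomic_statusfile" sketch from earlier in the thread (the config file name, key names, and dataset ID are hypothetical):

```python
import yaml  # assuming PyYAML and a YAML config; adjust to the real format

with open('warehouse_config.yaml') as f:   # hypothetical file name
    config = yaml.safe_load(f)

# The utility never reads the config itself; the caller threads the
# roots through as plain parameters.
status_path = atomic_statusfile(
    'v2.LR.historical_0101',                   # hypothetical dataset ID
    warehouse_root=config['warehouse_root'],   # assumed key names
    publication_root=config['publication_root'],
)
```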

TonyB9000 commented 3 years ago

I wrote a short driver and successfully tested "consolidate_statusfile_to()", and have added it to the warehouse "util.py" (pushed to branch "move_to_publication_update").

Before I modify "move_to_publication" to employ it, I'd like to consider the preferred logging that should be applied.

sterlingbaldwin commented 3 years ago

see https://github.com/E3SM-Project/esgfpub/pull/49

As for logging, you should just have to do

import logging

at the top of the file, and then use logging.info(msg) for info messages, or logging.error(msg) for error messages.
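
For example (the function and messages are illustrative only):

```python
import logging

def move_statusfile(src, dest):
    logging.info('moving status file %s -> %s', src, dest)
    try:
        pass  # ... the actual move goes here ...
    except OSError as e:
        logging.error('status file move failed: %s', e)
```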

TonyB9000 commented 3 years ago

OK - but when a function intends to return a "path" on success, should it necessarily return '' (empty) on error? Or should it communicate more to the calling process?

TonyB9000 commented 3 years ago

Potentially, one calling process may need to abort/exit on the error, while another (an assessment tool) might not. (Maybe I haven't thought this through, but I'll go ahead and try the logging.)
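
One pattern that would leave the decision to the caller (a sketch only, not a settled convention for this repo): log the error and return None, so a strict caller can abort while an assessment tool carries on. Layout, roots, and dataset ID below are hypothetical:

```python
from pathlib import Path
import logging
import sys

def find_statusfile(dataset_id, roots):
    """Return the status file path, or None if none is found (error is logged)."""
    for root in roots:
        candidate = Path(root) / dataset_id / '.status'  # hypothetical layout
        if candidate.exists():
            return candidate
    logging.error('no status file found for %s', dataset_id)
    return None

# A strict caller aborts on None; an assessment tool could note it and continue.
if find_statusfile('v2.LR.historical_0101', ['/p/warehouse', '/p/publication']) is None:
    sys.exit(1)
```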

TonyB9000 commented 3 years ago

Here is a question: The process in question (script = move_to_publication.py) is called by the "job" "movetopublication.py", and it is the latter that is in possession of the dataset object, which I believe holds the location of the status file. Therefore:

  1. Will the subordinate script "break" the accessibility of the status file by moving it?
  2. Since "move_to_publication" is already in possession of the source and dest file locations (whose parents are the assumed locations for mapfiles and statusfiles), should I employ a simpler "consolidate" that assumes you always want to consolidate in the direction source -> dest? I wrote it so that you could specify "to warehouse" or "to publication", which is redundant for the "move_to_publication" case, and requires being passed the specific warehouse and publication roots, plus the datasetID, in order to construct the required paths. If I write it more simply, it will only work in conjunction with this "move" operation. Maybe I should write it both ways? Easy enough...

TonyB9000 commented 3 years ago

re (1) above: Or, does every access of the status file seek the location afresh?

sterlingbaldwin commented 3 years ago

Sorry, just saw your comment (you can tag people like @TonyB9000 to alert them). When the dataset opens the file to update the status list, it does so in "append" mode. I think source -> dest is the best way to do it, instead of using special keywords like "warehouse" or "w".
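
Something like this simpler shape, perhaps (a sketch; the assumption that the ".status" file sits directly in each directory is illustrative):

```python
import shutil
from pathlib import Path

def consolidate_statusfile(src_dir, dest_dir):
    """Merge or move the status file from src_dir into dest_dir."""
    src = Path(src_dir) / '.status'
    dest = Path(dest_dir) / '.status'
    if not src.exists():
        return dest if dest.exists() else None
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        shutil.move(str(src), str(dest))   # simple move, nothing to merge
        return dest
    # both exist: sorted union of lines, then drop the source copy
    merged = sorted(set(src.read_text().splitlines())
                    | set(dest.read_text().splitlines()))
    dest.write_text('\n'.join(merged) + '\n')
    src.unlink()
    return dest
```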

TonyB9000 commented 3 years ago

@sterlingbaldwin Heh, sorry about that. Thing is:

A separate problem encountered: because we do not "pause" when writing (appending) multiple entries to a status file, multiple entries may have the same timestamp, so a sorted file may no longer have the last entry "last". (I used a sort generically as a requirement for merging two status files.)

For the future, we either need (say) millisecond resolution (YYYYMMDD_hhmmss.ms), or else each writer must sleep a second between writes.
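
For example, a millisecond-resolution stamp is easy to produce in Python:

```python
from datetime import datetime

# %f yields microseconds; keeping the first three digits gives milliseconds,
# making same-second collisions far less likely (though still possible).
ts = datetime.now().strftime('%Y%m%d_%H%M%S.%f')[:-3]
print(ts)  # e.g. 20210614_153045.123
```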

I will need to examine the status files to ensure that the correct last-entry is last. But more generally, it is brittle to rely upon the last entry as "the last I am interested in". We should employ "last_for_this_purpose", or "last_regarding_topic"...

I've mentioned before that "Append Only" is not a guarantee that "Last Only" will give you what you expect. You may yet need to read in the entries and process multiple lines.

sterlingbaldwin commented 3 years ago

When the status file watcher triggers an update, the dataset in question re-loads and time-sorts the contents of its status file. I still need to do the modifications to switch over to using a single status file directory.

TonyB9000 commented 3 years ago

@sterlingbaldwin (I will need to walk through the code to ensure that "staging/status/" is the path to each status file, now that each has the name format .status.)

TonyB9000 commented 3 years ago

We Both ... :)

TonyB9000 commented 3 years ago

Thanks for closing that - I need a GitHub dashboard.