Closed: sterlingbaldwin closed this 3 years ago
I think your plan sounds reasonable. I was a little worried about conflicts yesterday, but having thought more about it I don't think there should be issues.
Ok, great. A question: since the warehouse and publication "roots" are supplied by a startup-config file, these values will not be available to code in "util.py" or related common-location modules (assuming that is where utilities like "atomic_statusfile()" would live). This means that such utilities will need to be passed these values as parameters, yes?
Passing them via parameters is probably the best way to do it.
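To illustrate the pattern being agreed on here, a minimal sketch: the common utility takes the root as an argument, and the caller (which loaded the startup config) supplies it. The function name, path layout, and ".status" filename are all illustrative, not the project's actual conventions.

```python
import os

# Hypothetical helper for util.py: it never reads the startup config itself;
# the caller passes in whichever root it loaded at startup.
def statusfile_path(dataset_id, warehouse_root):
    """Build a status-file path from a caller-supplied root (layout assumed)."""
    return os.path.join(warehouse_root, dataset_id, '.status')

# The calling process, which owns the config values, passes them along:
print(statusfile_path('my.dataset.v1', '/p/warehouse'))
```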
I wrote a short driver and successfully tested "consolidate_statusfile_to()", and consequently added it to the warehouse "util.py" (with a git push to branch "move_to_publication_update").
Before I modify "move_to_publication" to employ it, I'd like to consider the preferred logging that should be applied.
see https://github.com/E3SM-Project/esgfpub/pull/49
As for logging, you should just have to do `import logging` at the top of the file, and then use `logging.info(msg)` for info messages, or `logging.error(msg)` for error messages.
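Put together, that usage looks like the following. The `basicConfig` level and format are one possible setup, not something the project mandates, and the message strings are made up for illustration.

```python
import logging

# One possible configuration; level and format are choices, not requirements.
logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(levelname)s %(message)s')

logging.info('status file moved to publication')   # info-level message
logging.error('no status file found for dataset')  # error-level message
```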
OK - but when a function intends to return a "path" on success, should it necessarily return '' (empty) on error? Or should it communicate more to the calling process?
Potentially, one calling process may need to abort/exit on the error, while another (an assessment tool) might not. (Maybe I haven't thought this through, but I'll go ahead and try the logging.)
Here is a question: The process in question (script = move_to_publication.py) is called by the "job" "movetopublication.py", and it is the latter that is in possession of the dataset object, which I believe holds the location of the status file. Therefore:
re (1) above: Or, does every access of the status file seek the location afresh?
Sorry, just saw your comment (you can tag people like @TonyB9000 to alert them). When the dataset opens the file to update the status list, it does so in "append" mode. I think source -> dest
is the best way to do it instead of using special keywords like "warehouse" or "w".
@sterlingbaldwin Heh Sorry about that. Thing is:
1. I was asking if every write to update a status file re-seeks the location of the file (warehouse, publication, etc.) or simply caches the location on start up, and
2. All of that is moot, since all status files are now located solely in staging/status/
A separate problem encountered: because we do not "pause" when writing (appending) multiple entries to a status file, multiple entries may have the same timestamp . . . so a sorted file may no longer have the last entry as "last". (I used a sort generically as a requirement for merging two status files.)
For the future, we either need (say) ms (YYYYMMDD_hhmmss.ms), or else each writer must sleep a second between writes.
I will need to examine the status files to ensure that the correct last-entry is last. But more generally, it is brittle to rely upon the last entry as "the last I am interested in". We should employ "last_for_this_purpose", or "last_regarding_topic"...
I've mentioned before, "Append Only" is not a guarantee that "Last Only" will give you what you expect. You may yet require reading in the entries and processing multiple lines.
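The "last_for_this_purpose" idea above might look something like this: scan every entry and keep the sorted-last one matching the topic of interest, rather than trusting the physical last line. The colon-delimited line format is assumed for illustration.

```python
# Hypothetical helper: filter by topic first, then take the latest entry.
def last_for_topic(lines, topic):
    """Return the latest (sorted-last) status line mentioning `topic`, or None."""
    matches = [ln for ln in lines if topic in ln]
    return sorted(matches)[-1] if matches else None

entries = [
    '20210301_120000:EXTRACTION:Complete',
    '20210301_120000:VALIDATION:Begin',
    '20210302_093000:VALIDATION:Complete',
]
print(last_for_topic(entries, 'VALIDATION'))
# → 20210302_093000:VALIDATION:Complete
```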
When the status file watcher triggers an update, the dataset in question re-loads and time-sorts the contents of its status file. I still need to do the modifications to switch over to using a single status file directory.
@sterlingbaldwin I will need to walk through the code to ensure that "staging/status/" is the path to each status file, now having the name format
We Both ... :)
Thanks for closing that - need a github dashboard.
My initial plan involves updating “move_to_publication” to include the move of the status file, and consolidation of status files if two status files exist. "Consolidation" is simply the sorted union of the lines of the two files.
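A minimal sketch of consolidation as described: the sorted union of the lines of the two files, written over the destination. This is just the shape of the idea; the actual "consolidate_statusfile_to" may differ in signature and error handling.

```python
# Sketch: merge two status files as a sorted union of their lines.
def consolidate_statusfile_to(src_path, dst_path):
    with open(src_path) as f:
        src_lines = set(f.read().splitlines())
    with open(dst_path) as f:
        dst_lines = set(f.read().splitlines())
    merged = sorted(src_lines | dst_lines)
    with open(dst_path, 'w') as f:
        f.write('\n'.join(merged) + '\n')
    return dst_path
```

Note that the union removes exact duplicate lines; entries that differ only in timestamp are both kept.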
In addition, for added robustness, I want to implement an "atomic_statusfile" function that returns the full path to a status file, no matter where (warehouse or publication) it may be located, and that throws an error unless exactly ONE status file is located. This could be used for either access or existence-testing.
These functionalities "assume" there exist default "warehouse" and "publication" root locations, typically provided in a given warehouse-config at startup.
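A hedged sketch of that "atomic_statusfile" behavior, with the roots passed in as parameters per the earlier exchange. The path layout, ".status" filename, and exception type are all assumptions for illustration.

```python
import os

# Hypothetical: return the single existing status file for a dataset,
# searching both caller-supplied roots; raise if zero or multiple are found.
def atomic_statusfile(dataset_id, warehouse_root, publication_root):
    candidates = [
        p for p in (
            os.path.join(warehouse_root, dataset_id, '.status'),
            os.path.join(publication_root, dataset_id, '.status'),
        )
        if os.path.exists(p)
    ]
    if len(candidates) != 1:
        raise RuntimeError(
            f'expected exactly one status file for {dataset_id}, '
            f'found {len(candidates)}')
    return candidates[0]
```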
I want to write a “consolidate_statusfile_to” function:
But as convenient as this would be, it might lead to clobbering something. You mentioned the following scenario:
You expressed a reticence to add these “extraction” and “validation” records to the publication status file. So I am trying to work through what “bad things” might therefore happen.
Could you provide an extended scenario where adding these new status values describing the dataset would cause a negative issue to develop? If not, I will code up "consolidate_statusfile_to" as described above. Otherwise, I will add a parameter that mitigates the behavior (once we decide what mitigation is desired).