It's definitely a surprising result - one which caused quite a bit of confusion this past week.
I think the reason we don't overwrite pass-through datasets is to avoid repeatedly writing the same (necessarily identical) pass-through dataset from each single-year file during processing. In other words, we indiscriminately call the `_copy_dset` function on every dataset of every single-year generation file, and expect it not to do any extra work on those identical pass-through datasets.
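For anyone following along, here's a minimal sketch of that skip-if-existing behavior using h5py directly; the function name and signature are illustrative only and are not the actual `_copy_dset` implementation:

```python
import h5py


def copy_dset_if_missing(src_path, dst_path, dset_name):
    """Copy a dataset from a single-year file into the multi-year file,
    skipping the copy if the multi-year file already has it (e.g. an
    identical pass-through dataset written from a previous year's file).
    """
    with h5py.File(src_path, "r") as src, h5py.File(dst_path, "a") as dst:
        if dset_name in dst:
            # Pass-through dataset already written by an earlier
            # single-year file; the data is identical, so skip it.
            return
        dset = dst.create_dataset(dset_name, data=src[dset_name][...])
        dset.attrs.update(src[dset_name].attrs)
```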
Now that I think about it, this may actually cause problems with the `dsets` inputs as well if re-running the multi-year step with an existing multi-year gen file. But perhaps the simplest solution to this problem is to ask the user to delete the file before re-running?
Curious to hear your thoughts @grantbuster
I guess that or automatically purge any existing multi-year file at the beginning of processing. But I wonder if purging user data would open its own can of worms...
Patch coverage: 100.00% and project coverage change: +0.05% :tada:

Comparison is base (10e39c5) 86.98% compared to head (0cdc9a9) 87.04%.

:umbrella: View full report in Codecov by Sentry.
My preference would be to purge any target output file if it already exists... imagine if jobs failed while halfway done collecting data. You would get garbage results without warning because the dataset would have been initialized but not finished writing. Probably a dramatic and rare example, but possible. For collect and multi-year I think this makes sense. For Gen I think we probably overwrite datasets regardless because we init first then write later?
Yes, I tend to agree. Generation just overwrites the files completely, so purging the MY file before processing makes things more consistent. I've updated the code accordingly.
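For reference, the purge-before-processing behavior would look roughly like this; the function name, path handling, and warning text here are a sketch, not the exact code in this PR:

```python
import logging
import os

logger = logging.getLogger(__name__)


def purge_existing_output(out_path):
    """Delete a pre-existing multi-year output file before collection so
    a partially written file from a failed earlier run can't be mistaken
    for valid output."""
    if os.path.exists(out_path):
        logger.warning("Found existing output file %s; purging it before "
                       "running multi-year collection.", out_path)
        os.remove(out_path)
```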
Should we check collect while we're at it?
The collect logic is handled in GAPs, so we'd have to make changes there. Looks like you guys already thought about this and added a "clobber" option. However, it looks like this was set to `False` by default and not exposed to the user for some reason.

I'll go ahead and expose it in the CLI and update the default value to `True`, thereby giving the user full control while also maintaining reasonable default behavior.
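Something along these lines, assuming a click-based CLI; the option name and wiring below are just a sketch, not the actual GAPs change:

```python
import os

import click


@click.command()
@click.option("--out-path", required=True, type=click.Path(),
              help="Path to the collected output file.")
@click.option("--clobber/--no-clobber", default=True, show_default=True,
              help="Purge an existing output file before collecting.")
def collect(out_path, clobber):
    """Collect single-year outputs into one file (illustrative only)."""
    if clobber and os.path.exists(out_path):
        os.remove(out_path)
    # ... run the actual collection step here ...
```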
I guess we can add a "clobber" option with a default `True` value to MY collect for consistency. What do you think?
I'm indifferent. The MY module is really meant for scalar means and should run instantly, so clobber should always be `True`. But my desire for uniformity says having a clobber option would be nice.
If you really are indifferent, I'll probably go ahead and add it. A default `True` option will purge files (without any extra user intervention) in all cases we can think of at the moment where files should be clobbered. But as soon as an exception to this rule inevitably shows up, the power will already be in the user's hands :)
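Roughly what I have in mind for MY collect, as a sketch; the function name and arguments are illustrative only:

```python
import os


def my_collect(out_path, source_files, dsets, clobber=True):
    """Collect datasets from single-year files into a multi-year file.

    With the default ``clobber=True``, any existing multi-year file is
    purged up front; pass ``clobber=False`` to keep it if an exception
    to the purge-by-default rule ever comes up."""
    if clobber and os.path.exists(out_path):
        os.remove(out_path)
    # ... copy each dataset in `dsets` from `source_files` into out_path ...
```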
Yep sounds good to me; I am truly indifferent.
Sounds good. Added that in
@grantbuster any other requests or am I good to merge?
@grantbuster Can you let me know if the warning text makes sense? If not, what would be a good way to rephrase it?