caracal-pipeline / stimela

Stimela 2.0
GNU General Public License v2.0
5 stars 3 forks source link

Automatically parse missing as None in clickify_parameters? #329

Open landmanbester opened 2 weeks ago

landmanbester commented 2 weeks ago

When defining inputs to be parsed by clickify_parameters such as

inputs:
  foo:
    dtype: int

this will get parsed as an UNSET value. In pfb I get around this like so and I now realise you can also get around this by making parameters optional with no default (eg. Optional[int] above) but should it perhaps be the default behaviour?

landmanbester commented 2 weeks ago

A related question is how to UNSET a value from the CLI when it has a default but you want it evaluated to None. Is there a way to do that? One use case I have run into is when you want to use the FLAG column by default but ignore it some of the time (eg. spectral line imaging)

o-smirnov commented 2 weeks ago

Would flag: Optional[str] = "FLAG", and --flag None not work? Hmm, maybe it'll get parsed to a "None" string. It's a bit of a tricky problem in CLIs in general.

landmanbester commented 2 weeks ago

Yeah --flag None gets parsed as a string

o-smirnov commented 2 weeks ago

Well this issue is really in the domain of CLIs, as far as I'm concerned. So let's think of what makes sense from a CLI point of view. A --no-flag option? With a corresponding policies field of none_negates_option to tell Stimela to use this when the option is a None? That said, I'm not even sure this is something that click can express easily in the first place...

If this is not a common use case, it might be simpler to explicitly check for a "" value for the parameter inside the Python code, and treat "" as no flag column. Then you have a clean way to disable the flag column from both YaML and the CLI, no changes to the Stimela codebase necessary.

landmanbester commented 2 weeks ago

It is not that common and I have indeed gone with the latter option but that only works for string inputs. I was just wondering is there is a cleaner way to do this eg. something like --parameter None or --parameter UNSET where the string conversion gets handled in clickify_parameters. But this is a bit orthogonal to the current issue.

I think it makes sense for clickify_parameters to evaluate unset parameters to None. Surely that is the intended behaviour for parameters defined without defaults? I also noted that, and this may just be something I am doing wrong, even with the Optional[dtype] definition in my yaml files I still end up with the _UNSET_DEFAULT values getting parsed. Am I doing something wrong here? Here is sample of the defaults that get parsed for a worker if I remove the last few lines handling the conversion

<UNSET DEFAULT VALUE>
<UNSET DEFAULT VALUE>
I
<UNSET DEFAULT VALUE>
True
True
<UNSET DEFAULT VALUE>
1
<UNSET DEFAULT VALUE>
True
<UNSET DEFAULT VALUE>
<UNSET DEFAULT VALUE>
<UNSET DEFAULT VALUE>
<UNSET DEFAULT VALUE>
<UNSET DEFAULT VALUE>
False
<UNSET DEFAULT VALUE>
DATA
<UNSET DEFAULT VALUE>
<UNSET DEFAULT VALUE>
FLAG
<UNSET DEFAULT VALUE>
-1
32
double
1.0
3.0
<UNSET DEFAULT VALUE>
1
<UNSET DEFAULT VALUE>
True
False
landmanbester commented 2 weeks ago

I can get this working like so but I am not sure if it will have unintended consequences elsewhere. Also not sure if this makes sense for positional arguments. What do you think @o-smirnov ?

JSKenyon commented 2 weeks ago

Random question, but does click let you do something like --parameter1 --parameter2 value where parameter1 should take a value? I think that that currently results in a None in QC.

landmanbester commented 2 weeks ago

Nope. I get eg. Error: Option '--flag-column' requires an argument. Not sure if it works for positional arguments but not for Options

o-smirnov commented 2 weeks ago

In principle, parameters without a specified default are supposed to be caught here: https://github.com/caracal-pipeline/stimela/blob/master/scabha/schema_utils.py#L276. The idea is that the click option is then created without a default keyword, so when a CLI option is omitted and click invokes the Python function, that option is not passed to the function call at all. So I don't understand how "" comes through. Maybe the in should be a string comparison rather?

I think we should just hash out the desired logic (maybe on Tuesday) and write it down clearly so we can test against it. Here I'll just list the subtle aspects of the problem:

landmanbester commented 2 weeks ago

In principle, parameters without a specified default are supposed to be caught here: https://github.com/caracal-pipeline/stimela/blob/master/scabha/schema_utils.py#L276. The idea is that the click option is then created without a default keyword, so when a CLI option is omitted and click invokes the Python function, that option is not passed to the function call at all. So I don't understand how "" comes through. Maybe the in should be a string comparison rather?

"" doesn't come through. I have to catch it like this

https://github.com/ratt-ru/pfb-imaging/blob/ce9890ea914a7629227f5a3f6d2a4c0696491ad3/pfb/workers/init.py#L189

Did you see what I did here?

https://github.com/caracal-pipeline/stimela/blob/f5b8c3556f5a089eef30fa942e27e049b21a2de1/scabha/schema_utils.py#L277

Is that what you meant? I now just have this policy in the the config.yaml eg.

https://github.com/ratt-ru/pfb-imaging/blob/ce9890ea914a7629227f5a3f6d2a4c0696491ad3/pfb/parser/init.yaml#L137

When calling a Python function, there is a difference between "pass an explicit None value" and "pass nothing and rely on the function definition to provide its own default". When @landmanbester catches everything in the pfb workers via **kw (which I find a bit naughty, as it means the true function signature is obscured), this becomes the difference between putting a key=None in kw, or not putting the key into kw.

This was an attempt at defining parameters in a single place. I agree that it's not best practice but the alternative is a lot more work and has proven to be more error prone given the number of rapidly evolving workers I am attempting to maintain. There is an obvious downside to this viz. the function doc string is not informative and has to be viewed by running the CLI command with --help. This is an issue I need to address, open to suggestions. But I do want to avoid defining the function signature in multiple places.

o-smirnov commented 2 weeks ago

https://github.com/ratt-ru/pfb-imaging/blob/ce9890ea914a7629227f5a3f6d2a4c0696491ad3/pfb/workers/init.py#L189

Did you see what I did here?

Ah yes. I think this is the right thing to do!

This was an attempt at defining parameters in a single place. I agree that it's not best practice but the alternative is a lot more work and has proven to be more error prone given the number of rapidly evolving workers I am attempting to maintain. There is an obvious downside to this viz. the function doc string is not informative and has to be viewed by running the CLI command with --help. This is an issue I need to address, open to suggestions. But I do want to avoid defining the function signature in multiple places.

Yeah I hear you... it's a pain keeping the YaML and the Python signature in sync with this many parameters. Well, I suppose we can just proclaim the YaML cab definition to be part of the function documentation and leave it at that?

landmanbester commented 2 weeks ago

Yeah I hear you... it's a pain keeping the YaML and the Python signature in sync with this many parameters. Well, I suppose we can just proclaim the YaML cab definition to be part of the function documentation and leave it at that?

Yeah, good idea. Let me see if I can render that to the docstring without breaking something elsewhere. Thanks for the suggestion.

As yes. I think this is the right thing to do!

Actually, I think my logic was flawed there (I missed the not in). I am running into something that confuses me though. In this line

https://github.com/caracal-pipeline/stimela/blob/1c6a973605d611ae6908028ca8c588fe6de2da4b/scabha/schema_utils.py#L190

the default policies will get overwritten by whatever is in schema.policies correct? But at this stage schema.policies refers to the per parameter policies so even if default_policies has pass_missing_as_none=True it will get overwritten by the policy for that parameter, which seems to be coming in as None. Is this the intended behaviour? That would mean that the per parameter policy takes precedence even when it is None. Or am I missing something?

o-smirnov commented 2 weeks ago

Good catch, I think this genuinely a bug! The corresponding cab invocation logic has something more sophisticated than a simple merge, see here: https://github.com/caracal-pipeline/stimela/blob/master/stimela/kitchen/cab.py#L168

I guess the solution is to use that same logic here.

landmanbester commented 2 weeks ago

I had a go in the clickify_missing_as_none branch but no cigar. For some reason parameters with dtype List[URI] do not get parsed properly, will keep digging

landmanbester commented 1 week ago

So the issue was that the pass_missing_as_none policy wasn't propagated to the list and tuple validators. I added a check for this here

https://github.com/caracal-pipeline/stimela/blob/95e4f3f922296f8bafe718f253979f96af1616d9/scabha/schema_utils.py#L128

and similarly for _validate_tuple. I think this now works as expected but need to test it more thoroughly. I've also merged master and profiling-hacks into this branch as I need them for my experiments