GispoCoding / eis_toolkit

Python library for mineral prospectivity mapping
https://eis-he.eu/
European Union Public License 1.2
19 stars 7 forks source link

Miscellaneous updates to CLI #279

Open nmaarnio opened 9 months ago

nmaarnio commented 9 months ago

I have been thinking about a few updates to the CLI:

nmaarnio commented 9 months ago

@nialov any ideas or opinions on these?

nialov commented 9 months ago

Separating commands into a few files. These files could be the same as the top-level folders of EIS Toolkit, such as raster_processing, prediction etc. The cli.py is getting very large and this could improve maintainability.

Personally I would keep all the similar command-line functions in cli.py so they are all easy to find. You could maybe add an e.g. cli_utilities.py file for example for the INPUT_FILE_OPTION and other definitions if you want to reduce the file size.

Creating command groups that would follow the categories mentioned above. This could potentially make it nicer to use the CLI directly, but perhaps make it slightly more complicated to use from the plugin. Might just be redundant modification.

I think how each function fits into the categories will not be obvious to just any user so they might have trouble finding a certain function if they do not know what category it fits into. Difficult to say personally if this would make things clearer or not.

Help texts for the CLI function parameters. Waste of time to include or no?

By this you mean the help argument in typer.Argument and typer.Option? Using this is the absolute best way to document what each command-line argument and options does for command-line users. But when used from the plugin the user obviously has no need and/or access to these. Including is not a waste of time but maybe not a priority especially as things change quite fast still.

Typer supports CLI arguments with any number of values, but Typer option allows only a fixed number of values. Some EIS tools take in N number of input rasters, and this can be handled by using a Typer list argument (while all other parameters are options). However, when a tool takes in multiple lists, we run into problems (unless I have misunderstood something). One example of such a tool is single_ilr_transform (takes in two lists of column names). What comes to my mind is that lists could be delivered as one string with a defined separator.

You can do arbitrary number of values with typer.Option:

@app.command()
def hey_there(
    random_words: Annotated[List[str], typer.Option()],
    ):
    print(len(random_words))
    for word in random_words:
        print(word)
python3 -m eis_toolkit hey-there --random-words=hey --random-words=there

It is a bit cumbersome with adding the option before each value but when done from the plugin it is obviously automated.

This one goes somewhat beyond just the CLI, but currently the progress updates (that are mainly relevant for processing algorithms in QGIS) are sent in a few batches inside the CLI function. Optimally, we could deliver updates in finer increments while the actual EIS tool is executing. Any ideas how to achieve this or other comments?

I think the progress prints (typer.echo("Progress: **%")) currently clutter the code and are not that useful. I think printing (or logging) what is the exact process being done at any point in execution is more useful. E.g.

with rasterio.open(input_raster) as raster:
    data = raster.read()  # NOTE: Overlays take in data while for example transforms rasters, consistentency?
    # typer.echo("Progress: 25%")
    typer.echo(f"Read raster from {input_raster}")
    out_image = and_overlay(data)
    out_meta = raster.meta.copy()
    out_meta["count"] = 1
typer.echo("'And' overlay completed")
typer.echo("Writing raster to {output_raster}")

Providing updates in finer increments requires there to be a known number of steps at the cli.py level which is a bit cumbersome in Python. If the number of steps are known and they are iterated through then there are progress bars maybe even in typer. What could be done is use logging to log meaningful progress steps within all the functions and these could be displayed for the user. E.g.

for i in range(0, len(bands)):
    logging.info(f"Processing band {i+1} out of {len(bands)}")
    band_array = raster.read(bands[i])

This kind of logging is quite valuable for debugging when inevitably there will be bugs to hunt.

nmaarnio commented 9 months ago

Thanks for the comments!

Personally I would keep all the similar command-line functions in cli.py so they are all easy to find. You could maybe add an e.g. cli_utilities.py file for example for the INPUT_FILE_OPTION and other definitions if you want to reduce the file size.

:+1: Yes, I like the idea of cli_utilities.py. Do you mean easy to find for developers or in what way? I think it is logical that the CLI functions would be found under similarly named categories (files) as the toolkit functions since they map one-to-one. But, I think one file is ok too.

I think how each function fits into the categories will not be obvious to just any user so they might have trouble finding a certain function if they do not know what category it fits into. Difficult to say personally if this would make things clearer or not.

Do you think we should also make some changes to the categories in EIS Toolkit? People using the toolkit in their scripts and programs will need to find the tools under some categories anyway, and plugin users will need to find the processing algorithms under some categories too. Somewhat related to this, do you agree with the proposal in issue #239 ?

By this you mean the help argument in typer.Argument and typer.Option? Using this is the absolute best way to document what each command-line argument and options does for command-line users. But when used from the plugin the user obviously has no need and/or access to these. Including is not a waste of time but maybe not a priority especially as things change quite fast still.

I agree with you that the it is the best way to document the CLI for people using it directly. I was just pondering if it makes sense to write all the help texts and keep them up-to-date as there will be changes to toolkit function parameters. But this is indeed something that can be decided and added later.

You can do arbitrary number of values with typer.Option:

Didn't know this, thanks! This will do the trick for us, as plugin usage is the main focus.

I think the progress prints (typer.echo("Progress: **%")) currently clutter the code and are not that useful. I think printing (or logging) what is the exact process being done at any point in execution is more useful. E.g.

I agree that they do not look pretty in the code, but some algorithms might take long time to execute in the plugin, and if users don't get any updates (and the progress bar stays at 0% all the time), they might think nothing is happening. Maybe the descriptive status updates could work too, or we could leave all progress communication out even if I don't like that idea.

One idea I had in mind was to add progress logging into all existing toolkit functions. There could be an environment variable that controls whether these progress updates are printed (and the env variable would be set on in the beginning of CLI functions). However, this would be a fair amount of work and would clutter the toolkit code.

nialov commented 9 months ago

👍 Yes, I like the idea of cli_utilities.py. Do you mean easy to find for developers or in what way? I think it is logical that the CLI functions would be found under similarly named categories (files) as the toolkit functions since they map one-to-one. But, I think one file is ok too.

My reasoning in keeping things in cli.py (or alternatively a cli directory if multiple files are needed) is that cli.py represents the command-line of the whole eis_toolkit. If we created separate command-line tools with widely different functionality then I would agree with creating separate cli.py files for each directory (raster_processing, etc.). But right now eis_toolkit command-line offers a single interface through which the different functions, that all pretty much have an input data -> output data workflow, are accessed. Code in e.g. raster_processing should in my opinion represent the library functionality (data processing) rather than software interface stuff.

Do you think we should also make some changes to the categories in EIS Toolkit? People using the toolkit in their scripts and programs will need to find the tools under some categories anyway, and plugin users will need to find the processing algorithms under some categories too

You have a good point that they will be anyway categorized using the directory names elsewhere. But for the command-line, if you put a function behind a typer group, you will not see the function when you run eis_toolkit --help, rather you will only see the groups. If you are searching for a single function, it might be bothersome trying to guess under which category it is (you have to run eis_toolkit raster-processing --help to see the functions in that group).

But as I said: Difficult to say personally if this would make things clearer or not. There are pros and cons. If the number of functions grows large, which it might, then I suppose it will be clearer with the groups.

One idea I had in mind was to add progress logging into all existing toolkit functions. There could be an environment variable that controls whether these progress updates are printed (and the env variable would be set on in the beginning of CLI functions). However, this would be a fair amount of work and would clutter the toolkit code.

Properly done logging code can replace commenting in code as you can, instead of writing comments, write in the log message the important details happening in the code. I am going to wager a guess that a lot of debugging will happen at some point (hopefully) for which this kind of logging is invaluable to quickly find e.g. logical errors in code.

As a side note on the principle of minimal code comments that has been to some extent enforced in eis_toolkit: I think the idea of minimal code comments is a nice idea but, as you have probably noticed, due to the wide variety in coders, styles people use the different libraries and general complexity of the subject, much of the code of eis_toolkit is quite difficult to follow. If the code needs to be revisited at some point for some reason, I think augmenting it with logging would not be a bad idea.