Best practices when writing a kiara module

makkus commented 8 months ago

Since there is no other place for this yet, I'll use this issue to add all the best practices (and other tips and tricks) I can think of for developing kiara modules here. This can then be either massaged in its own doc page, or maybe a FAQ-type page.

makkus commented 8 months ago

use imports only within the `process` method

Because import statements -- esp. for some of the scientific Python packages -- can take up a lot of time, it's best to not put import statements at the top of a Python module, but only within the process method (or any other method of a class if applicable). This way, the import is only called if it is actually used. Otherwise, even a kiara --help could take an extra second or two.

Imports for Python standardlib or other utility functions can be put on the top, but it'd be best for you to make an informed decision and judge whether an import would take a long time (for example, in some cases the actual import would be fast, but the module that gets imported would import something heavy).

MariellaCC commented 7 months ago

some useful insights are available via the review of the following PR: https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/pull/4

I am trying to recap some of the points encountered via this first review (more best practices items might come over time and as more modules and reviews take place),

please correct me if needed @makkus

recommended packages for data processing are polars and duckdb (instead of pandas)
inputs need to be checked within the module with error handling, documentation of the expected input within module doc is not enough
automatic checks (like mypy) need to be resolved if they don't pass (a dedicated doc item to explain about these automatic checks may be useful (reminder added in #25 )
write unit tests when necessary (for example: when there is a statistical operation to test) process not documented yet in detail via the test-related discussion item (#19), but it will be as soon as the use case is encountered
write "end-to-end module tests": this part is currently being documented via #19

makkus commented 7 months ago

inputs need to be checked within the module with error handling, documentation of the expected input within module doc is not enough

It's always good to do that, yes. Technically, kiara will check that the data type is correct (and the value is set if it is marked 'required'), but there are a lot of cases where the generic validation that kiara can perform are not sufficient. So, in short: always make sure that you are getting valid input for your specific task at hand.

MariellaCC commented 7 months ago

thanks, I forgot to add something that was already discussed, but I don't remember what you said about type hints: do they need to be included going forward in the context of module creation? if yes could you specify in which form (e.g. Pydantic)? if they are not there, the checks will fail is that correct?

makkus commented 7 months ago

Well, in my opinion code in kiara plugins should be type-hinted, linted, etc. Since it's re-useable code, code quality requirements should be higher than what makes sense for Jupyter notebooks for example. But that's just my opinion.

For now, I've enabled both ruff and mypy checks by default whenever code is pushed to the 'develop' & 'main' branches on Github. I can disable them if that is the consensus.

Pydantic is a dependency, and it's using Python type hints at runtime to do some validation/schema-related stuff. That's different to what we are talking about here, mypy, which does run checks on the code and makes sure that type-hinted Python code is not doing anything it is not allowed to do. In my experience, it's a fairly good failsafe against writing stupid code, and it helped make my code better for sure.

I can't really give an introduction to Python type hints and using mypy here, but there are really good resources out there, like:

Also, ChatGPT can help you with examples, errors, etc.

In the context of kiara plugins, everything should be setup, so as long as you have the dev_utils extras installed (pixi run install-dev-env before pixi shell, or if using pip pip install -e '.[dev_utils]') you should be able to do a mypy src/<folder> and get your code type checked.

If I remember right, I set the mypy config to 'strict', so yes, it will fail if it can't determine the variable type automatically and no type-hint is provided. In a lot of cases it can be fairly smart about it though, and figure it out itself.

MariellaCC commented 7 months ago

Thanks,

Concerning the consensus, hints were discussed in #10. I had understood that hints were not to be made into a best practice item, but from more recent discussions on Slack, it seemed that it changed, this is also why I am asking here.

So, to recap: type hints are now recommended as a best practice and checks happen via mypy.

makkus commented 7 months ago

As I said, I can disable it if people think that is the right way to go about it. Should I?

makkus commented 7 months ago

As for the previous discussion, I think we only talked about type hints in the context of code examples in documentation, and I agree its a good idea to leave them out there.

Having them in re-usable plugin Python code is different though, and would need different considerations IMHO.

MariellaCC commented 7 months ago

Ah yes indeed, perfect then, all is clear.

makkus commented 7 months ago

After doing some code reviews, I realized there is no information about how kiara does input validation, and what type of things a module developer should validate themselves. So I'll try to summarize the situation here a bit, feel free to add your own experiences or tell me if something is unclear/doesn't make sense.

Basically, the first step that happens is the module developer creating the inputs schema. They specify the name of the input field, the type of each input, the documentation for the field (that is not important in this context), whether it has a default value if the user doesn't provide anything, and whether the field is 'optional'. Those last two interact in a way (assuming the user didn't specify an input):

if the field is not marked as 'optional' (the default if not specified in the field schema), but a default is set, the input value that arrives in the module 'process' method will have a value (the default), and the is_set propery is True (in code, something like: if value.is_set: )
if a field is marked as 'optional' and doesn't have a default, and the user doesn't provide something, the input value will be 'not_set' (is_set is False)
if a field is marked as 'not optional', and a default is set, and no user input, the default is used (same as with before) and is_set is True
if a field is marked as 'not optional', and no default is set, and no user input, then kiara will raise an error before the 'process' method is even reached, since it's something it can verify independent of what happens within the process method

The second important thing to be aware of is that the inputs value in the process method is a map with the field name (string) as key, and a Value instance as value. This Value Python class wraps the actual data, in order to be able to record and display metadata, and to give the option to only load the actual data into memory bit by bit, or when needed. The actual (wrapped) data object can be accessed via the data property. You can get a single input value by using:

my_table = inputs.get_value_obj('table') # or whatever the field name is

You can get the wrapped data directly by either doing:

my_table_data = inputs.get_value_data('table')

Alternatively, as I said, you can use the data property (equivalent to above):

my_table_data = my_table.data

The type of my_table_data depends on the implementation of the (kiara) data type, in the case of 'table' it would be a KiaraTable instance. You can get that information using the cli:

kiara data-type explain table # or whatever data type you are interested in

Just be aware that the value is not the same as the value data.

All this has (hopefully fairly straightforward) implications on what the module developer needs to test. I'm going to split up the cases depending on what is specified in the schema for a field (in all cases, if there is a value, the developer can assume it is the correct type, so no need to check that):

field is not optional, default is set or not

The developer can assume that the field is set, otherwise kiara would have already thrown an exception.

field is optional, default is set

The developer can assume that the field is set, because if no user input happened, then kiara would have just used the default.

field is optional, no default

Here, the developer needs to do the check themselves, something like:

table = inputs.get_value('table')
if table.is_set:
   column_names = table.data.column_names
   ...
else:
   column_names = []
   ...

(what exactly happens depends on the module of course, this is just to demonstrate) If you have an optional field with no default, you should in most cases feel the need to have a conditional like this, if you don't have one, ask yourself if you chose the right schema. In all other cases, no is_set check is needed, and you can always assume the field is set one way or another.

Sidenote: if is_set returns False, that means that the value.data property will return None, which should make it clear why you need that test, since None will never have the methods you'd like to access on your actual data. This also means that you can always use an override type hint like:

table_data: KiaraTable = my_table.data  # type: ignore

Because we already established that the data property won't be None.

There is a second validation step after this basic one where kiara can take over some of the work: validating that the value (which we assume is set from here on out) is in a form that makes sense for the module.

For example if you expect a table to have a specific column, you need to check that it is available, and raise an Exception if it is not, ideally with a helpful error message so the user knows what exactly is expected. Or if you have a string, you might need to test if it's an empty one. Stuff like that, which is not really different to what you have to do in 'normal' Python code anyway.

Kiara can't test this, because it doesn't know what it should test for. There is a possibility to specify the data type in more detail so that kiara can take over some of that validation, so keep that in mind and let me know if you ever think this would make your code significantly more streamlined. For now, I don't want to overcomplicate things, doing it like this is in most cases sufficient.

MariellaCC commented 6 months ago

Thanks for the additional info about data types. I had indeed misinterpreted the previous comment when it was said:

So, in short: always make sure that you are getting valid input for your specific task at hand

Concerning type hints, do we need to specify them when we know them from the module's input specification?

makkus commented 6 months ago

In that case you only know the 'kiara data type', not the Python one. And type hints only relate to Python ones. This is where the critical thing happens:

table = value.data

(obviously this would apply to every other data type in the same way)

If you look up the data property of the Value class, you'll see that I marked it as Any, because the data can be of any type, depending on what is configured in create_inputs_schemas, and the type checker has no way of knowing what that is. So ideally, you'll help it by giving it a hint what you expect the value to be:

table: KiaraTable = value.data

From now on the type checker assumes it's an instance of KiaraTable, and if you call a method it doesn't recognize from this class, it will complain. So, yes, in short, it's good practice I think. Compare that with this case:

table_value = inputs.get_value_obj('table')

Here, the type checker knows for sure that table_value has the Python type Value, because that is the return type of get_value_obj, so you don't need to annotate table_value: Value, since the type checker already knows. Doesn't hurt to do it, but it's not necessary. I still sometimes do it, just to make myself aware of what Python type I'm dealing with.

MariellaCC commented 6 months ago

Thanks, I am not sure to understand how helpful it is to add a type hint that specifies that the value is a KiaraTable, since a KiaraTable can be a variety of things. Wouldn't it be more helpful to specify what kind of table is needed (polars/arrow etc) instead, since this is what gives information about what is expected down the road in a given module? (but doing so would clash with a recommendation in the latest code review comments)

makkus commented 6 months ago

Sorry for the misunderstanding, it's specifically the Python class KiaraTable that I linked somewhere above. It is always that (if you specifiy table in create_inputs_schema), it can't be a variety of things. And from the KiaraTable you can sorta 'export' a variety of different Python classes that all represent tabular data in their own way, as a convenience. Type hinting / the type checker is only ever concerned about actual Python classes, it does not know anything about kiara or its concepts. Only about the classes that are used in its code.

makkus commented 6 months ago

Basically, the data attribute of a Value instance will always be KiaraTable if you spefieid table. It would be a bool if you specified boolean, or KiaraArray if you specify array, there is always a one-to-one mapping.

You can see the available data-types with kiara data-type list, and more importantly, which actual Python class the data attribute would be with kiara data-type explain table (or kiara data-type explain boolean, or kiara data-type explain file_bundle, etc. (the Value class entry in that output).

MariellaCC commented 6 months ago

Ok, I see, Then probably, there would be simply a two-step process in terms of type hints, first the KiaraTable type hint (which is already self-explicit from the input type doc, so hint optional) and then the specific table format used in the rest of the module's code.

table: KiaraTable = value.data
table_polars_format: pl.DataFrame = table.to_polars_dataframe()

makkus commented 6 months ago

Yes, it's a two step process, but it's the other way around to what you wrote. Maybe think of it from the perspective of the type checker. That's a 'pure Python' thing, with no knowledge of kiara, so it has no idea that the table you specify in create_inputs_schema will result in a KiaraTable Python class. There is just no way it could, right?

The only way the type checker can know what type a variable is supposed to be is either by:

the user specifying it manually, in which case the type checker just trusts that statement
by looking at the result type of the method that produced the value the variable is assigned to

The latter is when you as a user doesn't have to worry, because whoever wrote the method you are using did the work for you (by specifying the output type). Some libraries don't have type hints, in which case you have to do the work manually still. Or the result type is of type Any or a Union (Optional is also technically a Union). This is where as a user you can either decide to:

specify the type hint manually, in which case, again the type checker trusts you and from that point onwards assumes it knows the type
write some code to test the type of the variable, this is most commonly done for Optional, something like if var is None: raise Exception(). Or you do an isinstance, or whatever els makes sense in the specific circumstance you are in (the type checker is smart enough to disregard the None of an Optional/Union[None, OTHER_TYPE] after such a test, and assumes it deals with the other type. Same if you do an isinstance.

Long story short, because the data attribute is typed as Any, and the to_polars_dataframe as pl.DataFrame, this would be the minimum you could get away with in terms of manually specifying the type hints:

table: KiaraTable = value.data
table_polars_format = table.to_polars_dataframe()

(for all practical purposes, this is equivalent to what you had)

If you didn't specify the KiaraTable hint, the type checker could not be sure that the to_polars_dataframe method is available, since Any doesn't have that.

MariellaCC commented 6 months ago

Alright, yes, thanks for clarifying, type hinting is optional in the second line of the code example and not the first since the type is clear from .to_polars_dataframe(). All is clear I think now (we'll see in the next review round :) ). Thanks!

MariellaCC commented 6 months ago

(I thought that it was the other way around, assuming the input documentation specified the type and thought this made it unnecessary for type hinting / redundant with type indicated in input doc)

makkus commented 6 months ago

Yeah, fair enough, don't blame you to get confused, lots of types of different types floating around :)

DHARPA-Project / kiara-website