Open makkus opened 10 months ago
process
methodBecause import statements -- esp. for some of the scientific Python packages -- can take up a lot of time, it's best to not put import statements at the top of a Python module, but only within the process
method (or any other method of a class if applicable). This way, the import is only called if it is actually used. Otherwise, even a kiara --help
could take an extra second or two.
Imports for Python standardlib or other utility functions can be put on the top, but it'd be best for you to make an informed decision and judge whether an import would take a long time (for example, in some cases the actual import would be fast, but the module that gets imported would import something heavy).
some useful insights are available via the review of the following PR: https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/pull/4
I am trying to recap some of the points encountered via this first review (more best practices items might come over time and as more modules and reviews take place),
please correct me if needed @makkus
recommended packages for data processing are polars and duckdb (instead of pandas)
inputs need to be checked within the module with error handling, documentation of the expected input within module doc is not enough
automatic checks (like mypy) need to be resolved if they don't pass (a dedicated doc item to explain about these automatic checks may be useful (reminder added in #25 )
write unit tests when necessary (for example: when there is a statistical operation to test) process not documented yet in detail via the test-related discussion item (#19), but it will be as soon as the use case is encountered
write "end-to-end module tests": this part is currently being documented via #19
inputs need to be checked within the module with error handling, documentation of the expected input within module doc is not enough
It's always good to do that, yes. Technically, kiara will check that the data type is correct (and the value is set if it is marked 'required'), but there are a lot of cases where the generic validation that kiara can perform are not sufficient. So, in short: always make sure that you are getting valid input for your specific task at hand.
thanks, I forgot to add something that was already discussed, but I don't remember what you said about type hints: do they need to be included going forward in the context of module creation? if yes could you specify in which form (e.g. Pydantic)? if they are not there, the checks will fail is that correct?
Well, in my opinion code in kiara plugins should be type-hinted, linted, etc. Since it's re-useable code, code quality requirements should be higher than what makes sense for Jupyter notebooks for example. But that's just my opinion.
For now, I've enabled both ruff and mypy checks by default whenever code is pushed to the 'develop' & 'main' branches on Github. I can disable them if that is the consensus.
Pydantic is a dependency, and it's using Python type hints at runtime to do some validation/schema-related stuff. That's different to what we are talking about here, mypy, which does run checks on the code and makes sure that type-hinted Python code is not doing anything it is not allowed to do. In my experience, it's a fairly good failsafe against writing stupid code, and it helped make my code better for sure.
I can't really give an introduction to Python type hints and using mypy here, but there are really good resources out there, like:
Also, ChatGPT can help you with examples, errors, etc.
In the context of kiara plugins, everything should be setup, so as long as you have the dev_utils
extras installed (pixi run install-dev-env
before pixi shell
, or if using pip pip install -e '.[dev_utils]'
) you should be able to do a mypy src/<folder>
and get your code type checked.
If I remember right, I set the mypy config to 'strict', so yes, it will fail if it can't determine the variable type automatically and no type-hint is provided. In a lot of cases it can be fairly smart about it though, and figure it out itself.
Thanks,
Concerning the consensus, hints were discussed in #10. I had understood that hints were not to be made into a best practice item, but from more recent discussions on Slack, it seemed that it changed, this is also why I am asking here.
So, to recap: type hints are now recommended as a best practice and checks happen via mypy.
As I said, I can disable it if people think that is the right way to go about it. Should I?
As for the previous discussion, I think we only talked about type hints in the context of code examples in documentation, and I agree its a good idea to leave them out there.
Having them in re-usable plugin Python code is different though, and would need different considerations IMHO.
Ah yes indeed, perfect then, all is clear.
After doing some code reviews, I realized there is no information about how kiara does input validation, and what type of things a module developer should validate themselves. So I'll try to summarize the situation here a bit, feel free to add your own experiences or tell me if something is unclear/doesn't make sense.
Basically, the first step that happens is the module developer creating the inputs schema. They specify the name of the input field, the type of each input, the documentation for the field (that is not important in this context), whether it has a default value if the user doesn't provide anything, and whether the field is 'optional'. Those last two interact in a way (assuming the user didn't specify an input):
is_set
propery is True
(in code, something like: if value.is_set:
)is_set
is False
)is_set
is True
The second important thing to be aware of is that the inputs
value in the process
method is a map with the field name (string) as key, and a Value
instance as value. This Value
Python class wraps the actual data, in order to be able to record and display metadata, and to give the option to only load the actual data into memory bit by bit, or when needed. The actual (wrapped) data object can be accessed via the data
property. You can get a single input value by using:
my_table = inputs.get_value_obj('table') # or whatever the field name is
You can get the wrapped data directly by either doing:
my_table_data = inputs.get_value_data('table')
Alternatively, as I said, you can use the data
property (equivalent to above):
my_table_data = my_table.data
The type of my_table_data
depends on the implementation of the (kiara) data type, in the case of 'table' it would be a KiaraTable
instance. You can get that information using the cli:
kiara data-type explain table # or whatever data type you are interested in
Just be aware that the value is not the same as the value data.
All this has (hopefully fairly straightforward) implications on what the module developer needs to test. I'm going to split up the cases depending on what is specified in the schema for a field (in all cases, if there is a value, the developer can assume it is the correct type, so no need to check that):
The developer can assume that the field is set, otherwise kiara would have already thrown an exception.
The developer can assume that the field is set, because if no user input happened, then kiara would have just used the default.
Here, the developer needs to do the check themselves, something like:
table = inputs.get_value('table')
if table.is_set:
column_names = table.data.column_names
...
else:
column_names = []
...
(what exactly happens depends on the module of course, this is just to demonstrate)
If you have an optional field with no default, you should in most cases feel the need to have a conditional like this, if you don't have one, ask yourself if you chose the right schema. In all other cases, no is_set
check is needed, and you can always assume the field is set one way or another.
Sidenote: if is_set
returns False
, that means that the value.data
property will return None
, which should make it clear why you need that test, since None
will never have the methods you'd like to access on your actual data.
This also means that you can always use an override type hint like:
table_data: KiaraTable = my_table.data # type: ignore
Because we already established that the data
property won't be None
.
There is a second validation step after this basic one where kiara can take over some of the work: validating that the value (which we assume is set from here on out) is in a form that makes sense for the module.
For example if you expect a table to have a specific column, you need to check that it is available, and raise an Exception if it is not, ideally with a helpful error message so the user knows what exactly is expected. Or if you have a string, you might need to test if it's an empty one. Stuff like that, which is not really different to what you have to do in 'normal' Python code anyway.
Kiara can't test this, because it doesn't know what it should test for. There is a possibility to specify the data type in more detail so that kiara can take over some of that validation, so keep that in mind and let me know if you ever think this would make your code significantly more streamlined. For now, I don't want to overcomplicate things, doing it like this is in most cases sufficient.
Thanks for the additional info about data types. I had indeed misinterpreted the previous comment when it was said:
So, in short: always make sure that you are getting valid input for your specific task at hand
Concerning type hints, do we need to specify them when we know them from the module's input specification?
In that case you only know the 'kiara data type', not the Python one. And type hints only relate to Python ones. This is where the critical thing happens:
table = value.data
(obviously this would apply to every other data type in the same way)
If you look up the data
property of the Value
class, you'll see that I marked it as Any
, because the data can be of any type, depending on what is configured in create_inputs_schemas
, and the type checker has no way of knowing what that is. So ideally, you'll help it by giving it a hint what you expect the value to be:
table: KiaraTable = value.data
From now on the type checker assumes it's an instance of KiaraTable
, and if you call a method it doesn't recognize from this class, it will complain. So, yes, in short, it's good practice I think. Compare that with this case:
table_value = inputs.get_value_obj('table')
Here, the type checker knows for sure that table_value
has the Python type Value
, because that is the return type of get_value_obj
, so you don't need to annotate table_value: Value
, since the type checker already knows. Doesn't hurt to do it, but it's not necessary. I still sometimes do it, just to make myself aware of what Python type I'm dealing with.
Thanks, I am not sure to understand how helpful it is to add a type hint that specifies that the value is a KiaraTable, since a KiaraTable can be a variety of things. Wouldn't it be more helpful to specify what kind of table is needed (polars/arrow etc) instead, since this is what gives information about what is expected down the road in a given module? (but doing so would clash with a recommendation in the latest code review comments)
Sorry for the misunderstanding, it's specifically the Python class KiaraTable
that I linked somewhere above. It is always that (if you specifiy table
in create_inputs_schema
), it can't be a variety of things. And from the KiaraTable
you can sorta 'export' a variety of different Python classes that all represent tabular data in their own way, as a convenience. Type hinting / the type checker is only ever concerned about actual Python classes, it does not know anything about kiara or its concepts. Only about the classes that are used in its code.
Basically, the data
attribute of a Value
instance will always be KiaraTable
if you spefieid table
. It would be a bool
if you specified boolean
, or KiaraArray
if you specify array
, there is always a one-to-one mapping.
You can see the available data-types with kiara data-type list
, and more importantly, which actual Python class the data
attribute would be with kiara data-type explain table
(or kiara data-type explain boolean
, or kiara data-type explain file_bundle
, etc. (the Value class
entry in that output).
Ok, I see, Then probably, there would be simply a two-step process in terms of type hints, first the KiaraTable type hint (which is already self-explicit from the input type doc, so hint optional) and then the specific table format used in the rest of the module's code.
table: KiaraTable = value.data
table_polars_format: pl.DataFrame = table.to_polars_dataframe()
Yes, it's a two step process, but it's the other way around to what you wrote. Maybe think of it from the perspective of the type checker. That's a 'pure Python' thing, with no knowledge of kiara, so it has no idea that the table
you specify in create_inputs_schema
will result in a KiaraTable
Python class. There is just no way it could, right?
The only way the type checker can know what type a variable is supposed to be is either by:
The latter is when you as a user doesn't have to worry, because whoever wrote the method you are using did the work for you (by specifying the output type). Some libraries don't have type hints, in which case you have to do the work manually still. Or the result type is of type Any
or a Union
(Optional
is also technically a Union
). This is where as a user you can either decide to:
Optional
, something like if var is None: raise Exception()
. Or you do an isinstance
, or whatever els makes sense in the specific circumstance you are in (the type checker is smart enough to disregard the None
of an Optional/Union[None, OTHER_TYPE]
after such a test, and assumes it deals with the other type. Same if you do an isinstance
.Long story short, because the data
attribute is typed as Any
, and the to_polars_dataframe
as pl.DataFrame
, this would be the minimum you could get away with in terms of manually specifying the type hints:
table: KiaraTable = value.data
table_polars_format = table.to_polars_dataframe()
(for all practical purposes, this is equivalent to what you had)
If you didn't specify the KiaraTable
hint, the type checker could not be sure that the to_polars_dataframe
method is available, since Any
doesn't have that.
Alright, yes, thanks for clarifying, type hinting is optional in the second line of the code example and not the first since the type is clear from .to_polars_dataframe()
.
All is clear I think now (we'll see in the next review round :) ). Thanks!
(I thought that it was the other way around, assuming the input documentation specified the type and thought this made it unnecessary for type hinting / redundant with type indicated in input doc)
Yeah, fair enough, don't blame you to get confused, lots of types of different types floating around :)
Since there is no other place for this yet, I'll use this issue to add all the best practices (and other tips and tricks) I can think of for developing kiara modules here. This can then be either massaged in its own doc page, or maybe a FAQ-type page.