distribution over time - Githubissues

MariellaCC commented 10 months ago

Would you have a possibility to provide a feedback on the module concerned by this pr (class CorpusDistTime / "topic_modelling.time_dist")? The goal is to generate the distribution over time for a corpus table. The output of this module will then be used in Jupyter (outside of Kiara, just for display) for visualization and display purposes. Many thanks in advance.

makkus commented 10 months ago

I'll have a look tomorrow. We haven't really talked about or documented the stuff that gets checked when you either use pre-commit, or the automated Github Actions get run. For example, for your commit, those fail:

https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/actions/runs/7252215684

The pytest ones are just because there is an example job under examples/jobs/example_job_***.yaml, which refers to the example module that is included in the template. You can just delete that yaml file (as well as the pipeline, it refers to, since that is also just an example). This should make the tests pass at least.

You could also try to add one or two of your own tests, as we've talked about in one of the issues you opened, I'm happy to help there as well if you have problems, but it'd be good to add any problems you might have there to the issue so we can consider that when documenting that part.

Also, there are some linting issues: https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/actions/runs/7252215684/job/19756206599

I'd recommend looking at those, because they are usually a good first line of defense for spotting any obvious bugs or code style issues. In a lot of cases they are easy to fix, like removing print statements. If you have a reason to keep the code you get a warning for, just use a # noqa comment at the end of the line that the linter complains about, like:

print("hello world")  # noqa

This should make the warning go away when you push the thing next. Btw. those warnings also show up in the 'files changed' tab for the pull request, which I find quite neat:

https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/pull/4/files

For your sql warnings, since you are taking some arguments in your sql query from users, this is generally considered a security issue, which is why you are getting the linting warning. It's not as bad here as it would be in a service that runs on some server, and also those are only query statements, not execute, so it should be save and you can just add # noqa tags to them. If you want the error to go away you can also fix it by using a prepared statement like in:

https://duckdb.org/docs/api/python/dbapi.html

(under Prepared Statements -- it's an easy fix, just pull out the argument variables out of your string template, replace them with ?, and then add them to when your query call (you might have to replace query with `execute, I'm not 100% sure myself about that right now though).

Anyway, that's just a general comment, as I've said I'll check the actual code tomorrow.

P.S. the mypy tests are currently disabled by default, because I am not sure if we want to make sure to have properly type-annotated plugins. I'd vote yes, but it's a bit of extra work. It's one of the best and cheapest ways to ensure correctnes of your code though (up to a point). If you want to enable this, you can uncomment the lines of this step in the ci config:

https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/blob/587f072563898e45bb9b5d3b263b085b420bbc3a/.github/workflows/build-linux.yaml#L54

makkus commented 10 months ago

Another quick comment: this is a bit hard to review, because it only includes the changes from an initial state that hasn't been reviewed itself. If you look here: https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/pull/4/files then you can see that for example the GetLCCNMetadata is compared against the current develop branch, which already contains a lot of code ( https://github.com/DHARPA-Project/kiara_plugin.topic_modelling/blob/develop/src/kiara_plugin/topic_modelling/modules/corpus_preparation.py ), so just by looking at the changes I can't really understand the class because I'm missing a lot of context. So I'll need to have a look at the full file, and can't easily comment inline unless the parts I want to comment on were actually changed.

Would it be possible to revert the develop branch back to the state where it doesn't have any code in it that you want to be reviewed, the dist_over_time branch can stay at the commit it is pointing to atm? That way, all the code would be available to comment on inline...

makkus commented 10 months ago

Actually, ignore my previous comment. I just realized you only wanted CorpusDistTime reviewed. My bad.

MariellaCC commented 10 months ago

https://duckdb.org/docs/api/python/dbapi.html

(under Prepared Statements -- it's an easy fix, just pull out the argument variables out of your string template, replace them with ?, and then add them to when your query call (you might have to replace query with `execute, I'm not 100% sure myself about that right now though).

Concerning duckdb prepared statements: I tried the prepared statement approach, but the problem is that the result contains the parameter literally instead of using the column (designated by the parameter) values:

example:

query = "SELECT EXTRACT(MONTH FROM date) AS month, EXTRACT(YEAR FROM date) AS year, $1, COUNT(*) as count FROM sources GROUP BY ALL"
val = [title_col]
con.execute(query,val)
print(con.fetchdf())

results in:

     month  year        $1  count
0        1  1918  pub_name     39
1       11  1917  pub_name     39
2        5  1919  pub_name     47
3       12  1915  pub_name     38
4       12  1916  pub_name     44
..     ...   ...       ...    ...
392     12  1923  pub_name      4
393      1  1923  pub_name      4
394     12  1925  pub_name      3
395     10  1925  pub_name      5

So I am planning on using #noqa for this, except if you have suggestions.

MariellaCC commented 10 months ago

Would it be possible to revert the develop branch back to the state where it doesn't have any code in it that you want to be reviewed, the dist_over_time branch can stay at the commit it is pointing to atm? That way, all the code would be available to comment on inline...

Yes, I had started the code for this plugin before the team discussions that led to this code review opportunity. I would like to have the previous parts of the code reviewed as well if possible. I will dm on slack to see more in details if/how we could manage this.

MariellaCC commented 10 months ago

There are a few issues I commented on. Overall, I'm not 100% sure I understand the scope of the module. This could be used in other contexts than Jupyter, right?

The module's scope is to let users obtain the data required to visually examine the data distribution before proceeding further. Visual examination of the distribution is a classic step in statistics and data analysis. In the current example, visual examination helps users decide whether or not to create a subset of the data and which boundaries to use for the subset. This could be used in contexts other than Jupyter.

MariellaCC commented 10 months ago

I guess one thing I'm usually most concerned about is that modules don't depend on any random data 'shapes', like here it seems to need a publisher id column, as well as a specific date format. Can we assume that users who want to use this module will always have that data in the right format so it will work with this module? If so, then it's fine. If not, and this only really works with some example data, then the usefulness of this module would be quite limited, and rather frustrating for users, since they see there is a module that can sort of do what they need, but they have no way to use it with their data. Does that make sense?

At the moment, I am not assuming anything since I do not know about the strategy/future plans in the Kiara onboarding plugin. As I can't use the onboarding plugin at the moment (we discussed that in previous dev meetings since the onboarding plugin is not considered stable), I organised things so that the needs of this new topic_modelling plugin are addressed, thinking that some parts would be refactored when there is more information about the onboarding plugin. So it's very difficult for me to assume anything. In the current module, I let users indicate the names of the columns they want to work with (the module doesn't assume any pre-existing column names but rather asks for the names of the columns to use). So I am not sure I understand this point.

makkus commented 10 months ago

As for onboarding plugin, I'd assume that would only ever give you a file, file_bundle, or generic table, every transformation/qualifying step would always be separate. So for your purposes, you can just assume you have a local file, file_bundle, or maybe table that you can prepare manually for development and testing purposes. So I don't really see a holdup there. Just manually download and put any type of expected input file into examples/data, and then run tests agains them. The onboarding plugin would never really be involved here, unless I am missing something.

makkus commented 10 months ago

So I am planning on using #noqa for this, except if you have suggestions.

Interesting, that almost looks like a bug in duckdb, not sure.

DHARPA-Project / kiara_plugin.topic_modelling

distribution over time #4