aml4td / website

Website sources for Applied Machine Learning for Tabular Data
https://aml4td.org/

python computing supplement #24

Open bmreiniger opened 6 months ago

bmreiniger commented 6 months ago

I'd be interested in helping with a python computing supplement.

Did you have a format in mind? It seems likely that after the setup section, most sections could be tightly coupled between the R and python versions, which suggests maybe having two independent repositories isn't ideal? I think Quarto supports panelsets (as "tabsets"); that strikes me as a nice way to display the two, but also would mean both codes should be updated when a change is made.

One other thing that would be nice to decide on early: which python plotting library to use? plotnine mimics ggplot, matplotlib is already used by sklearn+pandas, others are slicker...

topepo commented 6 months ago

We would make a computing-python repo and keep the same css and organization. The owner of that repo would have to decide between Jupyter Notebooks and a more markdown-based approach, whichever is the more Pythonic.

For libraries... ¯\_(ツ)_/¯ I'd like to avoid extra complexity but would defer to the python community for those decisions.

mermast commented 6 months ago

I would also like to work on the python supplement. @bmreiniger may we collaborate on this?

ddixonAI commented 6 months ago

I'll also toss in my hat for a collaboration on sklearn/python code. Could be a fun project!

lcrmorin commented 6 months ago

I'd like to work on this too. I have a decent knowledge of python for ML (Kaggle notebooks GM). As already mentioned, it is difficult to imagine working without pandas, sklearn and matplotlib. If plotnine is mentioned as a replacement for matplotlib, I should mention polars, which has a grammar closer to the tidyverse and is significantly better than pandas.

mermast commented 6 months ago

There is another plotting option, lets-plot

sulphatet commented 6 months ago

I too would like to work on the python code. @bmreiniger let's collaborate on this?

topepo commented 6 months ago

I'll recant my previous statement:

> I'd like to avoid extra complexity but would defer to the python community for those decisions.

Use whatever libraries you see fit. We use a ton of R packages to make the book (that's the way R is); use anything that you think makes the best results.

topepo commented 6 months ago

I suggest creating a starter repo using the structure and styling of the computing-tidymodels repo. Once you get all the Python bits set up, ping me.

topepo commented 6 months ago

Also, I can export the data sets to a format more suitable for Python to ingest. What do you suggest? csv?

bmreiniger commented 6 months ago

I'll probably be more useful on content, but I have a little site deployment experience; when I get some time I'll draft something. If anybody else knows more and/or has more time, jump in. My first thoughts:

  0. Quarto in the same repo as tidyverse coding, with panelsets. I still think this is attractive enough to do a demo of. On the other hand, fully rebuilding the site would require both an R and a python env...
  1. Quarto with qmd files and python snippets. This mirrors the tidyverse version the closest, and styling should be trivially very close as well.
  2. Quarto with ipynb files. Nice that the jupyter notebooks could be downloaded and executed directly, but git diffs will be unpleasant.
  3. sphinx-gallery I think is how sklearn generates its examples. Straight python means easy diffs and easily runnable, markup in comments for text sections. But styling will be harder, I imagine.
  4. ...?

As for data format, csv is probably fine. At least until something comes up to suggest otherwise.

On plotting, I'd lean toward starting out with matplotlib (and using the plotting functionality of pandas and sklearn), and if anyone can make much nicer plots much easier with another package, then make a PR for us all to look at. Similarly, I'd start with pandas, but if @lcrmorin or others can make something look nicer (or much faster, even for the toy datasets I imagine we'll have here?) using polars then let's see that and decide together.

bmreiniger commented 6 months ago

A (very) rough demo for option 1: https://bmreiniger.github.io/aml4td-demo-computing-python/chapters/whole-game.html

ddixonAI commented 6 months ago

I like that! Sphinx-gallery from option 3 looks nice as well, but this is an area I'm not well versed in, so I don't have a strong opinion.

On the subject of plotting libraries, another option I'm fond of is using the Seaborn objects API: https://seaborn.pydata.org/tutorial/objects_interface.html

This allows one to approximate a ggplot-like grammar of graphics using method chaining. As it says in the docs, it's still early in development but might be worth trying out.

topepo commented 6 months ago

We've experimented with side-by-side R/python code and I've never seen it work all that well. I think that it should be Python only.

Based on other things that I've done, many of the people consuming the main site and these computing pages are not going to be well versed in Python or R. We'll need to strike a balance between helpful content for beginners and more experienced readers (including "how to install" docs).

That said, I think that @bmreiniger's options 1 and 2 are good (but I've never seen Sphinx-gallery until now and don't know if that works with Quarto).

topepo commented 6 months ago

The demo looks good!

There are some nice Posit Python packages for tables and interactivity and many others unrelated to Posit (obviously).

> Data splitting. sklearn’s train_test_split doesn’t support stratifying on a continuous outcome.

I was asked to discuss a PR or maybe a pip about this pre-pandemic. ¯\_(ツ)_/¯

There will be a lot of inconsistencies where R or Python have different (or more extensive) capabilities. It doesn't have to be a perfect reproduction of what is on the main site.

lcrmorin commented 6 months ago

The best way to go is usually to stratify by pd.cut(df.target, n_grp, labels=False). Regarding code translation, I have found LLMs to be very good at the task. It might be interesting to try this solution.
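A minimal sketch of that binning trick, assuming pandas and scikit-learn are installed. The helper name `stratified_continuous_split`, its defaults, and the toy data are hypothetical and only there to make the example runnable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_continuous_split(df, target, n_grp=4, test_size=0.25, seed=0):
    """Split df so the train and test sets have similar target distributions."""
    # Cut the continuous outcome into n_grp equal-width bins (integer codes).
    bins = pd.cut(df[target], n_grp, labels=False)
    # Stratify on the bin codes instead of on the raw continuous values.
    return train_test_split(
        df, test_size=test_size, stratify=bins, random_state=seed
    )


# Hypothetical toy data with a roughly uniform outcome.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    {"x": rng.normal(size=200), "target": rng.uniform(0, 100, size=200)}
)
train, test = stratified_continuous_split(demo, "target")
```

One caveat: with a skewed outcome, equal-width bins can leave a bin with fewer than two rows, which makes `train_test_split` raise an error. Quantile bins via `pd.qcut` are one way around that.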

bmreiniger commented 6 months ago

I think Sphinx would be used instead of Quarto. I'd like to put the same sort of demo together for that, but I suspect it'll end up being a similar amount of setup/work, with a very slight benefit of being pure .py scripts, and the detriment of being styled very differently from the rest of the project (barring a lot of work in defining a sphinx style/template).

I had some trouble getting renv set up, but now have a working demo of R+python in tabsets. ~Since it's in a branch of this repo, I don't know how to most readily make it viewable;~ you can ~download the html~ view it here. But (1) it requires managing both envs (python inside of reticulate), (2) during render both sets of code run, effectively doubling the runtime and memory usage, and (3) switching between the rendered tabsets makes the rest of the page jump around when they're of different lengths; so I agree with @topepo that it's not worth it.

So it seems approach (1) is probably best, and I'll try to clean it up, complete with a python env. (Maybe I'll still demo sphinx for the sake of having done it.) So, another early question: which environment manager? I'd suggest conda or Pipenv; I find conda more intuitive, and Pipenv more rigorous.

topepo commented 5 months ago

I want to keep the repos on Quarto just so that they are in one format. :-/

You can use Jupyter notebooks or basic Python chunks; you won't need R for anything.

bmreiniger commented 2 months ago

I've got a start in my org, if folks want to collaborate there. Ideally at some point it'd get moved under the aml4td org (with a name change)?
https://github.com/bmreiniger/aml4td-computing-python