apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Papermill provider enable logging from notebook to be visible in task's log. #35408

Open wno-xyt opened 1 year ago

wno-xyt commented 1 year ago

Description

Function execute_notebook in papermill accepts named parameter log_output: execute_notebook(nb, kernel_name, output_path=None, progress_bar=True, log_output=False, autosave_cell_every=30, **kwargs)

Unfortunately, the provider package's interface currently does not allow setting it to True. It would be great to have this option so that logs from the executed notebook are visible in Airflow's task log. If I understand correctly how this works, setting log_output=True would just make papermill use the configured logger (in this case Airflow's) for the output of the notebook.
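To illustrate the request, here is a minimal sketch of calling papermill directly with log_output=True. The helper names build_execute_kwargs and run_notebook are hypothetical and are not part of the Airflow provider; the sketch also assumes that papermill emits notebook output through the "papermill" logger at INFO level, which is worth verifying against the installed version.

```python
import logging


def build_execute_kwargs(input_nb, output_nb, parameters=None, log_output=True):
    """Collect keyword arguments for papermill.execute_notebook (hypothetical helper)."""
    return {
        "input_path": input_nb,
        "output_path": output_nb,
        "parameters": parameters or {},
        "progress_bar": False,  # a progress bar only adds noise to a task log
        "log_output": log_output,  # ask papermill to log the notebook's cell output
    }


def run_notebook(input_nb, output_nb, parameters=None, log_output=True):
    """Execute a notebook with papermill, routing cell output to Python logging."""
    import papermill as pm  # imported lazily so the helper above has no dependencies

    # Assumption: papermill logs through the "papermill" logger at INFO level,
    # so that level must be enabled for the output to reach any handlers.
    logging.getLogger("papermill").setLevel(logging.INFO)
    return pm.execute_notebook(
        **build_execute_kwargs(input_nb, output_nb, parameters, log_output)
    )
```

If the provider exposed an equivalent option, the operator could pass it straight through to this call; how Airflow's task logging picks up the records depends on its logging configuration.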

Use case/motivation

I think it would be nice to have logs from notebook execution visible in the Airflow task log to be able to:

Related issues

No response

Are you willing to submit a PR?


boring-cyborg[bot] commented 1 year ago

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

Taragolis commented 1 year ago

AFAIK, there are no papermill users among the active Airflow maintainers, so I would recommend checking it on your own and making a PR with the changes; otherwise it might take an unpredictable amount of time to implement. The best we could do is mark this as a good first issue.

Some useful links

potiuk commented 1 year ago

FWIW cc: @bolkedebruin, who I think might have more insight into the Papermill operator and its usage (from what I remember, an "it's generally unusable" sentence comes to mind, but I might well have misunderstood it).

bolkedebruin commented 1 year ago

When I added the PapermillOperator, we were experimenting with it to allow our data scientists to become more productive, as in being able to schedule experiments faster. I think the world has chosen to keep its ML experimentation mostly elsewhere. However, the notebook idea is still very much alive with the likes of Databricks, and there have been recent updates to the Papermill repo.

It might just be that the Airflow audience typically doesn't like notebooks and the audience that does typically does not go to Airflow. That might be due to the fact that the PapermillOperator isn't well documented and does not have great examples. In other words, the PapermillOperator needs some love.

So, I would say not unusable (it does support Python 3.12 now, @potiuk) but not well groomed :-).

Just a wild thought: It would be fun if we could read DBC and run that, which would look like Papermill but not exactly.