aws / graph-notebook

Library extending Jupyter notebooks to integrate with Apache TinkerPop, openCypher, and RDF SPARQL.
https://github.com/aws/graph-notebook
Apache License 2.0

Run via Papermill w/store-to fails #430

Open jklap opened 1 year ago

jklap commented 1 year ago

Tried to run a notebook with Papermill that uses `%%gremlin --store-data results`, but the following cells fail with `name 'results' is not defined`. The notebook works just fine when executed manually within JupyterLab.

Not sure whether the issue lies in Papermill or in graph-notebook, so I'm starting here.

Papermill isn't a requirement if there is a better way to run a notebook and save the results.

michaelnchin commented 1 year ago

Could you clarify the intended end goal of using Papermill here? Are you looking to save the results of a single `%%gremlin` query to a local file?

jklap commented 1 year ago

Hi @michaelnchin, sorry for the delay -- yes, using `--store-to`. I was going from memory, so I got it right in the title but wrong in the body of this issue :(
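
To make the failure concrete, it reduces to two cells like these (the query itself is just illustrative):

```
%%gremlin --store-to results
g.V().limit(10)
```

```python
# Next cell: works when run manually in JupyterLab, but under Papermill
# it raises NameError: name 'results' is not defined
print(results)
```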

We have several notebooks that each run several queries, save the results to variables, and then run the results through pandas DataFrames for some "ETL". At that point we write out the resulting notebook for later viewing AND push some of the results to Prometheus' Pushgateway so we can embed the data in Grafana. These notebooks work just great when executed manually in JupyterLab -- it's only when we tried to run them via Papermill that they failed.
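
The Pushgateway step looks roughly like this (the metric name, value, and gateway address are placeholders, not our real config):

```python
import pandas as pd
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Stand-in for the DataFrame produced by the earlier query/ETL cells.
df = pd.DataFrame({"vertex_count": [42]})

registry = CollectorRegistry()
gauge = Gauge(
    "nightly_vertex_count",
    "Vertices counted by the nightly job",
    registry=registry,
)
gauge.set(int(df["vertex_count"].iloc[0]))

# Push to a (hypothetical) Pushgateway; Grafana then reads it via Prometheus.
push_to_gateway("pushgateway.internal:9091", job="nightly_neptune_etl",
                registry=registry)
```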

We picked Papermill because we need to schedule these to run on a regular basis, i.e. daily, and Papermill's parameter functionality, along with its save-cell-on-execute behavior, has been very useful.
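
For reference, our Papermill call is essentially the documented API, along these lines (notebook names and parameters are illustrative):

```python
import papermill as pm

# Inject parameters into the notebook's tagged "parameters" cell; Papermill
# saves each cell's output as it executes, so partial runs are inspectable.
pm.execute_notebook(
    "neptune_report.ipynb",
    "neptune_report_out.ipynb",
    parameters={"lookback_days": 1},
)
```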

If there is a better tool for this, we are certainly open to input. One additional caveat: we are also using Airflow for execution, i.e. via https://airflow.apache.org/docs/apache-airflow-providers-papermill/stable/operators.html, as that is our standard execution engine (though the error here occurred when running manually via the CLI, not through Airflow yet).
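
The Airflow side would be roughly the following; the operator comes from the provider package linked above, while the DAG id, paths, and the `schedule` argument are illustrative and depend on the Airflow version:

```python
import pendulum
from airflow import DAG
from airflow.providers.papermill.operators.papermill import PapermillOperator

with DAG(
    dag_id="nightly_neptune_report",
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
):
    # Execute the notebook once per day, templating the output path and
    # parameters with the run date.
    PapermillOperator(
        task_id="run_report_notebook",
        input_nb="/opt/notebooks/neptune_report.ipynb",
        output_nb="/opt/notebooks/out/neptune_report_{{ ds }}.ipynb",
        parameters={"run_date": "{{ ds }}"},
    )
```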

Papermill also supports writing to S3, which is a use case we are still refining -- i.e. nightly jobs that execute against Neptune and then write the results to S3 for another team to pick up. This is not a hard requirement though, since other tools such as awswrangler, boto3, or even the AWS CLI can solve it with low overhead.
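
A sketch of both options (bucket and key names are placeholders; the direct `s3://` output path relies on Papermill's S3 I/O support being installed):

```python
import papermill as pm
import boto3

# Papermill can write the executed notebook straight to an s3:// path...
pm.execute_notebook(
    "neptune_report.ipynb",
    "s3://my-results-bucket/nightly/neptune_report.ipynb",
)

# ...or you can run locally and upload the output afterwards with boto3.
boto3.client("s3").upload_file(
    "neptune_report_out.ipynb",
    "my-results-bucket",
    "nightly/neptune_report_out.ipynb",
)
```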

bechbd commented 1 year ago

@jklap One option you could also explore is the AWS SDK for pandas (https://aws-sdk-pandas.readthedocs.io/en/stable/api.html), which supports both Neptune and S3. You can use it to execute queries against Neptune (https://aws-sdk-pandas.readthedocs.io/en/stable/tutorials/033%20-%20Amazon%20Neptune.html), which returns a pandas DataFrame, and then save that data to S3 (https://aws-sdk-pandas.readthedocs.io/en/stable/tutorials/003%20-%20Amazon%20S3.html).
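
Roughly something like this (the endpoint, query, and bucket are placeholders):

```python
import awswrangler as wr

# Connect to a (hypothetical) Neptune endpoint; execute_gremlin returns
# the query results as a pandas DataFrame.
client = wr.neptune.connect(
    "my-cluster.cluster-abc123.us-east-1.neptune.amazonaws.com",
    8182,
    iam_enabled=False,
)
df = wr.neptune.execute_gremlin(
    client, "g.V().hasLabel('person').limit(100).valueMap(true)"
)

# Write the DataFrame to S3 as Parquet (bucket/prefix are placeholders).
wr.s3.to_parquet(df=df, path="s3://my-results-bucket/neptune/person/",
                 dataset=True)
```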

jklap commented 1 year ago

@bechbd yes -- that is the awswrangler I mentioned. But it really doesn't solve the core problem of scheduling notebooks -- our users are used to creating notebooks and executing queries with graph-notebook. I mentioned S3 etc. simply to better describe the scope of functionality we've been looking at using with Papermill, to help clarify our needs for any other suggestions.