CityOfLosAngeles / aqueduct

A shared pipeline for building ETLs and batch jobs that we run at the City of LA for Data Science Projects. Built on Apache Airflow & Civis Platform.
Apache License 2.0

Civis notebook env #336

Closed tiffanychu90 closed 4 years ago

tiffanychu90 commented 4 years ago

Some thoughts on clunkiness in the notebook environment within Civis:

  1. The notebook takes a long time to start up, and it often terminates because it reports being deployed when it isn't actually deployed.

  2. Unclear workflow for creating a notebook within Civis and pushing it to GitHub. I can connect it to a GitHub repo, branch, and file, but do I have to work only with files that already exist on GitHub? Can I create a new notebook in Civis and push it to GitHub? How do you take a notebook out of Civis (similar to how you drag a file to another location locally)?

  3. Unclear where files export to. Where do you go to access exported files? What is the directory structure? One workaround seems to be pushing files to S3, but this works better for tabular than geospatial files. In one notebook, I want to create a zipped shapefile and upload it to S3. But geospatial files must be created locally and then uploaded to S3, unlike tabular data, which can be exported to S3 directly. Without knowing the "local"/cloud directory structure, I can't upload geospatial files to S3.

  4. Sharing is confusing. Container scripts have been shared by making other users managers and running a Sharebot through them, after which others can run the container script. Applying that workflow to notebooks doesn't seem to work. I made another user a manager on a notebook and ran a Sharebot through the notebook to make that user a manager, but the other user still can't interact with the notebook. Instead, it seems easier for other users to collaborate by checking out the notebook file from GitHub and working in our local environment.

hunterowens commented 4 years ago

additional thought: The Notebook interface clone doesn't clone the entire repo, just the .ipynb file.

hunterowens commented 4 years ago

notebook interface does in fact clone entire repo

tiffanychu90 commented 4 years ago
  1. The notebook takes a long time to start up, and it often terminates because it reports being deployed when it isn't actually deployed. This is because Civis's own Docker images are preset and are pulled from Amazon's ECR registry, not from Docker Hub, where ours are hosted.

  2. Unclear workflow for creating a notebook within Civis and pushing it to GitHub. I can connect it to a GitHub repo, branch, and file, but do I have to work only with files that already exist on GitHub? Can I create a new notebook in Civis and push it to GitHub? How do you take a notebook out of Civis (similar to how you drag a file to another location locally)? We can create a branch in GitHub, then write the new file to that new branch.

  3. Unclear where files export to. Where do you go to access exported files? What is the directory structure? One workaround seems to be pushing files to S3, but this works better for tabular than geospatial files. In one notebook, I want to create a zipped shapefile and upload it to S3. But geospatial files must be created locally and then uploaded to S3, unlike tabular data, which can be exported to S3 directly. Without knowing the "local"/cloud directory structure, I can't upload geospatial files to S3. Use the terminal to see the path. The notebook does clone the entire repo, so we can access utils, but we must push to GitHub, because those local files are lost when we shut down the server.

  4. Sharing is confusing. Container scripts have been shared by making other users managers and running a Sharebot through them, after which others can run the container script. Applying that workflow to notebooks doesn't seem to work. I made another user a manager on a notebook and ran a Sharebot through the notebook to make that user a manager, but the other user still can't interact with the notebook. Instead, it seems easier for other users to collaborate by checking out the notebook file from GitHub and working in our local environment. The workflow should be: share the notebook object in Civis, then have the other user clone that object and work on their own version of the notebook. We probably ran into two users accessing the same notebook and both starting servers, which causes the notebook to be constantly overwritten.
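
For item 3, a rough sketch of the "create locally, then upload" step for a zipped shapefile. This is only an illustration under assumptions, not something taken from aqueduct: the function names, the bucket, and the key are hypothetical placeholders, and it assumes `geopandas` and `boto3` are installed in the notebook image and that AWS credentials are already configured.

```python
import os
import shutil


def write_shapefile(gdf, shp_dir, name):
    """Write a GeoDataFrame as an ESRI Shapefile inside shp_dir.

    gdf is assumed to be a geopandas.GeoDataFrame. A shapefile is several
    sidecar files (.shp, .shx, .dbf, ...), so it gets its own folder.
    """
    os.makedirs(shp_dir, exist_ok=True)
    gdf.to_file(os.path.join(shp_dir, name + ".shp"), driver="ESRI Shapefile")


def zip_shapefile(shp_dir, zip_base):
    """Zip every file in shp_dir into <zip_base>.zip and return that path."""
    # make_archive appends ".zip" to zip_base for the "zip" format.
    return shutil.make_archive(zip_base, "zip", root_dir=shp_dir)


def upload_to_s3(zip_path, bucket, key):
    """Upload a local file to S3. bucket and key are placeholder values."""
    import boto3  # imported here so the zipping step works without boto3

    boto3.client("s3").upload_file(zip_path, bucket, key)


# Hypothetical usage inside a notebook cell:
#   print(os.getcwd())  # shows where "local" files land on the server
#   write_shapefile(gdf, "parcels", "parcels")
#   zip_path = zip_shapefile("parcels", "parcels")
#   upload_to_s3(zip_path, "my-bucket", "gis/parcels.zip")
```

Because the notebook server's local files are lost on shutdown, the upload (or a push to GitHub) would need to happen before stopping the server.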

hunterowens commented 4 years ago

We should break out 1 into its own issue; 2 is a bug report in Civis that they are fixing; 3 and 4 are documented.