cnoe-io / backstage-terraform-integrations


problem running `create data on eks - jupyterhub` template #7

Open Analect opened 4 months ago

Analect commented 4 months ago

@elamaran11, I was interested in running through this, having seen you present your work on the cnoe community meeting.

Just a quick heads-up: there's a typo in the repo README. The trailing `g` in `export GITHUB_APP_YAML_INDENTED=$(cat ./private/github-integration.yaml | base64 | sed 's/^/ /')g` should not be there.
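For reference, here is the corrected command with the stray `g` removed. This is a self-contained sketch: the `./private/github-integration.yaml` contents below are a placeholder for illustration, not real credentials.

```shell
# Placeholder credentials file, for illustration only.
mkdir -p ./private
printf 'appId: 123\n' > ./private/github-integration.yaml

# Corrected command: base64-encode the GitHub App config and prefix each
# line with a space so it can be indented into YAML. No trailing "g".
export GITHUB_APP_YAML_INDENTED=$(cat ./private/github-integration.yaml | base64 | sed 's/^/ /')
echo "$GITHUB_APP_YAML_INDENTED"
```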

I was able to get the new templates added and decided to run `create data on eks - jupyterhub` as a test. It took some time and failed at the end, per the logs below. Any thoughts on what might have gone wrong?

```
Outputs:
configure_kubectl = "aws eks --region eu-west-1 update-kubeconfig --name jupyterhub-on-eks"
SUCCESS: Terraform apply of all modules completed successfully
2024/05/04 15:00:22 finished running script
2024/05/04 15:00:22 saving TF state to k8s secrets, dir=/src/data-on-eks/ai-ml/jupyterhub
2024/05/04 15:00:22 getting eks info from TF state
2024/05/04 15:00:22 failed to describe cluster: operation error EKS: DescribeCluster, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region
INFO[2024-05-04T15:00:22.818Z] sub-process exited                            argo=true error="<nil>"
```
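For what it's worth, the `Missing Region` error usually means the AWS SDK could not resolve a region from its environment when tf-manager called `DescribeCluster`. As a sketch of a possible workaround (an assumption on my part, not a confirmed fix), exporting the region into the environment tf-manager runs in should let the SDK's default configuration chain pick it up:

```shell
# Hypothetical workaround: AWS SDKs resolve the region from environment
# variables before falling back to shared config files, so exporting
# AWS_REGION in tf-manager's environment may avoid "Missing Region".
export AWS_REGION=eu-west-1
echo "AWS_REGION=${AWS_REGION}"
```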

The full logs are available here: https://gist.github.com/Analect/1b6e63c9447a1fdb228ccc4bb7245edd

These Terraform scripts end up creating a lot of resources on AWS, and it was somewhat painful to have to remove them all manually. It would be great to have a way to reverse / delete a deployment from within Backstage. I know there is a cleanup.sh script, per the end of this README (https://awslabs.github.io/data-on-eks/docs/blueprints/ai-ml/jupyterhub), but it's unclear how one would initiate that within a Backstage context.
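For anyone else hitting this, the manual route I'd expect to work (an assumption based on the data-on-eks docs and the `/src/data-on-eks/ai-ml/jupyterhub` path in the logs, not something I've verified end to end) is running cleanup.sh from the blueprint directory. Shown as a dry run; drop the `echo` to actually destroy the resources:

```shell
# Sketch of manual cleanup: cleanup.sh in the blueprint directory wraps
# the terraform destroy steps for this blueprint. Dry run only.
BLUEPRINT_DIR=data-on-eks/ai-ml/jupyterhub
CLEANUP_CMD="cd ${BLUEPRINT_DIR} && ./cleanup.sh"
echo "$CLEANUP_CMD"
```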

Thanks for your efforts.

elamaran11 commented 4 months ago

@Analect First of all, thank you so much for trying this out; we appreciate it, and thanks for filing the issue. The behavior shown in your logs is known: your JupyterHub environment was created successfully and is good to go, but a minor error at the end of our tf-manager causes this failure message even though the create completed. We are working to replace the tf-manager with an alternative Terraform controller, so we have not been actively fixing this known bug. We totally agree with your feedback, and we are working on a post-delete hook to remove the created resources via Terraform using cleanup.sh. We are also working on a feature to update an existing deployment, so stay tuned. Thanks again, and please keep this issue open.

@nimakaviani