apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0

[Bug][Devlake] panic: invalid encKey #4545

Closed ankgupta99 closed 1 year ago

ankgupta99 commented 1 year ago

Search before asking

What happened

Devlake container exiting in AWS ECS with the following error:

28/2/2023, 3:01:27 pm panic: invalid encKey devlake
28/2/2023, 3:01:27 pm goroutine 1 [running]: devlake
28/2/2023, 3:01:27 pm time="2023-02-28 09:31:27" level=info msg="plugin loaded zentao" devlake
28/2/2023, 3:01:27 pm time="2023-02-28 09:31:27" level=info msg="plugin loaded webhook" devlake
28/2/2023, 3:01:27 pm time="2023-02-28 09:31:27" level=info msg="plugin loaded tapd" devlake
28/2/2023, 3:01:27 pm time="2023-02-28 09:31:27" level=info msg="plugin loaded starrocks"

What do you expect to happen

The Devlake container should keep running in AWS ECS instead of stopping with the mentioned error.

How to reproduce

Use the ECS task definition to spin up the containers.
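
For context, a minimal sketch of what such a task definition might look like in Terraform; the image tag, ports, and DB_URL value are illustrative assumptions, not the actual task definition used here:

```hcl
# Hedged sketch of an ECS task definition for the devlake backend container.
# Image tag and DB_URL are placeholders, not the values actually used.
resource "aws_ecs_task_definition" "devlake" {
  family = "devlake"

  container_definitions = jsonencode([
    {
      name      = "devlake"
      image     = "apache/devlake:latest" # assumed image name/tag
      essential = true
      portMappings = [
        { containerPort = 8080, hostPort = 8080 }
      ]
      environment = [
        # Placeholder Aurora MySQL connection string following devlake's DB_URL format.
        { name = "DB_URL", value = "mysql://root:password@aurora-endpoint:3306/lake?charset=utf8mb4&parseTime=True" }
      ]
      # Nothing here persists the encKey that devlake auto-generates on first start.
    }
  ])
}
```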

Anything else

No response

Version

main

Are you willing to submit PR?

Code of Conduct

klesh commented 1 year ago

I think we need to add a doc to explain how the encKey works. @ankgupta99 I see you have the "Submit PR" box checked, may I ask how you will fix the bug? Thanks.

iholovin commented 1 year ago

I'm experiencing the same issue with newly created infrastructure (including the DB), using an ECS task for the backend container:

For this one I had an empty .env file copied into the container:

[screenshot]

Interestingly enough, there was no such error for the exact same setup with an Aurora Postgres cluster (now I'm using Aurora MySQL).

klesh commented 1 year ago

@iholovin the .env must be copied from the existing container volume if you are going to use the existing database, since devlake writes the auto-generated encKey to that file. The error occurred because devlake tried to decrypt the existing data with the wrong key.
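
One hedged way to apply this on ECS is to pass the previously generated key into the new container explicitly instead of relying on a container-local .env. The sketch below assumes the key is read from devlake's ENCODE_KEY environment variable and kept in AWS Secrets Manager; both the variable name and the use of Secrets Manager are assumptions, not something confirmed in this thread.

```hcl
# Sketch: reuse the original encKey by injecting it as an ECS secret, so a
# replaced container decrypts existing data with the same key that encrypted it.
# ENCODE_KEY as the variable name and Secrets Manager as the store are assumptions.
resource "aws_secretsmanager_secret" "devlake_encode_key" {
  name = "devlake/encode-key" # put the key copied from the old .env into this secret
}

resource "aws_ecs_task_definition" "devlake_with_key" {
  family = "devlake"

  container_definitions = jsonencode([
    {
      name  = "devlake"
      image = "apache/devlake:latest"
      secrets = [
        { name = "ENCODE_KEY", valueFrom = aws_secretsmanager_secret.devlake_encode_key.arn }
      ]
    }
  ])
}
```

Note that ECS secrets also require the task execution role to be allowed to read that secret; a plain environment entry works too, at the cost of keeping the key in the task definition itself.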

klesh commented 1 year ago

@Startrekzky @likyh We should add a doc to explain the upgrade process.

klesh commented 1 year ago

@ankgupta99 @iholovin May I ask which approach you used to deploy and upgrade? Was it docker-compose or helm?

ankgupta99 commented 1 year ago

> @ankgupta99 @iholovin May I ask which approach you used to deploy and upgrade? Was it docker-compose or helm?

I took the devlake and config-ui container details from the docker-compose file and then created those two containers. For Grafana and MySQL, AWS managed services were used. I was able to get around the error by creating a new database.

klesh commented 1 year ago

@ankgupta99 Did you mount the .env file into the containers? The generated encKey should be stored in that file; mounting the same file into the new containers or copying the encKey over should fix your problem as well.
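
In ECS terms, "mounting the same file" usually means attaching a persistent volume (EFS, for example) to the task so the file devlake writes the key into survives container replacement. A hedged sketch follows; the container path is an assumption and would need to match wherever the image actually reads and writes its .env.

```hcl
# Sketch: keep the directory holding .env on EFS so the auto-generated encKey
# outlives individual containers. The containerPath below is an assumption.
resource "aws_efs_file_system" "devlake_config" {}

resource "aws_ecs_task_definition" "devlake_mounted_env" {
  family = "devlake"

  volume {
    name = "devlake-config"
    efs_volume_configuration {
      file_system_id = aws_efs_file_system.devlake_config.id
    }
  }

  container_definitions = jsonencode([
    {
      name  = "devlake"
      image = "apache/devlake:latest"
      mountPoints = [
        {
          sourceVolume  = "devlake-config"
          containerPath = "/app/config" # assumption: adjust to where the image keeps .env
          readOnly      = false
        }
      ]
    }
  ])
}
```

Copying the encKey into the new deployment directly (as in the earlier sketch) avoids the shared volume entirely.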

uderik commented 1 year ago

Hi, I have the same problem with devlake installed in k8s from the helm chart. The env file is stored in a PVC, and I saw it was updated some days ago; I suspect the key has also been updated. How could this happen, and how can I restore the data now?

klesh commented 1 year ago

@uderik That is weird; devlake itself wouldn't regenerate a new key if one already existed.

Are you using the official helm repo? @warren830 we need to confirm whether it works correctly.

warren830 commented 1 year ago

Hi guys, I just upgraded my devlake pods with helm on a minikube node and found that the encKey didn't change. I followed this link to upgrade: https://github.com/apache/incubator-devlake-helm-chart#update

uderik commented 1 year ago

I did not update anything. I just restarted all the pods. This is not the first time I've had this situation; I'll look later at what happens there.

klesh commented 1 year ago

> I did not update anything. I just restarted all the pods. This is not the first time I've had this situation; I'll look later at what happens there.

Thanks, very much appreciated.

iholovin commented 1 year ago

> @ankgupta99 @iholovin May I ask which approach you used to deploy and upgrade? Was it docker-compose or helm?

Because of the DevOps policies at my company, I had to deploy containers with Terraform and ECS tasks. For the database, I used Aurora MySQL.

It was working before with Aurora Postgres though 🤔

I checked the MySQL logs and discovered the following error corresponding to the backend's requests:

Aborted connection 87 to db: 'devlake' user: 'root' host: '10.204.0.214' (Got an error reading communication packets). (sql_connect.cc:845)

Here is what I used for aws_rds_cluster:

  ...

  instance_class  = "db.r6g.large"
  engine          = "aurora-mysql"
  engine_version  = "8.0.mysql_aurora.3.03.0"

And aws_rds_cluster_parameter_group:

  ...

  family = "aurora-mysql8.0"

  parameter {
    name  = "character_set_server"
    value = "utf8mb4"
  }

  parameter {
    name  = "collation_server"
    value = "utf8mb4_bin"
  }
klesh commented 1 year ago

@iholovin Since you have already switched to MySQL from Postgres, I suggest wiping out the old deployment and redeploying a new one to see if the problem persists.

iholovin commented 1 year ago

@klesh yeah, that's what I did multiple times; I got this error even with a fresh deployment.

klesh commented 1 year ago

@iholovin devlake uses a connection pool with 100 connections, and it seems like some of those connections are failing. I suspect it is related to the aurora-mysql settings; this article found via Google seems likely related: https://dba.stackexchange.com/questions/19135/mysql-error-reading-communication-packets
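
If the aurora-mysql settings do turn out to be the cause, the knobs that the linked thread usually points at are the packet size and the network/idle timeouts. Below is a hedged Terraform sketch with illustrative values only; whether each parameter belongs in the instance-level or cluster-level parameter group should be checked against the Aurora MySQL docs.

```hcl
# Sketch only: parameters commonly tuned for "Got an error reading
# communication packets". Values are illustrative, not recommendations.
resource "aws_db_parameter_group" "devlake_mysql" {
  family = "aurora-mysql8.0"

  parameter {
    name  = "max_allowed_packet"
    value = "67108864" # 64 MB
  }

  parameter {
    name  = "wait_timeout"
    value = "28800"
  }

  parameter {
    name  = "net_read_timeout"
    value = "120"
  }

  parameter {
    name  = "net_write_timeout"
    value = "120"
  }
}
```

Such a group would be attached to the cluster instances via db_parameter_group_name on aws_rds_cluster_instance.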

Tagging @daniel-hutao, who is working on a similar task at the moment; please help reproduce the problem, thanks.