apache / incubator-devlake

Apache DevLake is an open-source dev data platform to ingest, analyze, and visualize the fragmented data from DevOps tools, extracting insights for engineering excellence, developer experience, and community growth.
https://devlake.apache.org/
Apache License 2.0
2.5k stars 493 forks source link

[Bug][Cronjobs] Devlake engine does not respect cronjob schedules #7640

Open AlexPerkbox opened 1 week ago

AlexPerkbox commented 1 week ago

Search before asking

What happened

Cronjobs are no longer considered and blueprints do not run recurringly based on a schedule.

E.g. This blueprint is supposed to run daily at midnight (3:00AM my local time) but it did not on June 14, 17, and 19. { "name": "DevOps-Demo-Blueprint", "projectName": "DevOps-Demo", "cronConfig": "0 0 * * *", "createdAt": "2024-06-12T09:50:01.797Z", "updatedAt": "2024-06-12T09:50:27.466Z" } image

Other than that, there are some 400 odd blueprints and when changing cronConfig (using API) only a handful of them actually get triggered (although they return 200 OK).

Running devlake-1.0.0-beta11 on AWS ECS/RDS and byog (bring your own grafana)

What do you expect to happen

Whenever a cronjob is updated a pipeline should trigger at the specified date and time.

How to reproduce

Update cronConfig setting either via UI or API and monitor behaviour.

Anything else

No response

Version

devlake-1.0.0-beta11

Are you willing to submit PR?

Code of Conduct

dosubot[bot] commented 1 week ago

Hey @AlexPerkbox! I'm here to assist you with any bugs, questions, or contributions. Let's work together to resolve your issue while we wait for a human maintainer to assist.

The issue you're experiencing with cronjob schedules not being respected in DevLake could be due to improper configuration of the synchronization policy. Specifically, the sync frequency or cron code might not be set correctly. Ensure that the sync frequency or cron code is correctly configured in the synchronization policy to enable recurring runs of the blueprints. Additionally, verify that the "Skip failed tasks" option is appropriately set to prevent interruptions in the data collection process [1][2][3][4].

There are also known performance issues with the devlake-1.0.0-beta11 version that could affect the execution of cronjobs in high-load scenarios. For instance, MySQL InnoDB might fail with the message "The total number of locks exceeds the lock table size" when dealing with large repositories or boards. To mitigate this, you can increase the innodb_buffer_pool_size in your MySQL configuration [5][6][7].

Additionally, there have been similar issues reported and resolved in the past:

  1. [Bug][framework] The blueprint cronjobs get schedule multiple times - This issue was closed by a pull request: fix: ReloadBlueprints is not safe.
  2. [Bug][Config-ui] Manual blueprint still has some info that should be in daily mode - This issue was resolved by pull request #4047 [8][9].

For further assistance, you can refer to the Apache DevLake Config UI tutorial, which provides detailed steps on setting up data connections, creating projects, and configuring synchronization policies. If issues persist, consider checking the troubleshooting section or creating a bug report on the GitHub repository with relevant details such as error messages and logs.

To continue talking to Dosu, mention @dosu.

d4x1 commented 1 week ago

@AlexPerkbox Thank you for your feedback.

  1. About the cronjob. It's weird. Can you provide more details about what you have done(step by step)? we need a stable reproduction.
  2. About the trigger. a. You have many blueprints(projects) b. You update their conjob configs via API, them all return status code 200 OK c. Only a handful of them are triggered successfully(when the cronjob time schedule is reached) Am I right ?
d4x1 commented 1 week ago

@AlexPerkbox You can set environment PIPELINE_MAX_PARALLEL a bigger number and see what will happen.

AlexPerkbox commented 1 week ago

@d4x1 Thank you for your response.

As part of the initial setup, there were a few hundred blueprints created in the form of one blueprint = one gitlab project. It was observed that a complete run to fetch all data from gitlab takes approximately 2 days so the best time to kick-off (schedule) all jobs is Friday afternoon. Bulk-setting cronConfig in the form of

curl --location --request PATCH "https://devlake.my.endpoint.com/api/blueprints/[1-500]" \
--header 'Content-Type: application/json' \
--data '{
    "cronConfig": "10 15 * * 5"
}' | jq

updates all blueprints fine, but when the time comes there is nothing scheduled/created/pending in the pipelines tab.

P.s Increasing PIPELINE_MAX_PARALLEL to 5 does not make a difference.

klesh commented 1 week ago

I suspect that the ECS is the cause of the problem. According to the document on https://aws.amazon.com/ecs/ , containers are fully managed by AWS, maybe it would put containers to sleep when it is not active(processing requests, or using any CPU time). I could not find any concrete proof at this point, but I might do it to save resources(money) if I were AWS.

wouldd commented 1 week ago

I have been seeing this same kind of problem. I have quite a few pipelines, all configured to run daily at 01:00, I find only 5 of them seem to run. there are no obvious errors or reasons why the others do not execute. the 5 that do run don't seem to be consistent. I'm running inside kubernetes

klesh commented 6 days ago

We are also using k8s (EKS) as well, and all pipelines get executed as expected, we also have some k3s clusters for testing and other purposes, and they are all working fine 😂. The chance of this being fixed is slim unless someone who is having the problem could dig deep into the root.

wouldd commented 6 days ago

@klesh if you could give me a couple of pointers at the code which is responsible for triggering these scheduled jobs etc. then I'd be happy to dig in, maybe add some daignostic logging to my local fork.

klesh commented 6 days ago

@wouldd Sure thing, we are using github.com/robfig/cron for scheduling jobs and here is the code calling the lib when creating/updating any blueprint, while the ReloadBlueprints function is called on the startup to load all blueprints from the db.

here is where the cronManager gets initialized after migrations were done, nothing fancy.

Let me know if I missed anything.

wouldd commented 6 days ago

ah, well now that I know what logs I should be looking for. what I see is this: time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[1][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec4fd8 0xc002ec4ff0 0xc002ec5008] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[1][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec4fd8 0xc002ec4ff0 0xc002ec5008] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[21][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5068 0xc002ec5080 0xc002ec5098] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[21][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5068 0xc002ec5080 0xc002ec5098] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[3][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef53f8 0xc002ef5410 0xc002ef5428] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[3][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef53f8 0xc002ef5410 0xc002ef5428] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[2][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef55f0 0xc002ef5608 0xc002ef5620] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[2][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ef55f0 0xc002ef5608 0xc002ef5620] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] created cron job failed on blueprint:[12][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5728 0xc002ec5740 0xc002ec5758] (504)" time="2024-06-25 00:00:04" level=error msg=" [blueprint] run cron job failed on blueprint:[12][REDACTED-Blueprint]\n\tcaused by: lock tables timeout: [0xc002ec5728 0xc002ec5740 0xc002ec5758] (504)"

I have 25 blueprints total. the UI shows them all 'scheduled' to run at the next expected schedule, however clearly they mostly fail to actually create in the cron scheduler due to table locks.

AlexPerkbox commented 6 days ago

I would confirm the same behaviour as @wouldd, here is some logs from AWS CloudWatch

2024-06-25T11:30:00.107+03:00
time="2024-06-25 08:30:00" level=error msg=" [blueprint] run cron job failed on blueprint:[495][<gitlab-project-slug>]
caused by: Error 1040: Too many connections (500)
Wraps: (2) Error 1040: Too many connections
Error types: (1) *hintdetail.withDetail (2) *mysql.MySQLError"
d4x1 commented 3 days ago

I would confirm the same behaviour as @wouldd, here is some logs from AWS CloudWatch

2024-06-25T11:30:00.107+03:00
time="2024-06-25 08:30:00" level=error msg=" [blueprint] run cron job failed on blueprint:[495][<gitlab-project-slug>]
caused by: Error 1040: Too many connections (500)
Wraps: (2) Error 1040: Too many connections
Error types: (1) *hintdetail.withDetail (2) *mysql.MySQLError"

@AlexPerkbox You can check AWS's config, turn up the value of something like max_connection_count.

klesh commented 3 days ago

@wouldd Aah, I see. Too many bps are trying to create pipelines simultaneously, causing some table lockers to timeout.

klesh commented 3 days ago

@wouldd Could you spread them to different times and see if the problem goes away?

wouldd commented 3 hours ago

@klesh I spred about half of them out last week and all those ran fine. I've now spread the remainder out. it's obviously a bit of a pain to have to do this manually. but certainly seems to work around the issue.