chaoss / grimoirelab

GrimoireLab: platform for software development analytics and insights
https://chaoss.github.io/grimoirelab/
GNU General Public License v3.0

Only global tasks are started and necessary ES indexes are missing #585

Closed: ncsibra-lab49 closed this issue 1 year ago

ncsibra-lab49 commented 1 year ago

We deployed the GrimoireLab components to AWS ECS, mostly based on the docker-compose example, using the latest image versions. Sortinghat, sortinghat-worker, sirmordred, and nginx run as separate services. They use managed AWS resources: MariaDB 10.6.5, Elasticsearch 6.8, and Redis 7.0.

Config files (the ${...} parts are replaced from environment variables at container start):

setup.cfg

```
[general]
short_name = GrimoireLab
update = true
min_update_delay = 60
debug = true
logs_dir = /home/grimoire/logs
bulk_size = 100
scroll_size = 100
aliases_file = /home/grimoire/aliases.json

[projects]
projects_file = /home/grimoire/conf/projects.json

[es_collection]
url = ${ELASTICSEARCH_HOST}

[es_enrichment]
url = ${ELASTICSEARCH_HOST}
autorefresh = true

[sortinghat]
host = ${NGINX_HOST}
user = root
password = ${SORTINGHAT_PASSWORD}
port = ${NGINX_PORT}
path = /identities/api/
ssl = false
database = sortinghat_db
autoprofile = [github, git]
matching = [email,name,username]
sleep_for = 100
unaffiliated_group = Unknown
affiliate = true

[panels]
kibiter_time_from = now-90d
kibiter_default_index = git
kibiter_url = ${KIBANA_ENDPOINT}
kibiter_version = 6.8.13

[phases]
collection = true
identities = true
enrichment = true
panels = true

[git]
raw_index = git_demo_raw
enriched_index = git_demo_enriched
latest-items = true
studies = [enrich_demography:git, enrich_areas_of_code:git, enrich_onion:git]

[github]
api-token = [${GH_TOKEN0},${GH_TOKEN1},${GH_TOKEN2},${GH_TOKEN3},${GH_TOKEN4},${GH_TOKEN5},${GH_TOKEN6},${GH_TOKEN7},${GH_TOKEN8},${GH_TOKEN9}]

[github:issue]
raw_index = github_raw
enriched_index = github_enriched
category = issue
sleep-for-rate = true
no-archive = true
studies = [enrich_onion:github,enrich_geolocation:user,enrich_geolocation:assignee,enrich_backlog_analysis,enrich_demography:github]

[github:pull]
raw_index = github-pull_raw
enriched_index = github-pull_enriched
category = pull_request
sleep-for-rate = true
no-archive = true
studies = [enrich_geolocation:user,enrich_geolocation:assignee,enrich_demography:github]

[github:repo]
raw_index = github-repo_raw
enriched_index = github-repo_enriched
category = repository
sleep-for-rate = true
no-archive = true
studies = [enrich_demography:github]

[githubql]
raw_index = github_event_raw
enriched_index = github_event_enriched
api-token = [${GH_TOKEN0},${GH_TOKEN1},${GH_TOKEN2},${GH_TOKEN3},${GH_TOKEN4},${GH_TOKEN5},${GH_TOKEN6},${GH_TOKEN7},${GH_TOKEN8},${GH_TOKEN9}]
sleep-for-rate = true
sleep-time = "300"
no-archive = true
studies = [enrich_duration_analysis:kanban, enrich_reference_analysis]

[github2]
api-token = [${GH_TOKEN0},${GH_TOKEN1},${GH_TOKEN2},${GH_TOKEN3},${GH_TOKEN4},${GH_TOKEN5},${GH_TOKEN6},${GH_TOKEN7},${GH_TOKEN8},${GH_TOKEN9}]

[github2:issue]
raw_index = github2-issues_raw
enriched_index = github2-issues_enriched
sleep-for-rate = true
category = issue
no-archive = true
studies = [enrich_geolocation:user, enrich_geolocation:assignee, enrich_feelings]

[github2:pull]
raw_index = github2-pull_raw
enriched_index = github2-pull_enriched
sleep-for-rate = true
category = pull_request
no-archive = true
studies = [enrich_geolocation:user, enrich_geolocation:assignee, enrich_feelings]

[enrich_demography:git]

[enrich_areas_of_code:git]
in_index = git_demo_raw
out_index = git-aoc_demo_enriched

[enrich_onion:git]
in_index = git
out_index = git-onion_demo_enriched
contribs_field = hash

[enrich_onion:github]
in_index_iss = github_issues_onion-src
in_index_prs = github_prs_onion-src
out_index_iss = github-issues-onion_enriched
out_index_prs = github-prs-onion_enriched

[enrich_geolocation:user]
location_field = user_location
geolocation_field = user_geolocation

[enrich_geolocation:assignee]
location_field = assignee_location
geolocation_field = assignee_geolocation

[enrich_feelings]
attributes = [title, body]
nlp_rest_url = http://localhost:2901

[enrich_backlog_analysis]
out_index = github_enrich_backlog
interval_days = 7
reduced_labels = [bug,enhancement]
map_label = [others, bugs, enhancements]

[enrich_demography:github]

[enrich_duration_analysis:kanban]
start_event_type = MovedColumnsInProjectEvent
fltr_attr = board_name
target_attr = board_column
fltr_event_types = [MovedColumnsInProjectEvent, AddedToProjectEvent]

[enrich_duration_analysis:label]
start_event_type = UnlabeledEvent
target_attr = label
fltr_attr = label
fltr_event_types = [LabeledEvent]

[enrich_reference_analysis]
```
projects.json structure

```
{
  "Project1": {
    "git": [ ... ],
    "github2:issue": [ ... ],
    "github2:pull": [ ... ],
    "github:issue": [ ... ],
    "github:pull": [ ... ],
    "github:repo": [ ... ],
    "githubql": [ ... ]
  },
  "Project2": {
    "git": [ ... ],
    "github2:issue": [ ... ],
    "github2:pull": [ ... ],
    "github:issue": [ ... ],
    "github:pull": [ ... ],
    "github:repo": [ ... ],
    "githubql": [ ... ]
  }
}
```

The real projects.json contains around 30,000 lines, roughly 4,000 repos in every category.

At first it worked correctly: it collected 8,533,696 documents in Elasticsearch and created these indexes/aliases:

indices

```
health status index                   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   .grimoirelab-sigils     OOP63k6SS_iMoBPbnKA6WA 1   1   39         0            16.5kb     16.5kb
green  open   .kibana_1               5e1J35IPTtm7RRE9MY02CQ 1   0   245        21           412.1kb    412.1kb
yellow open   git-aoc_demo_enriched   m8AoYdcET1Cv3FSyBNO1dw 1   1   6421491    0            3.4gb      3.4gb
yellow open   git-onion_demo_enriched H_VOYXZNT4aw5G7Zfjr2Mw 1   1   3543       0            848.1kb    848.1kb
yellow open   git_demo_enriched       BjK5KJ6FTNOsm_gG4iuing 1   1   853195     20562        1.7gb      1.7gb
yellow open   git_demo_raw            Qovgzh_SRB-uflQaVL-Wfw 1   1   854108     12815        1gb        1gb
yellow open   github-pull_raw         BPfXonH4RQuIlradBFTcGA 1   1   15291      729          172.6mb    172.6mb
yellow open   github-repo_enriched    urkhbFoQSkCFX4a2V1ukYg 1   1   34724      219          12mb       12mb
yellow open   github-repo_raw         x5nbW7BARNWFtTCmo2WNdg 1   1   39289      0            99.1mb     99.1mb
yellow open   github2-issues_raw      Tc5QIBzjQ7aFN95RYcCWYQ 1   1   95354      1128         512.3mb    512.3mb
yellow open   github2-pull_raw        VVX-NqiEQoemv8nbYveImg 1   1   15191      729          171.1mb    171.1mb
yellow open   github_event_raw        AEuBFKoURwi4Xc0R0h6ZBA 1   1   105872     0            307.2mb    307.2mb
yellow open   github_raw              _FK1r5amQYKz9KtNIRhu8A 1   1   95354      1134         514.7mb    514.7mb
```
aliases

```
alias                     index                 filter routing.index routing.search
.kibana                   .kibana_1             -      -             -
affiliations              git_demo_enriched     -      -             -
all_enriched              git_demo_enriched     -      -             -
demographics              git_demo_enriched     -      -             -
git                       git_demo_enriched     -      -             -
git-raw                   git_demo_raw          -      -             -
git_areas_of_code         git-aoc_demo_enriched -      -             -
git_author                git_demo_enriched     -      -             -
git_enrich                git_demo_enriched     -      -             -
github-raw                github_raw            -      -             -
github2_issues-raw        github2-issues_raw    -      -             -
github2_pull_requests-raw github2-pull_raw      -      -             -
github_events-raw         github_event_raw      -      -             -
github_pull_requests-raw  github-pull_raw       -      -             -
github_repositories       github-repo_enriched  -      -             -
github_repositories-raw   github-repo_raw       -      -             -
```

The first problem is that not all of the indexes defined in setup.cfg were created, e.g. github_enriched, github-pull_enriched, github_event_enriched, etc. This could be because it was not able to finish the whole enrichment process.

The bigger issue is that it now does not start any of the git/github tasks, only the global tasks:

all.log

```
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO -
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO - ----------------------------
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO - Starting SirMordred engine ...
2023-03-20 12:35:11,911 - sirmordred.sirmordred - INFO - - - - - - - - - - - - - - -
2023-03-20 12:35:22,009 - sirmordred.sirmordred - INFO - Loading projects
2023-03-20 12:35:28,504 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-20 12:35:28,514 - sirmordred.sirmordred - INFO - TaskProjects TaskIdentitiesMerge will be executed on Mon, 20 Mar 2023 12:37:08
2023-03-20 12:35:30,201 - sirmordred.task_projects - INFO - Reading projects data from /home/grimoire/conf/projects.json
2023-03-20 12:35:31,202 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
2023-03-20 12:35:42,354 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: da40b6aa-e173-469a-b473-28bcd8f223c1
2023-03-20 12:36:53,094 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm name
2023-03-20 12:37:06,286 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 70a4a8f1-077f-4b54-aeaa-ec5b449175eb
2023-03-20 12:38:17,005 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm username
2023-03-20 12:38:28,969 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 040cd3d7-f77c-49e1-a666-f9da6932fd9f
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - ERROR - http://aslive-nginx.github-metrics-dashboard:8000/identities/api/: HTTP Error 503: Service Unavailable
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: connection: close
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: content-length: 168
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: content-type: text/plain
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: date: Mon, 20 Mar 2023 12:39:09 GMT
2023-03-20 12:39:09,512 - sgqlc.endpoint.http - INFO - Response header: server: envoy
2023-03-20 12:39:09,513 - sgqlc.endpoint.http - INFO - Response [text/plain]: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 113
2023-03-20 12:39:09,513 - grimoire_elk.enriched.sortinghat_gelk - ERROR - [sortinghat] Error unify criteria: ['username'] {'message': 'HTTP Error 503: Service Unavailable', 'exception': , 'status': 503, 'headers': , 'body': 'upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 113'}
2023-03-20 12:39:09,513 - sirmordred.task_identities - INFO - [sortinghat] Executing affiliate
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - ERROR - http://aslive-nginx.github-metrics-dashboard:8000/identities/api/: HTTP Error 503: Service Unavailable
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: connection: close
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: content-length: 168
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: content-type: text/plain
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: date: Mon, 20 Mar 2023 12:39:18 GMT
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response header: server: envoy
2023-03-20 12:39:18,728 - sgqlc.endpoint.http - INFO - Response [text/plain]: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 113
2023-03-20 12:39:18,729 - sirmordred.task_identities - INFO - [sortinghat] Autogender not configured. Skipping.
2023-03-20 12:39:18,729 - sirmordred.task_manager - INFO - [Global tasks] sleeping for 100 seconds
2023-03-20 12:40:59,826 - sirmordred.task_projects - INFO - Reading projects data from /home/grimoire/conf/projects.json
2023-03-20 12:41:00,828 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
2023-03-20 12:41:28,267 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 98d3c123-61e6-4f87-ae70-889df9411aee
2023-03-20 12:42:39,088 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm name
2023-03-20 12:42:50,132 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 5901a732-f894-4d9e-a832-24aca1d0573f
2023-03-20 12:44:00,846 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm username
2023-03-20 12:44:13,938 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 5aa046f1-b8a9-49f7-bf2d-36aa51f16b12
2023-03-20 12:45:19,764 - sirmordred.task_identities - INFO - [sortinghat] Executing affiliate
2023-03-20 12:45:31,444 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate job id: 817d5976-5b4e-401d-9def-193a88fce607
2023-03-20 12:46:42,323 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate finished job id: 817d5976-5b4e-401d-9def-193a88fce607
2023-03-20 12:46:42,325 - sirmordred.task_identities - INFO - [sortinghat] Autogender not configured. Skipping.
2023-03-20 12:46:42,325 - sirmordred.task_manager - INFO - [Global tasks] sleeping for 100 seconds
2023-03-20 12:48:23,423 - sirmordred.task_projects - INFO - Reading projects data from /home/grimoire/conf/projects.json
2023-03-20 12:48:24,425 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
2023-03-20 12:48:51,317 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 30233057-6762-4afa-9158-4d9a4ee3c58a
2023-03-20 12:50:02,183 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm name
2023-03-20 12:50:15,415 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 14b0af80-fe71-4252-9ede-3e74d0b15b3c
2023-03-20 12:51:26,311 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm username
2023-03-20 12:51:37,014 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Unify job id: 5db0fe50-3ebb-4559-8896-99f9c70b4fd7
2023-03-20 12:52:42,722 - sirmordred.task_identities - INFO - [sortinghat] Executing affiliate
2023-03-20 12:52:54,807 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate job id: db8c28c9-9a68-476a-a1c5-f60b7b2430d3
2023-03-20 12:54:05,702 - grimoire_elk.enriched.sortinghat_gelk - INFO - [sortinghat] Affiliate finished job id: db8c28c9-9a68-476a-a1c5-f60b7b2430d3
2023-03-20 12:54:05,705 - sirmordred.task_identities - INFO - [sortinghat] Autogender not configured. Skipping.
2023-03-20 12:54:05,705 - sirmordred.task_manager - INFO - [Global tasks] sleeping for 100 seconds
```

We enabled debug logging too, but it does not show any error or other reason why the git/github tasks are not running. It looks like SirMordred knows that some backend tasks should be started, but it only starts the global tasks.

debug log snippet

```
2023-03-21 08:38:03,600 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-21 08:38:03,602 - sirmordred.sirmordred - DEBUG - backend_tasks = [, ]
2023-03-21 08:38:03,602 - sirmordred.sirmordred - DEBUG - global_tasks = [, ]
2023-03-21 08:38:03,603 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-21 08:38:03,603 - sirmordred.sirmordred - INFO - TaskProjects TaskIdentitiesMerge will be executed on Tue, 21 Mar 2023 08:39:43
2023-03-21 08:38:03,603 - sirmordred.task_manager - DEBUG - [, ]
```

We tried restarting the container multiple times, but it does not help. Any idea what the issue is here?

atomheartmother commented 1 year ago

I had a similar issue with sirmordred: it would complete one cycle of collection and enrichment but then not start another; it would only perform the sortinghat tasks in a loop. I would be interested in the solution here.

ncsibra-lab49 commented 1 year ago

I figured out the problem; it is basically a synchronization issue in sirmordred. I added a few extra debug logs to track it down.

debug log

```
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO -
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO - ----------------------------
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO - Starting SirMordred engine ...
2023-03-28 11:22:24,978 - sirmordred.sirmordred - INFO - - - - - - - - - - - - - - -
2023-03-28 11:22:24,983 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): vpc-***.es.amazonaws.com:443
2023-03-28 11:22:25,004 - urllib3.connectionpool - DEBUG - https://vpc-***.es.amazonaws.com:443 "GET / HTTP/1.1" 200 515
2023-03-28 11:22:25,014 - urllib3.connectionpool - DEBUG - https://vpc-***.es.amazonaws.com:443 "GET / HTTP/1.1" 200 515
2023-03-28 11:22:25,015 - sirmordred.sirmordred - DEBUG - backend_tasks = []
2023-03-28 11:22:25,015 - sirmordred.sirmordred - DEBUG - global_tasks = [, ]
2023-03-28 11:22:25,015 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:25,015 - sirmordred.task_manager - DEBUG - [, ]
2023-03-28 11:22:25,018 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:26,017 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:28,126 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:28,130 - sgqlc.endpoint.http - DEBUG - Query: mutation { tokenAuth(username: "***", password: "***") { token } }
2023-03-28 11:22:31,448 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:34,540 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:34,542 - sgqlc.endpoint.http - DEBUG - Query: mutation { tokenAuth(username: "***", password: "***") { token } }
2023-03-28 11:22:38,346 - sirmordred.task_panels - DEBUG - Active data sources for menu: ['git', 'github', 'github', 'community']
2023-03-28 11:22:38,347 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [, ]
2023-03-28 11:22:38,347 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - [thread:main] No exceptions in threads queue. Let's continue ..
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - [thread:main] All threads (and their tasks) are finished
2023-03-28 11:22:38,347 - sirmordred.sirmordred - INFO - Loading projects
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - backend_tasks = []
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - global_tasks = []
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - []
2023-03-28 11:22:38,350 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:39,349 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:41,456 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:41,457 - sgqlc.endpoint.http - DEBUG - Query: mutation { tokenAuth(username: "***", password: "***") { token } }
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: []
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] No exceptions in threads queue. Let's continue ..
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] All threads (and their tasks) are finished
2023-03-28 11:22:41,633 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - backend_tasks = [, ]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - global_tasks = [, ]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - TaskProjects get_projects: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos found: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos to be retrieved: {}
2023-03-28 11:22:41,635 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:41,635 - sirmordred.sirmordred - INFO - TaskProjects TaskIdentitiesMerge will be executed on Tue, 28 Mar 2023 11:24:21
2023-03-28 11:22:41,635 - sirmordred.task_manager - DEBUG - [, ]
2023-03-28 11:22:41,636 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:41,642 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:41,643 - sgqlc.endpoint.http - DEBUG - Query: mutation { tokenAuth(username: "***", password: "***") { token } }
2023-03-28 11:22:41,842 - urllib3.connectionpool - DEBUG - Starting new HTTP connection (1): nginx.***:8000
2023-03-28 11:22:41,847 - urllib3.connectionpool - DEBUG - http://nginx.***:8000 "GET /identities/api/ HTTP/1.1" 400 79
2023-03-28 11:22:41,848 - sgqlc.endpoint.http - DEBUG - Query: mutation { tokenAuth(username: "***", password: "***") { token } }
2023-03-28 11:22:42,023 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [, ]
2023-03-28 11:22:43,024 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks started:
2023-03-28 11:22:43,025 - sirmordred.task_projects - INFO - Reading projects data from /home/grimoire/conf/projects.json
2023-03-28 11:22:43,050 - sirmordred.task_projects - DEBUG - Projects file has changed
2023-03-28 11:22:43,050 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks finished:
2023-03-28 11:22:43,050 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks started:
2023-03-28 11:22:44,050 - sirmordred.task_identities - DEBUG - [unify] Enrich tasks active: 0
2023-03-28 11:22:44,051 - sirmordred.task_identities - INFO - [sortinghat] Unifying identities using algorithm email
```

The interesting parts are:

2023-03-28 11:22:38,347 - sirmordred.sirmordred - INFO - Loading projects
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - backend_tasks = []
2023-03-28 11:22:38,347 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:39,349 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_projects.TaskProjects object at 0x7fd6847d9070>]
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] No exceptions in threads queue. Let's continue ..
2023-03-28 11:22:41,633 - sirmordred.sirmordred - DEBUG - [thread:main] All threads (and their tasks) are finished
2023-03-28 11:22:41,633 - sirmordred.sirmordred - INFO - Projects loaded
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - backend_tasks = [<class 'sirmordred.task_collection.TaskRawDataCollection'>, <class 'sirmordred.task_enrich.TaskEnrich'>]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - global_tasks = [<class 'sirmordred.task_projects.TaskProjects'>, <class 'sirmordred.task_identities.TaskIdentitiesMerge'>]
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - TaskProjects get_projects: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos found: {}
2023-03-28 11:22:41,634 - sirmordred.sirmordred - DEBUG - repos to be retrieved: {}
2023-03-28 11:22:41,635 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts

At first, the __execute_initial_load function tries to collect the projects from the configured projects_file or projects_url. In this case the wait_for_threads flag is set to True in the execute_batch_tasks function, which sets the stopper after 1 second, and that is the cause of the issue:

if wait_for_threads:
    time.sleep(1)  # Give enough time create and run all threads
    stopper.set()  # All threads must stop in the next iteration
    logger.debug('stopper set')

I added an extra log there.

In task_manager.py, the loop only runs while the stopper is not set.

As you can see from the logs, stopper.set() runs before the TaskManager even had a chance to run the loop:

2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [Global tasks] Task starts
2023-03-28 11:22:38,348 - sirmordred.task_manager - DEBUG - [<class 'sirmordred.task_projects.TaskProjects'>]
2023-03-28 11:22:39,349 - sirmordred.sirmordred - DEBUG - stopper set
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Tasks will be executed in this order: [<sirmordred.task_projects.TaskProjects object at 0x7fd6847d9070>]
2023-03-28 11:22:41,633 - sirmordred.task_manager - DEBUG - [Global tasks] Task is exiting

There is a 3-second delay between "Task starts" and "Tasks will be executed in this order", which is logged just before the loop. So the projects are not collected on the first run, and when execute_batch_tasks runs again from execute_nonstop_tasks with wait_for_threads=False, it does not find any projects and does not create the backend threads; and because the global tasks thread never finishes, execute_batch_tasks never runs again either.
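The race is easy to reproduce outside SirMordred with plain threading code. The sketch below uses only illustrative names (worker, executed), not actual SirMordred identifiers: a worker whose startup phase takes longer than the fixed 1-second grace period never gets to run its loop body at all.

```python
# Minimal reproduction of the time-based race (hypothetical names).
# The parent sets the stopper after a fixed 1-second grace period, but the
# worker spends longer than that starting up, so its loop body never runs,
# just like TaskProjects never loading the projects on a slow ECS host.
import threading
import time

executed = []
stopper = threading.Event()

def worker():
    time.sleep(2)                  # slow startup (e.g. waiting on sortinghat)
    while not stopper.is_set():    # already set by the time we get here
        executed.append('task ran')
        time.sleep(0.1)

t = threading.Thread(target=worker)
t.start()
time.sleep(1)                      # "give enough time" -- not enough here
stopper.set()
t.join()

print(executed)                    # [] -- the loop body never executed
```

The same code with the sleeps swapped (fast worker, generous grace period) works fine, which is why the bug only shows up in resource-constrained environments.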

In a limited environment like the one we created in ECS, this time-based synchronization does not work. Currently, in the execute_batch_tasks function, the child threads stop whenever the parent says so, but the real requirement in the wait_for_threads=True case seems to be to run the TaskManager loop at least once and then stop the thread. So the child thread should notify the parent thread when it is finished.
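That notification could be sketched with a second threading.Event instead of a timed sleep. The names here (first_pass_done, worker, runs) are hypothetical, not SirMordred code; the point is only that the parent waits for a signal rather than guessing a duration.

```python
# Event-based hand-off instead of time-based synchronization (sketch,
# hypothetical names): the worker signals after completing one full pass,
# and the parent waits for that signal before asking it to stop.
import threading

first_pass_done = threading.Event()
stopper = threading.Event()
runs = []

def worker():
    while not stopper.is_set():
        runs.append('pass')        # execute all tasks once
        first_pass_done.set()      # tell the parent a full pass finished

t = threading.Thread(target=worker)
t.start()
first_pass_done.wait()             # no time.sleep(1) guesswork
stopper.set()
t.join()

assert len(runs) >= 1              # at least one pass is now guaranteed
```

However slow the host is, at least one pass is guaranteed before the stopper is set.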

My quick and dirty solution was to change the loop in TaskManager to a loop-and-a-half pattern.

So from this:

while not self.stopper.is_set():
    # we give 1 extra second to the stopper, so this loop does
    # not finish before it is set.
    time.sleep(1)

    for task in self.tasks:
        logger.debug('[%s] Tasks started: %s', self.backend_section, task)
        try:
            task.execute()
        except Exception as ex:
            logger.error("[%s] Exception in Task Manager %s", self.backend_section, ex, exc_info=True)
            TasksManager.COMM_QUEUE.put(sys.exc_info())
            raise
        logger.debug('[%s] Tasks finished: %s', self.backend_section, task)

    timer = self.__get_timer(self.backend_section)
    if timer > 0 and self.config.get_conf()['general']['update']:
        logger.info("[%s] sleeping for %s seconds ", self.backend_section, timer)
        time.sleep(timer)

To this:

while True:
    # run the tasks at least once before checking the stopper;
    # the 1-second sleep still lets the other threads start up.
    time.sleep(1)

    for task in self.tasks:
        logger.debug('[%s] Tasks started: %s', self.backend_section, task)
        try:
            task.execute()
        except Exception as ex:
            logger.error("[%s] Exception in Task Manager %s", self.backend_section, ex, exc_info=True)
            TasksManager.COMM_QUEUE.put(sys.exc_info())
            raise
        logger.debug('[%s] Tasks finished: %s', self.backend_section, task)

    timer = self.__get_timer(self.backend_section)
    if timer > 0 and self.config.get_conf()['general']['update']:
        logger.info("[%s] sleeping for %s seconds ", self.backend_section, timer)
        time.sleep(timer)

    if self.stopper.is_set():
        break

Now all the backend tasks are started too, because the projects are collected correctly.

sduenas commented 1 year ago

I never liked how this was implemented. The way threads and tasks are handled is a bit messy, so it's prone to bugs like the one you are mentioning.

@ncsibra-lab49 I think your code fixes this problem but I'm not sure if it could lead to other race conditions. Any insight about it?

ncsibra-lab49 commented 1 year ago

I don't think it leads to a race condition, because the shared objects (backend_tasks, global_tasks, stopper, etc.) are not modified in the child or parent thread. But as I said, my change is just a quick and dirty solution (more a workaround); I wouldn't put it in the actual codebase.

The real solution is to avoid time-based synchronization. One option is to remove stopper and simply pass a parameter to TaskManager telling it whether to run indefinitely, then join on the threads. When run from execute_nonstop_tasks the threads never stop anyway, so this would match the current behavior.
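A minimal sketch of this first option, with hypothetical names (nonstop, TaskManager here is a toy, not the real sirmordred class): the manager takes a flag instead of sharing a stopper event, and the parent simply joins.

```python
# Sketch: replace the shared stopper with an explicit run mode.
# All names are illustrative; this is not the actual SirMordred code.
import threading

class TaskManager(threading.Thread):
    def __init__(self, tasks, nonstop=False):
        super().__init__()
        self.tasks = tasks
        self.nonstop = nonstop     # run forever, or exactly one pass

    def run(self):
        while True:
            for task in self.tasks:
                task()
            if not self.nonstop:
                break              # initial load: one guaranteed pass

results = []
tm = TaskManager([lambda: results.append('projects loaded')], nonstop=False)
tm.start()
tm.join()                          # parent just joins; no timing involved
assert results == ['projects loaded']
```

With nonstop=False the pass always completes before join() returns, regardless of how long each task takes.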

The other option is to control the number of runs in the parent process: remove the loop from TaskManager and move it to SirMordred. The advantage of this is that we can do anything between two runs, like reloading the backend list with _get_repos_by_backend.
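This second option could be sketched as follows, again with hypothetical names (execute_batch is a toy stand-in, not a real SirMordred function): the parent owns the loop, and each batch is one fully joined pass over the tasks, so state can be refreshed between passes.

```python
# Sketch: the parent owns the loop; each batch is one joined pass.
# Names are illustrative only, not taken from the SirMordred codebase.
import threading

def execute_batch(tasks):
    threads = [threading.Thread(target=t) for t in tasks]
    for th in threads:
        th.start()
    for th in threads:
        th.join()                  # one pass, fully synchronized

log = []
for cycle in range(2):             # the parent controls repetition
    # between passes we could reload repos (e.g. via _get_repos_by_backend)
    execute_batch([lambda: log.append(cycle)])

assert log == [0, 1]               # both passes ran, in order
```

Because join() separates the passes, anything recomputed between iterations (like the repo list) is guaranteed to be visible to the next batch.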

sduenas commented 1 year ago

This issue is fixed now. The new release of GrimoireLab will include the fix.