GSA / catalog.data.gov

Development environment for catalog.data.gov
https://catalog.data.gov

📌 DB Solr Sync Auditing Log #848

Open nickumia-reisys opened 1 year ago

nickumia-reisys commented 1 year ago

Workflow with Issue: 4 - Automated CKAN Jobs
Job being audited: ckan-auto-command
CKAN Command (in question): ckan geodatagov db-solr-sync
CKAN Command Schedule: 0 3 *
Cloud.gov Environment: prod
Total Execution Time: 496

Last Commit: de963a357068574c8da0580434779d8db7076d03
Number of times run: 1
Last run by: nickumia-reisys
Github Action Run: https://github.com/GSA/catalog.data.gov/actions/runs/12002518348

nickumia-reisys commented 1 year ago
March 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 3/4/2023
    104 :a1, 00:00, 8m
    23(staging) :after a1, 1m
    section 3/5/2023
    104 :a2, 00:00, 8m
    section 3/6/2023
    58 :a3, 00:00, 8m
    section 3/7/2023
    101 :a4, 00:00, 8.25m
    section 3/8/2023
    101 :a4, 00:00, 494s
    section 3/9/2023
    15853 :a4, 00:00, 2254s
    section 3/10/2023
    13355 :a4, 00:00, 1799s
    section 3/12/2023
    481 :a4, 00:00, 510s
    section 3/13/2023
    30472 :a4, 00:00, 2949s
    section 3/14/2023
    7511 :a4, 00:00, 1114s
    section 3/15/2023
    527 :a4, 00:00, 516s
    section 3/23/2023
    2747 :a4, 00:00, 762s
    section 3/24/2023
    2747 :a4, 00:00, 740s
    section 3/27/2023
    4072 :a4, 00:00, 863s
    section 3/29/2023
    4131 :a4, 00:00, 871s
    section 3/30/2023
    4131 :a4, 00:00, 864s
    section 3/31/2023
    33507 :a4, 00:00, 3570s
```
April 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 4/03/2023
    4698 :a4, 00:00, 913s
    section 4/04/2023
    4698 :a4, 00:00, 911s
    section 4/05/2023
    4698 :a4, 00:00, 925s
    section 4/06/2023
    4697 :a4, 00:00, 829s
    section 4/07/2023
    10628 :a4, 00:00, 1220s
    section 4/08/2023
    33322 updated (4 removed) :a4, 00:00, 3406s
    section 4/09/2023
    5617 :a4, 00:00, 858s
    section 4/10/2023
    5061 :a4, 00:00, 799s
    section 4/11/2023
    5063 :a4, 00:00, 832s
    section 4/12/2023
    5063 :a4, 00:00, 889s
    section 4/13/2023
    5063 :a4, 00:00, 837s
```
May 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 5/09/2023
    8251 :a4, 00:00, 1062s
    section 5/10/2023
    8251 :a4, 00:00, 1015s
    section 5/11/2023
    8251 :a4, 00:00, 1100s
    section 5/15/2023
    8805 :a4, 00:00, 1129s
    section 5/16/2023
    8734 :a4, 00:00, 1134s
    section 5/17/2023
    8734 :a4, 00:00, 1111s
    section 5/18/2023
    8734 :a4, 00:00, 1629s
    section 5/19/2023
    8688 :a4, 00:00, 1693s
    section 5/23/2023
    6860 :a4, 00:00, 1429s
    section 5/24/2023
    6850 :a4, 00:00, 1443s
    section 5/25/2023
    6848 :a4, 00:00, 1410s
    section 5/26/2023
    6892 :a4, 00:00, 1444s
    section 5/30/2023
    8409 :a4, 00:00, 2921s
    section 5/31/2023
    8291 :a4, 00:00, 1647s
```
June 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 6/1/2023
    8188 :a4, 00:00, 1547s
    section 6/2/2023
    23911 :a4, 00:00, 4234s
    section 6/20/2023
    11099 :a4, 00:00, 3551s
    section 6/21/2023
    11099 :a4, 00:00, 3621s
    section 6/22/2023
    8485 :a4, 00:00, 39m
    section 6/26/2023
    925 :a4, 00:00, 756s
    section 6/27/2023
    0 :a4, 00:00, 456s
    section 6/28/2023
    100 :a4, 00:00, 506s
    section 6/29/2023
    0 :a4, 00:00, 461s
    section 6/30/2023
    100 :a4, 00:00, 488s
```
July 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 7/3/2023
    0 (removed) | 100 (updated) | 2324 (manual) :a4, 00:00, 525s
    section 7/4/2023
    0 (removed) | 100 (updated) | 2324 (manual) :a4, 00:00, 502s
    section 7/5/2023
    0 (removed) | 0 (updated) | 0 (manual) :a4, 00:00, 463s
    section 7/6/2023
    0 (removed) | 100 (updated) | 0 (manual) :a4, 00:00, 492s
    section 7/10/2023
    1 (removed) | 1166 (updated) | 3603 (manual) :a4, 00:00, 893s
    section 7/11/2023
    0 (removed) | 100 (updated) | 123 (manual) :a4, 00:00, 505s
    section 7/12/2023
    0 (removed) | 100 (updated) | 123 (manual) :a4, 00:00, 495s
    section 7/13/2023
    0 (removed) | 0 (updated) | 123 (manual) :a4, 00:00, 486s
    section 7/14/2023
    0 (removed) | 100 (updated) | 123 (manual) :a4, 00:00, 489s
    section 7/26/2023
    0 (removed) | 100 (updated) | 2344 (manual) :a4, 00:00, 522s
    section 7/27/2023
    81 (removed) | 271 (updated) | 221 (manual) :a4, 00:00, 649s
    section 7/31/2023
    0 (removed) | 99 (updated) | 1912 (manual) :a4, 00:00, 547s
```
August 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 8/1/2023
    0 (removed) | 96 (updated) | 1912 (manual) :a4, 00:00, 558s
    section 8/2/2023
    0 (removed) | 99 (updated) | 1912 (manual) :a4, 00:00, 547s
    section 8/3/2023
    45 (removed) | 312 (updated) | 225 (manual) :a4, 00:00, 744s
    section 8/4/2023
    0 (removed) | 2719 (updated) | 330 (manual) :a4, 00:00, 1294s
    section 8/7/2023
    5 (removed) | 1102 (updated) | 1422 (manual) :a4, 00:00, 907s
    section 8/8/2023
    0 (removed) | 1574 (updated) | 1422 (manual) :a4, 00:00, 989s
    section 8/9/2023
    0 (removed) | 101 (updated) | 1422 (manual) :a4, 00:00, 557s
    section 8/10/2023
    0 (removed) | 42 (updated) | 1422 (manual) :a4, 00:00, 536s
    section 8/11/2023
    56 (removed) | 40 (updated) | 1451 (manual) :a4, 00:00, 570s
    section 8/22/2023
    0 (removed) | 18 (updated) | 5306 (manual) :a4, 00:00, 529s
    section 8/23/2023
    0 (removed) | 19 (updated) | 5306 (manual) :a4, 00:00, 517s
    section 8/24/2023
    82 (removed) | 156 (updated) | 5306 (manual) :a4, 00:00, 774s
    section 8/27/2023
    18 (removed) | 677 (updated) | 1154 (manual) :a4, 00:00, 976s
    section 8/28/2023
    0 (removed) | 34 (updated) | 1154 (manual) :a4, 00:00, 518s
    section 8/29/2023
    0 (removed) | 25 (updated) | 2828 (manual) :a4, 00:00, 513s
    section 8/30/2023
    0 (removed) | 0 (updated) | 1156 (manual) :a4, 00:00, 508s
    section 8/31/2023
    36 (removed) | 2166 (updated) | 216 (manual) :a4, 00:00, 1760s
```
September 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 9/11/2023
    0 (removed) | 24 (updated) | 216 (manual) :a4, 00:00, 511s
    section 9/13/2023
    0 (removed) | 0 (updated) | 216 (manual) :a4, 00:00, 518s
    section 9/14/2023
    0 (removed) | 24 (updated) | 216 (manual) :a4, 00:00, 496s
    section 9/19/2023
    0 (removed) | 24 (updated) | 223 (manual) :a4, 00:00, 506s
    section 9/20/2023
    0 (removed) | 610 (updated) | 225 (manual) :a4, 00:00, 670s
    section 9/21/2023
    0 (removed) | 37 (updated) | 225 (manual) :a4, 00:00, 490s
    section 9/22/2023
    0 (removed) | 34 (updated) | 225 (manual) :a4, 00:00, 491s
    section 9/25/2023
    0 (removed) | 152 (updated) | 225 (manual) :a4, 00:00, 529s
    section 9/26/2023
    0 (removed) | 3 (updated) | 225 (manual) :a4, 00:00, 514s
    section 9/27/2023
    0 (removed) | 2 (updated) | 225 (manual) :a4, 00:00, 492s
    section 9/28/2023
    0 (removed) | 2 (updated) | 226 (manual) :a4, 00:00, 484s
    section 9/29/2023
    0 (removed) | 1 (updated) | 232 (manual) :a4, 00:00, 486s
```
October 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 10/02/2023
    32 (removed) | 4671 (updated) | 472 (manual) :a4, 00:00, 2097s
    section 10/03/2023
    0 (removed) | 17070 (updated) | 1961 (manual) :a4, 00:00, 3792s
    section 10/04/2023
    9 (removed) | 6312 (updated) | 1964 (manual) :a4, 00:00, 3033s
    section 10/05/2023
    0 (removed) | 3829 (updated) | 1964 (manual) :a4, 00:00, 1374s
    section 10/06/2023
    9 (removed) | 707 (updated) | 1964 (manual) :a4, 00:00, 641s
    section 10/16/2023
    0 (removed) | 9667 (updated) | 5578 (manual) :a4, 00:00, 1241s
    section 10/19/2023
    0 (removed) | 64 (updated) | 140 (manual) :a4, 00:00, 518s
    section 10/20/2023
    0 (removed) | 42 (updated) | 140 (manual) :a4, 00:00, 520s
    section 10/23/2023
    0 (removed) | 1173 (updated) | 186 (manual) :a4, 00:00, 588s
    section 10/27/2023
    87 (removed) | 880 (updated) | 225 (manual) :a4, 00:00, 826s
```
November 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 11/20/2023
    0 (removed) | 4341 (updated) | 247 (manual) :a4, 00:00, 985s
    section 11/21/2023
    0 (removed) | 2435 (updated) | 243 (manual) :a4, 00:00, 657s
    section 11/22/2023
    0 (removed) | 1984 (updated) | 243 (manual) :a4, 00:00, 769s
    section 11/24/2023
    0 (removed) | 853 (updated) | 243 (manual) :a4, 00:00, 588s
    section 11/27/2023
    0 (removed) | 4123 (updated) | 252 (manual) :a4, 00:00, 988s
    section 11/29/2023
    0 (removed) | 2483 (updated) | 252 (manual) :a4, 00:00, 713s
    section 11/30/2023
    2 (removed) | 1794 (updated) | 252 (manual) :a4, 00:00, 566s
```
December 2023
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 12/01/2023
    0 (removed) | 1608 (updated) | 252 (manual) :a4, 00:00, 835s
    section 12/07/2023
    0 (removed) | 0 (updated) | 0 (manual) :a4, 00:00, 558s
    section 12/11/2023
    0 (removed) | 0 (updated) | 58 (manual) :a4, 00:00, 517s
    section 12/12/2023
    0 (removed) | 1 (updated) | 535 (manual) :a4, 00:00, 530s
    section 12/26/2023
    0 (removed) | 593 (updated) | 64 (manual) :a4, 00:00, 677s
```
January 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 01/04/2024
    0 (removed) | 0 (updated) | 55 (manual) :a4, 00:00, 543s
    section 01/08/2024
    0 (removed) | 0 (updated) | 190 (manual) :a4, 00:00, 625s
    section 01/12/2024
    0 (removed) | 1 (updated) | 190 (manual) :a4, 00:00, 590s
    section 01/16/2024
    0 (removed) | 6203 (updated) | 247 (manual) :a4, 00:00, 2427s
```
February 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 02/06/2024
    0 (removed) | 0 (updated) | 279 (manual) :a4, 00:00, 594s
    section 02/09/2024
    0 (removed) | 1977 (updated) | 279 (manual) :a4, 00:00, 1139s
    section 02/13/2024
    0 (removed) | 0 (updated) | 280 (manual) :a4, 00:00, 623s
    section 02/20/2024
    0 (removed) | 0 (updated) | 299 (manual) :a4, 00:00, 574s
```
March 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 03/04/2024
    0 (removed) | 0 (updated) | 924 (manual) :a4, 00:00, 576s
    section 03/06/2024
    0 (removed) | 1 (updated) | 0 (manual) :a4, 00:00, 591s
    section 03/08/2024
    0 (removed) | 5 (updated) | 1 (manual) :a4, 00:00, 575s
    section 03/18/2024
    0 (removed) | 0 (updated) | 333 (manual) :a4, 00:00, 531s
    section 03/20/2024
    0 (removed) | 0 (updated) | 333 (manual) :a4, 00:00, 531s
    section 03/22/2024
    0 (removed) | 1 (updated) | 333 (manual) :a4, 00:00, 537s
    section 03/26/2024
    0 (removed) | 0 (updated) | 333 (manual) :a4, 00:00, 534s
```
April 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 04/03/2024
    0 (removed) | 0 (updated) | 2474 (manual) :a4, 00:00, 520s
    section 04/23/2024
    0 (removed) | 0 (updated) | 429 (manual) :a4, 00:00, 520s
    section 04/25/2024
    0 (removed) | 0 (updated) | 429 (manual) :a4, 00:00, 555s
    section 04/29/2024
    0 (removed) | 0 (updated) | 429 (manual) :a4, 00:00, 534s
```
May 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 05/02/2024
    0 (removed) | 0 (updated) | 431 (manual) :a4, 00:00, 525s
    section 05/03/2024
    4 (removed) | 2 (updated) | 431 (manual) :a4, 00:00, 539s
```
June 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 06/05/2024
    0 (removed) | 69 (updated) | 978 (manual) :a4, 00:00, 555s
    section 06/07/2024
    0 (removed) | 0 (updated) | 978 (manual) :a4, 00:00, 526s
    section 06/10/2024
    0 (removed) | 0 (updated) | 1014 (manual) :a4, 00:00, 530s
    section 06/14/2024
    0 (removed) | 0 (updated) | 0 (manual) :a4, 00:00, 547s
    section 06/21/2024
    0 (removed) | 0 (updated) | 13 (manual) :a4, 00:00, 511s
    section 06/24/2024
    0 (removed) | 0 (updated) | 74 (manual) :a4, 00:00, 511s
```
July to November 2024
```mermaid
gantt
    title Mismatched Datasets per day (prod-only)
    dateFormat HH:mm
    axisFormat %H:%M
    section 07/01/2024
    0 (removed) | 0 (updated) | 135 (manual) :a4, 00:00, 538s
    section 07/16/2024
    0 (removed) | 0 (updated) | 183 (manual) :a4, 00:00, 534s
    section 08/19/2024
    0 (removed) | 0 (updated) | 6 (manual) :a4, 00:00, 479s
    section 09/24/2024
    0 (removed) | 0 (updated) | 174 (manual) :a4, 00:00, 537s
    section 10/23/2024
    0 (removed) | 0 (updated) | 26 (manual) :a4, 00:00, 547s
    section 11/14/2024
    0 (removed) | 0 (updated) | 27 (manual) :a4, 00:00, 540s
    section 11/19/2024
    0 (removed) | 0 (updated) | 42 (manual) :a4, 00:00, 571s
```
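
For anyone who wants to track the trend outside of the Mermaid charts, here is a minimal Python sketch that turns one of the three-count chart entries (the format used from July 2023 onward) into a record. The names `ENTRY` and `parse_entry` are hypothetical and not part of any repo tooling; the earlier single-count entries are deliberately not handled and come back as `None`.

```python
import re

# Matches the three-count entry format used in the charts, e.g.
# "0 (removed) | 100 (updated) | 2324 (manual) :a4, 00:00, 525s"
ENTRY = re.compile(
    r"(?P<removed>\d+)\s*\(removed\)\s*\|\s*"
    r"(?P<updated>\d+)\s*\(updated\)\s*\|\s*"
    r"(?P<manual>\d+)\s*\(manual\)\s*:\w+,\s*[\d:]+,\s*"
    r"(?P<amount>\d+(?:\.\d+)?)(?P<unit>[sm])"
)


def parse_entry(line):
    """Return the counts and the run time in seconds, or None if no match."""
    match = ENTRY.search(line)
    if match is None:
        return None
    # Durations in the charts are given in seconds ("525s") or minutes ("39m").
    factor = 60 if match["unit"] == "m" else 1
    return {
        "removed": int(match["removed"]),
        "updated": int(match["updated"]),
        "manual": int(match["manual"]),
        "seconds": float(match["amount"]) * factor,
    }
```

Feeding each task line of a chart through `parse_entry` gives a list of dicts that can be sorted or plotted by date.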
hkdctol commented 1 year ago

@FuhuXia will review and create issue.

nickumia-reisys commented 1 year ago

Not sure why db-solr-sync errored out on 6/22, but maybe cloud.gov needed to terminate the task for some reason? I think it was mostly finished doing what it needed to do. It's just weird..


FuhuXia commented 1 year ago
> 5306 packages without harvest_object need to be mannually deleted

hmmm... mannually -> manually typo.

This error means 5306 packages have discrepancies between the db and solr, but the db-solr-sync script does not know how to handle them. Even if you manually index them, CKAN will send a bogus harvest_object_id to solr, and you end up with the same discrepancies again.

The count (5306) will not go away until we do two steps.

  1. Run the command ckan geodatagov harvest-object-relink. This will fix packages that have a good but not current harvest_object_id. After this, another run of db-solr-sync or a manual reindex will fix the package. We want to run this command manually while catalog-fetch is idle, because an ongoing harvest job does not like this relink script.

  2. Run a batch delete via the API. This will purge the packages that have no harvest_object_id. All of our datasets are harvested, so all of them should have a harvest_object_id. A package without one is bad: the catalog has no way to manage it, and we should not hesitate to eliminate it. One scenario that can cause this is when the source of a duplicated package is removed: the good copy of the duplicate is deleted during harvesting, and the bad copy stays behind as a package without a harvest_object_id. I use this script.
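
As a rough illustration of step 2 (this is not the script referenced above, which is not shown in this thread), a batch purge could be sketched in Python against CKAN's `dataset_purge` action. The function names, `base_url`, `api_token`, and the list of package ids are all assumptions for the example; the token would need sysadmin rights, and the ids would come from whatever query identifies packages without a harvest_object_id.

```python
import json
import urllib.request


def build_purge_request(base_url, api_token, package_id):
    """Build a POST request for CKAN's dataset_purge action (hypothetical helper)."""
    body = json.dumps({"id": package_id}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/3/action/dataset_purge",
        data=body,
        headers={"Authorization": api_token, "Content-Type": "application/json"},
        method="POST",
    )


def purge_packages(base_url, api_token, package_ids, send=urllib.request.urlopen):
    """Purge each package id in turn and return the ids that could not be purged."""
    failed = []
    for package_id in package_ids:
        request = build_purge_request(base_url, api_token, package_id)
        try:
            with send(request) as response:
                result = json.load(response)
            if not result.get("success"):
                failed.append(package_id)
        except OSError:  # urllib.error.URLError and HTTPError are OSError subclasses
            failed.append(package_id)
    return failed
```

The `send` parameter exists only so the loop can be dry-run with a stub before pointing it at a real instance, which seems prudent given that a purge cannot be undone.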

After the two steps run, the count should be 0, but it is expected to become hundreds then thousands again in a couple weeks.

nickumia-reisys commented 1 year ago

> After the two steps run, the count should be 0, but it is expected to become hundreds then thousands again in a couple weeks.

"expected" 🤨 .... suuuuurrre.