CatalogueOfLife / backend

Complete backend of COL ChecklistBank
Apache License 2.0

GSD1010 export missing 17,957 species #380

Closed gdower closed 5 years ago

gdower commented 5 years ago

The scientific_names table from the CoL+ exporter only contains 15,975 records for GSD1010 (FishBase), when it should have 33,932 living species and 55 living infraspecies:

mysql> SELECT count(*) FROM colplus_export.scientific_names WHERE database_id=1010 AND sp2000_status_id IN (1,4);
+----------+
| count(*) |
+----------+
|    15975 |
+----------+
1 row in set (0.53 sec)

[Image: CoL+ vs CoL- comparison]

The import metrics page reports acef:AcceptedSpecies=33,932 and acef:AcceptedInfraSpecificTaxa=55, so I think this is likely a bug with the CoL+ to Assembly_Global exporter.
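As a quick sanity check on the figures quoted above, here is a minimal sketch of the arithmetic. The variable names are illustrative only and are not taken from the exporter or the import metrics API. It suggests the 17,957 in the issue title counts species only; adding the 55 infraspecies would make the shortfall slightly larger.

```python
# Figures quoted in this issue; variable names are hypothetical.
expected_species = 33_932        # acef:AcceptedSpecies from the import metrics page
expected_infraspecies = 55       # acef:AcceptedInfraSpecificTaxa from the same page
exported_records = 15_975        # count(*) returned by the MySQL query above

# Shortfall counting species only (matches the issue title).
missing_species = expected_species - exported_records

# Shortfall if the 55 infraspecies are expected in the export as well.
missing_with_infraspecies = (
    expected_species + expected_infraspecies - exported_records
)

print(missing_species)            # 17957
print(missing_with_infraspecies)  # 18012
```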

mdoering commented 5 years ago

As I wiped prod today I cannot verify, but I would guess it is related to the CoL assembly (sync) rather than the export itself, which is GSD-agnostic.

gdower commented 5 years ago

I tried to re-sync the sectors for GSD1010 and one did fail while trying to re-sync, but I didn't find any errors in the Kibana logs for any of those sectors before I re-synced:

PSQLException: ERROR: deadlock detected
Detail: Process 25428 waits for AccessShareLock on relation 21549042 of database 21548559; blocked by process 26838. Process 26838 waits for AccessExclusiveLock on relation 21549072 of database 21548559; blocked by process 25428.
Hint: See server log for query details.

(Kibana filter: sector 6813, attempt 1, environment col-prod)

mdoering commented 5 years ago

The importer looks fine with 33,932 living species. I am re-syncing the 6 sectors on prod right now.

mdoering commented 5 years ago

Results: 102,501 usages; 39,748 taxa; 33,931 accepted species. Nearly perfect.

mdoering commented 5 years ago

@gdower how does the export look? Can we close the issue?

gdower commented 5 years ago

Conversion probably won't finish until tomorrow, but I'll re-open this if we have the same problem. Previously, the CoL+ clearinghouse import statistics were correct but the CoL+ export statistics were incorrect.