dandi / dandi-archive

DANDI API server and Web app
https://dandiarchive.org
13 stars 12 forks source link

getting timeouts, 500s, and good number of errors logged -- need help with analysis/triage #1117

Closed yarikoptic closed 2 years ago

yarikoptic commented 2 years ago

While backing up dandisets on drogon,

we keep running into various 500s, timeouts etc ```shell (dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ PATH=/home/dandi/git-annexes/10.20220525+git57-ge796080f3-1~ndall+1/usr/lib/git-annex.linux:$PATH python -m tools.backups2datalad --pdb -l WARNING -J 5 --target /mnt/backup/dandi/dandisets update-from-backup --zarr-target /mnt/backup/dandi/dandizarrs --backup-remote dandi-dandisets-dropbox --zarr-backup-remote dandi-dandizarrs-dropbox --gh-org dandisets --zarr-gh-org dandizarrs 000108 A newer version (0.40.1) of dandi/dandi-cli is available. You are using 0.40.0 2022-06-10T10:46:04-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 1.000000 seconds as it raised ReadTimeout: 2022-06-10T10:46:36-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 1.800000 seconds as it raised ConnectTimeout: 2022-06-10T10:47:04-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 3.240000 seconds as it raised ConnectTimeout: 2022-06-10T10:47:36-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 5.832000 seconds as it raised ConnectTimeout: 2022-06-10T10:48:27-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 10.497600 seconds as it raised ConnectTimeout: 2022-06-10T10:49:06-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 18.895680 seconds as it raised ConnectTimeout: 2022-06-10T10:50:02-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 34.012224 seconds as it raised ConnectTimeout: 2022-06-10T10:51:12-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 61.222003 seconds as it raised ConnectTimeout: 2022-06-10T10:53:01-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 110.199606 seconds as it raised ConnectTimeout: 2022-06-10T10:55:22-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 198.359290 seconds as it raised ConnectTimeout: 2022-06-10T10:59:12-0400 [ERROR ] backups2datalad Operation failed with exception: Traceback (most recent call last): File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/backends/asyncio.py", line 101, in connect_tcp stream: anyio.abc.ByteStream = await anyio.connect_tcp( File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_sockets.py", line 218, in connect_tcp await event.wait() File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 658, in __aexit__ raise CancelledError asyncio.exceptions.CancelledError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions yield File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/backends/asyncio.py", line 101, in connect_tcp stream: anyio.abc.ByteStream = await anyio.connect_tcp( File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_tasks.py", line 118, in __exit__ raise TimeoutError TimeoutError ... ```

and I have difficulty establishing reliable dump of logs from heroku but did archive some, e.g. recent ones:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ ls -ld 202206*
-rw------- 1 dandi dandi-internal        0 Jun  7 16:47 20220607-1647.log
-rw------- 1 dandi dandi-internal        0 Jun  7 16:48 20220607-1648.log
-rw------- 1 dandi dandi-internal   581793 Jun  7 17:01 20220607-1653.log
-rw------- 1 dandi dandi-internal  4370876 Jun  7 18:01 20220607-1701.log
-rw------- 1 dandi dandi-internal 29223827 Jun  8 12:01 20220608-1101.log
-rw------- 1 dandi dandi-internal 21157733 Jun  8 21:32 20220608-2001.log
-rw------- 1 dandi dandi-internal    34680 Jun 10 11:00 20220610-1052.log
-rw------- 1 dandi dandi-internal  4345281 Jun 10 12:01 20220610-1101.log
-rw------- 1 dandi dandi-internal  4251070 Jun 11 13:46 20220611-0401.log
-rw------- 1 dandi dandi-internal    35077 Jun 11 13:49 20220611-1347.log

where is the brief summary of errors without sql_error_code logged:

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202206* | grep -v sql_error_code | sed -e 's,^.*+00:00 ,,g' | sort | uniq -c  | sort -n | nl | tail -n 20
   112        1 app[web.1]: [15:42:49] ERROR    Error calculating assetsSummary               version.py:230
   113        1 app[worker.1]: [2022-06-08 15:12:32,457: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   114        1 app[worker.1]: [2022-06-08 15:12:32,809: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   115        1 app[worker.1]: [2022-06-08 15:12:32,988: ERROR/ForkPoolWorker-5] dandiapi.api.tasks.validate_version_metadata[920529a0-b3d0-4ed4-a077-870fac9d1210]: Error while validating version 500
   116        1 app[worker.1]: [2022-06-08 15:12:33,226: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   117        1 app[worker.1]: [2022-06-08 15:43:33,730: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   118        1 app[worker.1]: [2022-06-08 15:43:33,957: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   119        1 app[worker.1]: [2022-06-08 15:43:34,242: ERROR/ForkPoolWorker-5] dandiapi.api.tasks.validate_version_metadata[4e26a56d-24fc-4345-896b-7e718e9f3c2a]: Error while validating version 500
   120        1 app[worker.1]: [2022-06-08 15:43:34,451: ERROR/ForkPoolWorker-5] Error calculating assetsSummary
   121        2 app[postgres.2774731]: [DATABASE] [10-33]  NULL, "status" = 'Pending', "validation_errors" = '[]' WHERE "api_version"."id" = 416
   122        2 app[postgres.2774731]: [DATABASE] [9-34]  NULL, "status" = 'Pending', "validation_errors" = '[]' WHERE "api_version"."id" = 416
   123        2 app[web.1]: raise ValueError("Provided metadata has
   124        2 app[web.1]: ValueError: Provided metadata has no schema
   125        2 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [<ValidationError: "'schemaKey' is a required property">]
   126        2 app[worker.1]: raise JsonschemaValidationError(error_list)
   127        6 app[worker.1]: raise ValueError("Provided metadata has no schema version")
   128        6 app[worker.1]: ValueError: Provided metadata has no schema version
   129      112 heroku[worker.1]: Error R14 (Memory quota exceeded)
   130      398 heroku[web.1]: Error R14 (Memory quota exceeded)
   131      526 heroku[manifest-worker.1]: Error R14 (Memory quota exceeded)

and for sql_error (including 0000 -- didn't check if legit or not)

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202206* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | sort | uniq -c  | sort -n | nl | tail -n 10
  3131        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  UPDATE "api_version" SET "created" = '2020-06-24T18:34:50.022000+00:00'::timestamptz, "modified" = '2022-06-07T21:58:20.235790+00:00'::timestamptz, "dandiset_id" = 26, "name" = 'Human brain cell census for BA 44/45', "metadata" = '{"id": "DANDI:000026/draft", "url": "https://dandiarchive.org/dandiset/000026/draft", "name": "Human brain cell census for BA 44/45", "about": [{"name": "Broca\u2019s Area", "schemaKey": "GenericType"}, {"name": "Motor Cortex", "schemaKey": "GenericType"}], "access": [{"status": "dandi:OpenAccess", "schemaKey": "AccessRequirements", "contactPoint": {"schemaKey": "ContactPoint"}}], "license": ["spdx:CC-BY-4.0"], "version": "draft", "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "citation": "Mazzamuto, Giacomo; Costantini, Irene; Gavryusev, Vladislav; Castelli, Filippo Maria;
  3132        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  UPDATE "api_version" SET "created" = '2020-06-24T18:34:50.022000+00:00'::timestamptz, "modified" = '2022-06-07T21:59:21.697774+00:00'::timestamptz, "dandiset_id" = 26, "name" = 'Human brain cell census for BA 44/45', "metadata" = '{"id": "DANDI:000026/draft", "url": "https://dandiarchive.org/dandiset/000026/draft", "name": "Human brain cell census for BA 44/45", "about": [{"name": "Broca\u2019s Area", "schemaKey": "GenericType"}, {"name": "Motor Cortex", "schemaKey": "GenericType"}], "access": [{"status": "dandi:OpenAccess", "schemaKey": "AccessRequirements", "contactPoint": {"schemaKey": "ContactPoint"}}], "license": ["spdx:CC-BY-4.0"], "version": "draft", "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "citation": "Mazzamuto, Giacomo; Costantini, Irene; Gavryusev, Vladislav; Castelli, Filippo Maria;
  3133        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.
  3134        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  could not accept SSL connection: unsupported protocol
  3135        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 DETAIL:  Process holding the lock: 2161597. Wait queue: 2161700.
  3136        3 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  UPDATE "api_version" SET "created" = '2020-06-24T18:34:50.022000+00:00'::timestamptz, "modified" = '2022-06-07T21:00:17.827877+00:00'::timestamptz, "dandiset_id" = 26, "name" = 'Human brain cell census for BA 44/45', "metadata" = '{"id": "DANDI:000026/draft", "url": "https://dandiarchive.org/dandiset/000026/draft", "name": "Human brain cell census for BA 44/45", "about": [{"name": "Broca\u2019s Area", "schemaKey": "GenericType"}, {"name": "Motor Cortex", "schemaKey": "GenericType"}], "access": [{"status": "dandi:OpenAccess", "schemaKey": "AccessRequirements", "contactPoint": {"schemaKey": "ContactPoint"}}], "license": ["spdx:CC-BY-4.0"], "version": "draft", "@context": "https://raw.githubusercontent.com/dandi/schema/master/releases/0.6.0/context.json", "citation": "Mazzamuto, Giacomo; Costantini, Irene; Gavryusev, Vladislav; Castelli, Filippo Maria;
  3137        5 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 DETAIL:  Process holding the lock: 2159107. Wait queue: 2159246.
  3138        7 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 STATEMENT:  SELECT "api_asset"."id", "api_asset"."created", "api_asset"."modified", "api_asset"."asset_id", "api_asset"."path", "api_asset"."blob_id", "api_asset"."embargoed_blob_id", "api_asset"."zarr_id", "api_asset"."metadata", "api_asset"."status", "api_asset"."validation_errors", "api_asset"."previous_id", "api_asset"."published", "api_assetblob"."id", "api_assetblob"."created", "api_assetblob"."modified", "api_assetblob"."blob_id", "api_assetblob"."sha256", "api_assetblob"."etag", "api_assetblob"."size", "api_assetblob"."blob", "api_embargoedassetblob"."id", "api_embargoedassetblob"."created", "api_embargoedassetblob"."modified", "api_embargoedassetblob"."blob_id", "api_embargoedassetblob"."sha256", "api_embargoedassetblob"."etag", "api_embargoedassetblob"."size", "api_embargoedassetblob"."blob", "api_embargoedassetblob"."dandiset_id",
  3139       17 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 LOG:  checkpoint starting: time
  3140      113 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 00000 CONTEXT:  while updating tuple (X,XX) in relation "api_version"

and without 00000

(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202206* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c  | sort -n | nl 
     1        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 0.0: server supports 2.0 to 3.0
     2        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 16.0: server supports 2.0 to 3.0
     3        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 FATAL:  unsupported frontend protocol 255.255: server supports 2.0 to 3.0
     4        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 0A000 time_ms = "2022-06-11 17:41:19.225 UTC" pid="9031" proc_start_time="2022-06-11 17:41:19 UTC" session_id="62a4d3bf.2347" vtid="" tid="0" log_line="1" database="[unknown]" connection_source="192.241.221.234(X,XX)" user="[unknown]" application_name="[unknown]" FATAL:  unsupported frontend protocol 16.0: server supports 2.0 to 3.0
     5        1 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 28000 FATAL:  no PostgreSQL user name specified in startup packet
     6        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 HINT:  This may indicate that the client does not support any SSL protocol version between TLSv1.2 and TLSv1.3.
     7        2 app[postgres.XXX]: [DATABASE] [X-XX]  sql_error_code = 08P01 LOG:  could not accept SSL connection: unsupported protocol
the same non-0 sqlerrors in may ```shell (dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202205* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c | sort -n | nl 1 2 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "106.75.190.116", user "postgres", database "template0", SSL off 2 3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 16.0: server supports 2.0 to 3.0 3 5 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "193.106.191.48", user "postgres", database "bbbbbbb", SSL off 4 8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "128.14.141.42", user "gmcnkN", database "--help", SSL off 5 8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "128.14.141.42", user "postgres", database "postgres", SSL off 6 8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28P01 DETAIL: Connection matched pg_hba.conf line 7: "hostssl all all 0.0.0.0/0 md5 " 7 8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28P01 FATAL: password authentication failed for user "postgres" 8 11 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 0.0: server supports 2.0 to 3.0 9 11 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 255.255: server supports 2.0 to 3.0 10 11 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no PostgreSQL user name specified in startup packet 11 43 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: invalid length of startup packet ```
edit1: those few which we observed in Aug of last 2021 year ```shell (dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202108* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c | sort -n | nl 1 1 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: incomplete message from client 2 1 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: incomplete startup packet 3 1 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "183.136.226.2", user "postgres", database "template0", SSL off 4 2 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: invalid length of startup packet 5 2 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 16.0: server supports 2.0 to 3.0 6 3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 0.0: server supports 2.0 to 3.0 7 3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 255.255: server supports 2.0 to 3.0 8 3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no PostgreSQL user name specified in startup packet ```

IMHO someone with better knowledge of those systems should review/analyze and report on either they are all "benign" or some require attention/action.

waxlamp commented 2 years ago

This looks to me like a worker crashed when it used too much memory, and that caused other errors to appear.

@mvandenburgh has been working on solving memory usage errors, so I wonder if we will seeing fewer errors like this.

@danlamanna, could you look these over and add your own insight as to what might be going on?

mvandenburgh commented 2 years ago

I see it mentions assetSummary calculation - I fixed a big memory issue with that in https://github.com/dandi/dandi-archive/pull/1159, that might have fixed this.

yarikoptic commented 2 years ago
FWIW I do not see assetSummary errors in today's logs (so might indeed be mitigated by now -- yeay!) but some new types of errors seems to bubble up - never a boring day ```shell (base) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 20220706* | grep -v sql_error_code | sed -e 's,^.*+00:00 ,,g' | sort | uniq -c | sort -n | nl | tail -n 20 22084 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T17:57:25.862123+00:00. 22085 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T17:58:25.919928+00:00. 22086 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T18:58:58.728623+00:00. 22087 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T18:59:28.649057+00:00. 22088 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 3 messages since 2022-07-06T19:59:31.397667+00:00. 22089 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 4 messages since 2022-07-06T17:53:12.387464+00:00. 22090 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 4 messages since 2022-07-06T17:59:25.922625+00:00. 22091 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 4 messages since 2022-07-06T18:59:25.730034+00:00. 22092 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 6 messages since 2022-07-06T18:59:23.790925+00:00. 22093 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 8 messages since 2022-07-06T18:59:35.182099+00:00. 22094 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 9 messages since 2022-07-06T16:59:57.066035+00:00. 22095 2 heroku[logplex]: Error L10 (output buffer overflow): drain 'd.3a7ec4ce-ec2a-4a97-9063-bac84479d38a' dropped 9 messages since 2022-07-06T18:59:26.93824+00:00. 22096 12 app[web.2]: raise ValueError("Provided metadata has 22097 12 app[web.2]: ValueError: Provided metadata has no schema 22098 16 app[web.1]: raise ValueError("Provided metadata has 22099 16 app[web.1]: ValueError: Provided metadata has no schema 22100 17 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [] 22101 17 app[worker.1]: raise JsonschemaValidationError(error_list) 22102 51 app[worker.1]: raise ValueError("Provided metadata has no schema version") 22103 51 app[worker.1]: ValueError: Provided metadata has no schema version ```

with prevalent one to be ValueError: Provided metadata has no schema version . looking at the sample might explain why we don't see any other error from asset summary calculation:

2022-07-06T20:55:34.486436+00:00 app[worker.1]: [2022-07-06 20:55:34,486: INFO/ForkPoolWorker-4] Error calculating assetsSummary
2022-07-06T20:55:34.486438+00:00 app[worker.1]: Traceback (most recent call last):
2022-07-06T20:55:34.486438+00:00 app[worker.1]: File "/app/dandiapi/api/tasks/__init__.py", line 138, in validate_version_metadata
2022-07-06T20:55:34.486439+00:00 app[worker.1]: validate(metadata, schema_key='PublishedDandiset', json_validation=True)
2022-07-06T20:55:34.486439+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 188, in validate
2022-07-06T20:55:34.486440+00:00 app[worker.1]: _validate_obj_json(obj, schema, missing_ok)
2022-07-06T20:55:34.486442+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 117, in _validate_obj_json
2022-07-06T20:55:34.486442+00:00 app[worker.1]: raise JsonschemaValidationError(error_list)
2022-07-06T20:55:34.486442+00:00 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [<ValidationError: "'schemaKey' is a required property">]
2022-07-06T20:55:34.486442+00:00 app[worker.1]:
2022-07-06T20:55:34.486443+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2022-07-06T20:55:34.486443+00:00 app[worker.1]:
2022-07-06T20:55:34.486443+00:00 app[worker.1]: Traceback (most recent call last):
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/dandiapi/api/models/version.py", line 229, in _populate_metadata
2022-07-06T20:55:34.486444+00:00 app[worker.1]: summary = aggregate_assets_summary(
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 334, in aggregate_assets_summary
2022-07-06T20:55:34.486444+00:00 app[worker.1]: _add_asset_to_stats(meta, stats)
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 266, in _add_asset_to_stats
2022-07-06T20:55:34.486445+00:00 app[worker.1]: raise ValueError("Provided metadata has no schema version")
2022-07-06T20:55:34.486445+00:00 app[worker.1]: ValueError: Provided metadata has no schema version
2022-07-06T20:55:34.491628+00:00 app[worker.1]: [2022-07-06 20:55:34,491: INFO/ForkPoolWorker-4] Task dandiapi.api.tasks.validate_version_metadata[7f707fb6-0581-4eb6-9465-596186c4b2aa] succeeded in 0.9852942840661854s: None

I think it would be useful at File "/app/dandiapi/api/models/version.py", line 229 to try/except and issue an ERROR message with details on what dandiset this is happening?

and what is that buffer L10 error? we have over 20k log lines today only

waxlamp commented 2 years ago

L10 errors are transient spikes in log volume that the log consumer can't keep up with (and must therefore drop some messages). I gather it can happen if a log producer emits a sudden high volume of logs at too high a rate. It's disconnected from total daily log capacity.

As for the ValueError I think a better solution is to address https://github.com/dandi/dandi-schema/issues/127 (specifically, see this comment. If that ValueError actually should be an unhandled exception because there is something bad happening, then solving that issue would produce Sentry reports, complete with context information about dandisets, etc., that we could respond to more concretely. I'm going to close this issue in favor of that approach (and then we can address specific issues with validation as they arise).

yarikoptic commented 2 years ago

for now I think it is ok to assume that ValueError raised while calling a validation function is due to validation failure. I think it is needed to investigate how ValueError: Provided metadata has no schema version could come about (check the logs/traceback there to see how it got there) since AFAIK dandi-cli shouldn't provide such records, so might be somewhere on web frontend.

satra commented 2 years ago

also when a metadata of an asset is saved on the server side it should inject latest schemaVersion and validate if not provided or reject the post.

satra commented 2 years ago

i.e. it should never save metadata without a schemaVersion.