Closed: yarikoptic closed this issue 2 years ago
This looks to me like a worker crashed when it used too much memory, and that caused other errors to appear.
@mvandenburgh has been working on solving memory usage errors, so I wonder if we will start seeing fewer errors like this.
@danlamanna, could you look these over and add your own insight as to what might be going on?
I see it mentions assetsSummary calculation. I fixed a big memory issue with that in https://github.com/dandi/dandi-archive/pull/1159, which might have fixed this.
The most prevalent one is `ValueError: Provided metadata has no schema version`. Looking at the sample might explain why we don't see any other error from the asset summary calculation:
```
2022-07-06T20:55:34.486436+00:00 app[worker.1]: [2022-07-06 20:55:34,486: INFO/ForkPoolWorker-4] Error calculating assetsSummary
2022-07-06T20:55:34.486438+00:00 app[worker.1]: Traceback (most recent call last):
2022-07-06T20:55:34.486438+00:00 app[worker.1]: File "/app/dandiapi/api/tasks/__init__.py", line 138, in validate_version_metadata
2022-07-06T20:55:34.486439+00:00 app[worker.1]: validate(metadata, schema_key='PublishedDandiset', json_validation=True)
2022-07-06T20:55:34.486439+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 188, in validate
2022-07-06T20:55:34.486440+00:00 app[worker.1]: _validate_obj_json(obj, schema, missing_ok)
2022-07-06T20:55:34.486442+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 117, in _validate_obj_json
2022-07-06T20:55:34.486442+00:00 app[worker.1]: raise JsonschemaValidationError(error_list)
2022-07-06T20:55:34.486442+00:00 app[worker.1]: dandischema.exceptions.JsonschemaValidationError: [<ValidationError: "'schemaKey' is a required property">]
2022-07-06T20:55:34.486442+00:00 app[worker.1]:
2022-07-06T20:55:34.486443+00:00 app[worker.1]: During handling of the above exception, another exception occurred:
2022-07-06T20:55:34.486443+00:00 app[worker.1]:
2022-07-06T20:55:34.486443+00:00 app[worker.1]: Traceback (most recent call last):
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/dandiapi/api/models/version.py", line 229, in _populate_metadata
2022-07-06T20:55:34.486444+00:00 app[worker.1]: summary = aggregate_assets_summary(
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 334, in aggregate_assets_summary
2022-07-06T20:55:34.486444+00:00 app[worker.1]: _add_asset_to_stats(meta, stats)
2022-07-06T20:55:34.486444+00:00 app[worker.1]: File "/app/.heroku/python/lib/python3.9/site-packages/dandischema/metadata.py", line 266, in _add_asset_to_stats
2022-07-06T20:55:34.486445+00:00 app[worker.1]: raise ValueError("Provided metadata has no schema version")
2022-07-06T20:55:34.486445+00:00 app[worker.1]: ValueError: Provided metadata has no schema version
2022-07-06T20:55:34.491628+00:00 app[worker.1]: [2022-07-06 20:55:34,491: INFO/ForkPoolWorker-4] Task dandiapi.api.tasks.validate_version_metadata[7f707fb6-0581-4eb6-9465-596186c4b2aa] succeeded in 0.9852942840661854s: None
```
I think it would be useful, at `File "/app/dandiapi/api/models/version.py", line 229`, to try/except and issue an ERROR message with details on which dandiset this is happening to.
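A minimal sketch of that suggestion (hypothetical names; the real code lives in `_populate_metadata` in `version.py`): wrap the summary aggregation, log an ERROR that includes the dandiset identifier, and fall back to an empty summary:

```python
import logging

logger = logging.getLogger(__name__)


def safe_assets_summary(aggregate, metadata_list, dandiset_id):
    """Hypothetical wrapper: compute the assets summary, but log a detailed
    ERROR (including the dandiset id) instead of letting ValueError bubble up."""
    try:
        return aggregate(metadata_list)
    except ValueError as e:
        logger.error(
            "Error calculating assetsSummary for dandiset %s: %s", dandiset_id, e
        )
        # Fall back to an empty summary so the rest of validation can proceed
        return {"schemaKey": "AssetsSummary", "numberOfBytes": 0, "numberOfFiles": 0}
```

That way the worker log would at least tell us which dandiset to look at, instead of just reporting that the task "succeeded".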
And what is that buffer L10 error? We have over 20k log lines today alone.
L10 errors are transient spikes in log volume that the log consumer can't keep up with (and must therefore drop some messages). I gather it can happen when a log producer emits a sudden burst of logs at too high a rate; it is unrelated to total daily log capacity.
As for the `ValueError`: I think a better solution is to address https://github.com/dandi/dandi-schema/issues/127 (specifically, see this comment). If that `ValueError` actually should be an unhandled exception because something bad is happening, then solving that issue would produce Sentry reports, complete with context information about the dandisets involved, etc., that we could respond to more concretely. I'm going to close this issue in favor of that approach (and then we can address specific issues with validation as they arise).
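For illustration, one shape that fix could take (a sketch with hypothetical names, not the actual dandi-schema code): raise a dedicated exception type for expected metadata problems, so callers can treat those as validation failures, while any other exception propagates unhandled and would land in Sentry with full context:

```python
class AssetStatsError(Exception):
    """Hypothetical dedicated exception for expected metadata problems."""


def add_asset_to_stats(meta: dict, stats: dict) -> None:
    # Expected problem: raise the dedicated type, not a bare ValueError,
    # so callers can distinguish "invalid metadata" from internal bugs.
    if "schemaVersion" not in meta:
        raise AssetStatsError("Provided metadata has no schema version")
    stats["numberOfFiles"] = stats.get("numberOfFiles", 0) + 1


def aggregate(metas):
    stats: dict = {}
    for meta in metas:
        try:
            add_asset_to_stats(meta, stats)
        except AssetStatsError as e:
            # Validation failure: record it and keep going.
            stats.setdefault("errors", []).append(str(e))
        # Any other exception escapes and would be reported to Sentry.
    return stats
```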
For now I think it is ok to assume that a `ValueError` raised while calling a validation function is due to a validation failure. We do need to investigate how `ValueError: Provided metadata has no schema version` could come about (check the logs/traceback to see how it got there), since AFAIK dandi-cli shouldn't produce such records, so it might originate somewhere in the web frontend.
Also, when asset metadata is saved on the server side, the server should inject the latest schemaVersion (and validate) if one is not provided, or reject the POST, i.e. it should never save metadata without a schemaVersion.
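A sketch of that server-side guard (hypothetical names and an assumed schema version, not the actual dandi-archive code):

```python
# Assumed "latest" schema version, for illustration only
LATEST_SCHEMA_VERSION = "0.6.0"


def normalize_metadata(metadata: dict, strict: bool = False) -> dict:
    """Never persist metadata without a schemaVersion: either inject the
    latest one (default) or, in strict mode, reject the POST outright."""
    if "schemaVersion" not in metadata:
        if strict:
            # A real server would translate this into an HTTP 400 response.
            raise ValueError("metadata has no schemaVersion")
        # Copy rather than mutate the caller's dict.
        metadata = {**metadata, "schemaVersion": LATEST_SCHEMA_VERSION}
    return metadata
```

Either behavior would make the `Provided metadata has no schema version` failure impossible by construction.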
While backing up dandisets on drogon, we keep running into various 500s, timeouts, etc.:
```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ PATH=/home/dandi/git-annexes/10.20220525+git57-ge796080f3-1~ndall+1/usr/lib/git-annex.linux:$PATH python -m tools.backups2datalad --pdb -l WARNING -J 5 --target /mnt/backup/dandi/dandisets update-from-backup --zarr-target /mnt/backup/dandi/dandizarrs --backup-remote dandi-dandisets-dropbox --zarr-backup-remote dandi-dandizarrs-dropbox --gh-org dandisets --zarr-gh-org dandizarrs 000108
A newer version (0.40.1) of dandi/dandi-cli is available. You are using 0.40.0
2022-06-10T10:46:04-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 1.000000 seconds as it raised ReadTimeout:
2022-06-10T10:46:36-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 1.800000 seconds as it raised ConnectTimeout:
2022-06-10T10:47:04-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 3.240000 seconds as it raised ConnectTimeout:
2022-06-10T10:47:36-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 5.832000 seconds as it raised ConnectTimeout:
2022-06-10T10:48:27-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 10.497600 seconds as it raised ConnectTimeout:
2022-06-10T10:49:06-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 18.895680 seconds as it raised ConnectTimeout:
2022-06-10T10:50:02-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 34.012224 seconds as it raised ConnectTimeout:
2022-06-10T10:51:12-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 61.222003 seconds as it raised ConnectTimeout:
2022-06-10T10:53:01-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 110.199606 seconds as it raised ConnectTimeout:
2022-06-10T10:55:22-0400 [WARNING ] backups2datalad Retrying GET request to https://api.dandiarchive.org/api/assets/6164186b-16c2-4bc0-9036-6804f5934019/ in 198.359290 seconds as it raised ConnectTimeout:
2022-06-10T10:59:12-0400 [ERROR   ] backups2datalad Operation failed with exception:
Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/backends/asyncio.py", line 101, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_sockets.py", line 218, in connect_tcp
    await event.wait()
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 658, in __aexit__
    raise CancelledError
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/_exceptions.py", line 8, in map_exceptions
    yield
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/httpcore/backends/asyncio.py", line 101, in connect_tcp
    stream: anyio.abc.ByteStream = await anyio.connect_tcp(
  File "/home/dandi/miniconda3/envs/dandisets/lib/python3.8/site-packages/anyio/_core/_tasks.py", line 118, in __exit__
    raise TimeoutError
TimeoutError
...
```

I have difficulty establishing a reliable dump of logs from heroku, but I did archive some, e.g. recent ones:
Here is a brief summary of the errors with a sql_error_code logged, excluding sql_error_code = 00000 (didn't check whether those are legit or not): the non-zero sql_error_codes in May:
```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202205* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c | sort -n | nl
 1   2 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "106.75.190.116", user "postgres", database "template0", SSL off
 2   3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 16.0: server supports 2.0 to 3.0
 3   5 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "193.106.191.48", user "postgres", database "bbbbbbb", SSL off
 4   8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "128.14.141.42", user "gmcnkN", database "--help", SSL off
 5   8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "128.14.141.42", user "postgres", database "postgres", SSL off
 6   8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28P01 DETAIL: Connection matched pg_hba.conf line 7: "hostssl all all 0.0.0.0/0 md5 "
 7   8 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28P01 FATAL: password authentication failed for user "postgres"
 8  11 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 0.0: server supports 2.0 to 3.0
 9  11 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 255.255: server supports 2.0 to 3.0
10  11 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no PostgreSQL user name specified in startup packet
11  43 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: invalid length of startup packet
```

edit1: the few which we observed in August 2021:
```shell
(dandisets) dandi@drogon:/mnt/backup/dandi/heroku-logs/dandi-api$ grep -ih error 202108* | grep sql_error_code | sed -e 's,^.*+00:00 ,,g' | sed -e 's,postgres\.[0-9]*,postgres.XXX,g' -e 's,\(\[[-,0-9]*\]\),[X-XX],g' -e 's,\(([-,0-9]*)\),(X\,XX),g' | grep -v 'sql_error_code = 00000' | sort | uniq -c | sort -n | nl
 1   1 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: incomplete message from client
 2   1 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: incomplete startup packet
 3   1 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no pg_hba.conf entry for host "183.136.226.2", user "postgres", database "template0", SSL off
 4   2 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 08P01 LOG: invalid length of startup packet
 5   2 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 16.0: server supports 2.0 to 3.0
 6   3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 0.0: server supports 2.0 to 3.0
 7   3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 0A000 FATAL: unsupported frontend protocol 255.255: server supports 2.0 to 3.0
 8   3 app[postgres.XXX]: [DATABASE] [X-XX] sql_error_code = 28000 FATAL: no PostgreSQL user name specified in startup packet
```

IMHO someone with better knowledge of these systems should review/analyze them and report on whether they are all "benign" or some require attention/action.
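For reference, the grep/sed/uniq pipeline above can be expressed in Python roughly like this (a sketch; the regexes mirror the sed expressions used above):

```python
import re
from collections import Counter


def count_sql_errors(lines):
    """Python equivalent (sketch) of the shell pipeline: keep sql_error_code
    lines, normalize volatile fields, drop sql_error_code = 00000, count."""
    counts = Counter()
    for line in lines:
        if "sql_error_code" not in line:
            continue
        # Strip the timestamp prefix up to the "+00:00 " marker
        line = re.sub(r"^.*\+00:00 ", "", line)
        # Mask process ids and numeric bracket groups, like the sed expressions
        line = re.sub(r"postgres\.\d+", "postgres.XXX", line)
        line = re.sub(r"\[[-,0-9]+\]", "[X-XX]", line)
        if "sql_error_code = 00000" in line:
            continue
        counts[line.strip()] += 1
    return counts
```

This would make it easier to run the same summary over many monthly log archives without re-quoting the sed expressions each time.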