Closed: @yarikoptic closed this issue 1 year ago
@yarikoptic What kind of "explicit control"?
Eg a max number of open http connections etc
@yarikoptic Do you mean setting that for the process with some sort of system call or via some functionality of the httpx library (which I don't think it has)?
I don't know how or what. I just see a problem: we are not getting responses for some requests for some reason, and I hypothesized that it might be due to the number of connections. It is probably not the reason, since at the moment that process is not that busy establishing outgoing connections. Interestingly, there is nothing else logged about a sample "problematic" asset:
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets/.git/dandi/backups2datalad$ grep 00cf862d-a9f8-4f8e-949a-a33a738e2cde 2022.11.10.16.42.36Z.log
2022-11-10T14:48:21-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 0.961349 seconds as it raised ConnectTimeout:
2022-11-10T14:48:40-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 1.987405 seconds as it raised ConnectTimeout:
2022-11-10T14:48:59-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 4.089250 seconds as it raised ConnectTimeout:
2022-11-10T14:49:21-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 7.860070 seconds as it raised ConnectTimeout:
2022-11-10T14:49:48-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 15.363501 seconds as it raised ConnectTimeout:
2022-11-10T14:50:24-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 32.712638 seconds as it raised ConnectTimeout:
2022-11-10T14:51:15-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 63.515849 seconds as it raised ConnectTimeout:
2022-11-10T14:52:39-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 125.433938 seconds as it raised ConnectTimeout:
2022-11-10T14:55:07-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/00cf862d-a9f8-4f8e-949a-a33a738e2cde/info/ in 256.114759 seconds as it raised ConnectTimeout:
So only the question remains -- why is it timing out, and how could it be solved?
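For reference, the delays in the log above roughly double on each attempt with some jitter, consistent with the waits = exp_wait(attempts=15, base=2) call in aioutil.py. A hypothetical sketch of such a generator (the actual exp_wait may differ in detail):

```python
import random
from collections.abc import Iterator

def exp_wait(attempts: int, base: float = 2.0) -> Iterator[float]:
    """Yield exponentially growing retry delays with ~20% random jitter.

    Hypothetical reconstruction; the real exp_wait() in
    tools/backups2datalad/aioutil.py may differ.
    """
    for i in range(attempts):
        # base ** i scaled by a random factor, matching the
        # ~1, 2, 4, 8, ... second pattern seen in the log
        yield (base ** i) * random.uniform(0.8, 1.2)
```

With attempts=15 and base=2, the later delays grow into the thousands of seconds, so a single asset can keep retrying for hours before the request finally gives up.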
@yarikoptic
That process finally errored out... Please check its log for possible hints in relation to what you found / what to investigate. I hard reset 000108 and restarted with --workers 2.
@yarikoptic All I can see in the logs is that the logging added in #294 shows that the code at that point is encountering a CancelledError, which means that some other task in the task group raised an unhandled exception. All of the tracebacks are either CancelledErrors or AssertionError: feed_data after feed_eof, with no indication as to the original error that started everything.
FTR,
time python -m tools.backups2datalad -l INFO --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 2 000108
finished just fine in 1004m18.644s, and the state of 000108 is 6017a827015b20b1eb7b792944db33933c208e6a (from Nov 12). Rerunning again to see the timing and whether there is any change.
Did another run, which took 1006m18.585s (real) and at the end resulted in
commit ad0bf0d5689af1bb568cdf49d395c08c263cc177 (HEAD -> draft, github/draft, github/HEAD)
Author: DANDI User <info@dandiarchive.org>
Date: Mon Nov 14 18:53:21 2022 +0000
[backups2datalad] 1 file updated
.dandi/assets-state.json | 2 +-
.dandi/assets.json | 566 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
dandiset.yaml | 2 +-
sub-MITU01/ses-20220322h10m15s10/micr/sub-MITU01_ses-20220322h10m15s10_sample-15_stain-NN_run-1_chunk-5_SPIM.ome.zarr | 2 +-
4 files changed, 286 insertions(+), 286 deletions(-)
Ha -- I found an uncommitted change:
(dandisets) dandi@drogon:/mnt/backup/dandi/dandisets$ git diff -- tools
diff --git a/tools/backups2datalad/aioutil.py b/tools/backups2datalad/aioutil.py
index c1ea685..a47eec1 100644
--- a/tools/backups2datalad/aioutil.py
+++ b/tools/backups2datalad/aioutil.py
@@ -117,6 +117,9 @@ async def arequest(
**kwargs: Any,
) -> httpx.Response:
waits = exp_wait(attempts=15, base=2)
+ # custom timeout if was not specified to wait longer in hope to overcome
+ # https://github.com/dandi/dandisets/issues/298 and alike
+ kwargs.setdefault('timeout', 60)
while True:
try:
r = await client.request(method, url, follow_redirects=True, **kwargs)
which apparently has been there and in use since
-rw-r--r-- 1 dandi dandi 11375 Nov 12 17:50 tools/backups2datalad/aioutil.py
and I think I have not seen any ConnectTimeout since then; the last one among the emails was Nov 11th. I will stash this change for now. And if we get one again -- we should apply this diff.
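Note that the diff uses kwargs.setdefault rather than plain assignment: it supplies the 60-second timeout only when the caller did not pass one explicitly, so an explicit timeout= from a call site still wins. A minimal illustration of that behavior (plain dict, no httpx involved; httpx's own built-in default timeout is 5 seconds, which is why slow connection setups were hitting ConnectTimeout so quickly):

```python
def arequest_kwargs(**kwargs):
    """Mimic the kwargs handling from the stashed diff: default the
    timeout to 60s only if the caller did not specify one."""
    kwargs.setdefault("timeout", 60)
    return kwargs

# Caller passed nothing: the 60s default applies.
print(arequest_kwargs())            # {'timeout': 60}
# Caller passed an explicit timeout: setdefault leaves it alone.
print(arequest_kwargs(timeout=5))   # {'timeout': 5}
```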
Got to try a backup of 000108 again -- the log is full of
action summary:
publish (ok: 2)
2023-01-30T19:27:39-0500 [WARNING ] backups2datalad: Retrying PATCH request to https://api.github.com/repos/dandizarrs/20acbea4-af85-4858-9262-1233919950af in 1.021389 seconds as it raised ConnectTimeout:
2023-01-30T19:27:50-0500 [WARNING ] backups2datalad: Retrying PATCH request to https://api.github.com/repos/dandizarrs/20acbea4-af85-4858-9262-1233919950af in 2.072617 seconds as it raised ConnectTimeout:
2023-01-30T19:28:02-0500 [WARNING ] backups2datalad: Retrying PATCH request to https://api.github.com/repos/dandizarrs/20acbea4-af85-4858-9262-1233919950af in 4.050783 seconds as it raised ConnectTimeout:
2023-01-30T19:28:17-0500 [WARNING ] backups2datalad: Retrying PATCH request to https://api.github.com/repos/dandizarrs/20acbea4-af85-4858-9262-1233919950af in 8.262469 seconds as it raised ConnectTimeout:
2023-01-30T19:39:03-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 0.964212 seconds as it raised ConnectTimeout:
2023-01-30T19:39:16-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 2.056938 seconds as it raised ConnectTimeout:
2023-01-30T19:39:32-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 3.979117 seconds as it raised ConnectTimeout:
2023-01-30T19:39:47-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 7.958466 seconds as it raised ConnectTimeout:
2023-01-30T19:40:07-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 16.521691 seconds as it raised ConnectTimeout:
2023-01-30T19:40:34-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 30.588195 seconds as it raised ConnectTimeout:
2023-01-30T19:41:13-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 64.702269 seconds as it raised ConnectTimeout:
2023-01-30T19:42:32-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 124.615021 seconds as it raised ConnectTimeout:
2023-01-30T19:44:48-0500 [WARNING ] backups2datalad: Retrying GET request to /assets/ebbc8771-b6b0-4282-b0b8-6318dcf1b031/info/ in 263.498779 seconds as it raised ConnectTimeout:
So we should indeed try that stashed diff next (after this run eventually fails overall, I guess).
FTR, got some dirty repositories indeed, and dirty with a really good number of uncommitted changes, e.g.
(dandisets) dandi@drogon:/mnt/backup/dandi/dandizarrs$ grep dirty /tmp/dirty-zarrs-6.log
./bd2693a8-556b-47f7-802c-1636d1837152 is dirty
./7675cdb3-de71-406a-af56-796eff34d1c4 is dirty
./19fef967-5afc-47d5-8538-d17be80da154 is dirty
./93bcbcb1-561d-487e-83d4-efc4d5ba0c6a is dirty
./5a0d5fbf-d061-447b-a020-2e9f8e58ed7e is dirty
./ca44da08-c398-4536-abe1-8a258d7898bc is dirty
./a77e1f73-36f5-4321-b1cc-ed27349fd461 is dirty
./79760495-3796-456d-a0a6-a626c8681723 is dirty
./8054d554-4fdc-4dd4-932c-8f2fc15076a0 is dirty
./48d7ec90-ccc7-4254-afe9-cf5bbdde88ad is dirty
./ae0337af-6aa4-4bac-82c4-d70ac01fc7c5 is dirty
./9b8c48d0-053a-49ef-aca0-fcfefe3ebb27 is dirty
./bb4b1d2d-1fbc-46e8-b112-e528e704178a is dirty
./8d1a0243-46ff-4e6c-88d9-ca1bfe8c7809 is dirty
./c2266d0b-04eb-4401-bd60-d93ff067b7ae is dirty
./941bcd8d-00a2-455a-bd85-ddf6e52cff55 is dirty
./5c984d01-22e2-4cdb-8858-881c7f8ef1a6 is dirty
./873100dd-7816-40ba-ac3e-ce6982ecd520 is dirty
./165bd81a-4f55-4b08-9843-caa9551dc7f0 is dirty
./48c62eba-c1bf-4a4e-85d1-d8fb438b8dae is dirty
./5d26e7d2-daa8-4284-90fd-ff7cbdfc3b26 is dirty
./56f7c084-2e47-4f72-9de7-c196ae6342eb is dirty
./ae8b13cc-8da2-4242-be38-311a2782e733 is dirty
./67b00725-f0a4-4e9f-9cf9-38972d5a0728 is dirty
./3db9d18e-7cf5-483c-8f77-cb64db873001 is dirty
./65a67a8b-a04a-4cdc-9e90-0b5cd85e5518 is dirty
(dandisets) dandi@drogon:/mnt/backup/dandi/dandizarrs$ git -C ./bd2693a8-556b-47f7-802c-1636d1837152 status | head
On branch draft
Your branch is up to date with 'github/draft'.
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
modified: .zattrs
new file: 0/0/0/0/0/0
new file: 0/0/0/0/0/1
new file: 0/0/0/0/0/10
new file: 0/0/0/0/0/100
Will be hard resetting now...
Let's consider that cb43b85b717035e812ab1e6f8ae046dc4eb6b4d1 mitigated it -- I have not seen those errors during the last runs.
We observe these timeouts, although if I run a simple curl command from the same box, we get a response just fine right away. So it is likely specific to our process; maybe it saturates the per-process pool of connections or something like that, and we had better add explicit control for that. WDYT @jwodder ?