Closed rly closed 2 years ago
It turns out that re-running the test makes the old error log difficult to find so I have copied it here:
This occurred within the last 24 hours? What channel are you downloading the test file from?
We had a power transformer explode in front of the building on Thursday afternoon and had to bring down the cluster and all workstations here as a precautionary measure. For at least some channels, that meant there were no longer any peers online that actually had the files, until around 11 AM EDT yesterday when we brought things back up.
Ryan, can you confirm whether the tests in question are downloading files provided by a peer you control? If possible it’d be nice to eliminate remote server status as a variable for the test in the future.
Thanks for the quick response, @jsoules! The timeout errors occurred around 1 pm and 8 pm ET on Friday. I believe Jeremy set up this test initially and did not configure it to pull the file from a particular channel, so I assume it is being hosted only at the Flatiron Institute. The error has not returned in the last three nightly CI builds, so it was probably a temporary issue related to the Flatiron cluster shutdown.
In any case, I agree, the file should be hosted by a kachery server that we control. I'll look into that.
I believe this has been fixed.
In the CI test suite, we use kachery to download a test file https://github.com/LorenFrankLab/nwb_datajoint/blob/bd2e0e6be22b5c52cec020eaee3d983376501c66/tests/test_1.py#L18-L20
This was working fine for weeks, but in the last 24 hours the test has failed a couple of times, seemingly at random, due to a `TimeoutError` when downloading the file using kachery > urllib. Re-running the test often resolves the issue.
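Since re-running usually clears the error, one possible stopgap is to retry the download inside the test itself. The following is only a minimal sketch: `retry_on_timeout` is a hypothetical helper, not part of kachery, and it assumes the failure surfaces as a plain `TimeoutError` as reported above. The download is passed in as a zero-argument callable, so no particular kachery call is assumed.

```python
import time


def retry_on_timeout(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TimeoutError, wait and try again with exponential backoff.

    Hypothetical helper for illustration. `func` is any zero-argument callable
    that performs the download. The final TimeoutError is re-raised so the test
    still fails if every attempt times out.
    """
    for attempt in range(attempts):
        try:
            return func()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the test fail as before
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

In the test, the existing download call would be wrapped in a `lambda` and passed as `func`. This only hides transient outages; it does not address where the file is hosted.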
@jsoules do you have any ideas of what might be wrong? Is the kachery server intermittently down such that the CI gets a timeout? I have tried running this same code to download the test file on my local machine, hundreds of times in a for loop, and have not been able to reproduce the timeout error.
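For reference, the local reproduction attempt was along these lines. This is a rough sketch rather than the exact script: `count_timeouts` is a hypothetical name, and the download is again passed in as a zero-argument callable so that no specific kachery API is assumed. It also records the slowest successful call, which may help show whether downloads come close to the timeout without exceeding it.

```python
import time


def count_timeouts(func, runs=100):
    """Call func() `runs` times.

    Returns (number of calls that raised TimeoutError,
             duration in seconds of the slowest successful call).
    Hypothetical helper for illustration only.
    """
    timeouts = 0
    slowest = 0.0
    for _ in range(runs):
        start = time.monotonic()
        try:
            func()
        except TimeoutError:
            timeouts += 1
            continue  # do not time failed calls
        slowest = max(slowest, time.monotonic() - start)
    return timeouts, slowest
```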