Closed: philotrum closed this issue 2 years ago
@LiamBindle wrote:
Over the last two weeks there were a few periods when the server was slow to respond (geos-chem #1024). The admins suspect it was related to an ongoing file system issue, but they say it should now be fixed. Aside from this and some trivial file permission mix-ups, the server is operating as expected.
Could you try again and note the HTTP codes of any failures? The HTTP codes should indicate why the downloads fail (e.g., 429 Too Many Requests, 408 Request Timeout, etc.). It's possible there are transient 4xx codes like 408, in which case adding some "retry after X seconds" logic to your download script would help (see the urllib.error docs here). Let me know what you find.
. . .
The data portal is still experiencing slow connection times. Essentially, it's taking a long time to establish the connection, but once the connection is established the downloads go pretty fast.
It's still a good idea to check the HTTP codes. Hopefully that can confirm it's 408s. The sysadmins are working on it, but it might take some time to resolve. They suspect it's caused by high load.
In the meantime, here are some suggestions to try (rough sketch below the list):
- Set timeout=60 or timeout=120 in urllib.request.urlopen (docs; I'm guessing the default value is too low for our current connection times)
- Check the returned HTTP codes. If 408, retry the download after X seconds.
- Reuse connections for multiple downloads (i.e., several downloads per connection)
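Something along these lines (just a sketch; the function name, the retriable codes, and the wait time are placeholders, not tested against the portal):

```python
import shutil
import time
import urllib.error
import urllib.request

def pull_file(full_url, local_filename, attempts=3, wait=30):
    """Download one file, retrying transient failures after a delay."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(full_url, timeout=60) as response, \
                    open(local_filename, 'wb') as out_file:
                shutil.copyfileobj(response, out_file)
            return
        except urllib.error.HTTPError as e:
            print(f"{full_url}: HTTP {e.code} ({e.reason}), attempt {attempt}")
            if e.code not in (408, 429):        # only retry codes that look transient
                raise
        except (urllib.error.URLError, TimeoutError) as e:
            print(f"{full_url}: {e}, attempt {attempt}")   # e.g. connection timed out
        if attempt < attempts:
            time.sleep(wait)                    # "retry after X seconds"
    raise RuntimeError(f"giving up on {full_url} after {attempts} attempts")
```

For the third point, note that urllib.request.urlopen opens a fresh connection for every call, so reusing connections would need something like http.client.HTTPConnection or a requests.Session instead.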
FYI, @LiamBindle has found that the WUSTL portal is slow to respond and will be for the next while. It's slow because the server is under-specced for the number of people using it. In the interim, we can suggest people use both the WUSTL and ComputeCanada servers. Now that people are using the WUSTL site heavily, the ComputeCanada server is faster to respond.
I ran a single test using a 60-second timeout in the code that pulls the data. The code is now:
with urllib.request.urlopen(full_url, timeout=60) as response, open(local_filename, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
The month's data came down with no failures in 9 minutes 46 seconds. Since more than one variable is in play, I am not sure whether this is due to low traffic at the time of the test, the timeout, or both. I will see how things go as I do more downloads.
Given the information above I might do a comparison between ComputeCanada and WUSTL and see what that looks like.
Thanks for your assistance with this.
I ran a test comparing the WUSTL and ComputeCanada download speeds. My script pulled all files for 202106 from WUSTL without error in 3 minutes 40 seconds, and all files for 202107 from ComputeCanada without error in 4 minutes 7 seconds. I think the speed difference is in the noise, and that how busy the server happens to be is the primary variable. I have not had a download error since adding the timeout to the download request; I guess that could be down to how busy the server is, or the timeout.
Thanks for testing @philotrum!
Also covered in #1024
Going to close this issue now. Let us know if you are still encountering problems with this.
What institution are you from?
University of Wollongong
Description of the problem
I have written a Python script to pull data down from http://geoschemdata.wustl.edu/ExtData/GEOS_2x2.5/MERRA2/. It pulls a month's data when it is run. I have tested the script on a Linux box running on the UoW campus quite a few times and am now testing it on the NCI gadi supercomputer in Australia. I am having a lot of files fail to copy when running it on both computers. The code I am using to do the file copies is:
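with urllib.request.urlopen(full_url) as response, open(local_filename, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)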
full_url is the URL of the file to be copied; local_filename is the file's destination on the local computer.
I am running threads to pull up to 11 files in parallel to reduce the copy time.
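Roughly, the parallel part looks like this (a simplified sketch rather than the exact script; download_file and download_month are placeholder names):

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download_file(full_url, local_filename):
    # same two-line copy as above
    with urllib.request.urlopen(full_url) as response, open(local_filename, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)

def download_month(url_pairs):
    # url_pairs: list of (full_url, local_filename) tuples for the month
    with ThreadPoolExecutor(max_workers=11) as pool:
        futures = [pool.submit(download_file, url, dest) for url, dest in url_pairs]
        for future in futures:
            future.result()   # re-raises any exception from a worker thread
```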
The code successfully pulls each file on the first attempt most of the time, but the first attempt fails roughly 20% of the time (a guess).
Description of troubleshooting performed
The script retries the file copy after a 3-second delay, up to 3 times, if the copy fails; most of the time I get the file on the second attempt if the first attempt fails. Some files still fail to download after 3 attempts, though. It fails to retrieve 0 to ~20 files each time I run the script for the first time for a given month. I can run the script again if required and it will attempt to download any files missing from the local repository. I have been able to pull all the files this way, but I have had to run the script up to 3 times to get all the data.
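The re-run simply skips anything that is already on disk, roughly like this (simplified; still_missing is a placeholder name):

```python
import os

def still_missing(url_pairs):
    # keep only the (full_url, local_filename) pairs not yet in the local repository
    return [(url, dest) for url, dest in url_pairs if not os.path.exists(dest)]
```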
GEOS-Chem version
N/A
Description of modifications
N/A
Log files
FileCopy output.txt
Software versions
N/A