cernopendata / cernopendata-client

CERN Open Data command-line client
http://cernopendata-client.readthedocs.io/
GNU General Public License v3.0
10 stars 9 forks source link

downloader: catch more error situations #98

Open tiborsimko opened 3 years ago

tiborsimko commented 3 years ago

Now that we are catching downloading errors via --retry-limit and --retry-sleep parameters thanks to #91, we can build upon this basis and enrich the error handling logic to catch more errors (such as Error 500 received from server, etc).

Example how to reproduce:

  1. Run locally a CERN Open Data instance:
$ cd opendata.cern.ch
$ docker-compose build
$ docker-compose up
$ docker exec -i -t opendatacernch_web_1 ./scripts/populate-instance.sh --skip-records
$ docker exec -i -t opendatacernch_web_1 cernopendata fixtures records --mode insert-or-replace -f cernopendata/modules/fixtures/data/records/cms-tools-higgsexample20112012.json
$ firefox http://localhost/record/5500
  1. Launch downloading of files with big enough retry sleep time:
$ cernopendata-client download-files --recid 5500 --verify  --server http://localhost --retry-sleep 30
  1. While the download is running, bring down the web site.

Current behaviour of the client is to output Python traceback:

==> Verifying file demoanalyzer_cfg_level4data.py...
  -> Expected size 3821, found 3821
  -> Expected checksum adler32:177b49c0, found adler32:177b49c0
==> Downloading file 10 of 11
  -> File: ./5500/mass4l_combine.pdf
Traceback (most recent call last):
  File ".../bin/cernopendata-client", line 8, in <module>
    sys.exit(cernopendata_client())
  File ".../lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File ".../lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File ".../lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File ".../lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File ".../lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File ".../lib/python3.8/site-packages/cernopendata_client/cli.py", line 337, in download_files
    download_single_file(path=path, file_location=file_location, protocol=protocol)
  File ".../lib/python3.8/site-packages/cernopendata_client/downloader.py", line 109, in download_single_file
    c.perform()
pycurl.error: (52, 'Empty reply from server')

Expected behaviour is to:

  1. The same test can be performed not only when the server is down, but simulating various responses from the server, such as Error 500, Error 206, etc. This can be done easily by editing Nginx configuration (or the Flask app) to return various kinds of errors.
tiborsimko commented 3 years ago

Example of "native" retry tools:

Haven't looked at what pycurl/xrootd may have built-in. We could always improve our own for _retry in range loop not to check only fro error = ( DOWNLOAD_ERROR_PAGE[..., but turn it into a more complete function catching other exceptional situations.

tiborsimko commented 3 years ago

Example: I just launch two commands quickly after each other, and I saw:

$ wget http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
--2020-11-19 11:56:39--  http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:80... connected.
HTTP request sent, awaiting response... 403 Forbidden
2020-11-19 11:56:40 ERROR 403: Forbidden.

$ $ wget http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
--2020-11-19 11:56:44--  http://localhost/eos/opendata/atlas/OutreachDatasets/2020-01-22/GamGam.zip
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1594567234 (1.5G) [application/zip]
Saving to: ‘GamGam.zip’

The 403 is a rather frequent error by the underlying reverse proxy that we also worked around in #91 via size and checksum.

After we introduce exit status based retry, we could probably remove that check-error-page-size-and-checksum hack.

tiborsimko commented 3 years ago

Example: 206 Partial Content situation occurred several times while downloading one big file via wget:

$ wget http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
--2020-06-24 14:53:19--  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Resolving opendata.cern.ch (opendata.cern.ch)... 188.184.93.89, 188.185.82.144, 188.184.28.138
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4011567937 (3.7G) [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         26%[=========================================>                                                                                                                    ]   1.01G  7.82MB/s    in 87s

2020-06-24 14:54:46 (11.9 MB/s) - Connection closed at byte 1081769706. Retrying.

--2020-06-24 14:54:47--  (try: 2)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 2929798231 (2.7G) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         29%[++++++++++++++++++++++++++++++++++++++++++===>                                                                                                                ]   1.09G  11.5MB/s    in 6.7s

2020-06-24 14:54:54 (12.6 MB/s) - Connection closed at byte 1169634135. Retrying.

--2020-06-24 14:54:56--  (try: 3)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 2841933802 (2.6G) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         56%[++++++++++++++++++++++++++++++++++++++++++++++=========================================>                                                                      ]   2.09G  15.1MB/s    in 88s

2020-06-24 14:56:24 (11.7 MB/s) - Connection closed at byte 2248454684. Retrying.

--2020-06-24 14:56:27--  (try: 4)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 1763113253 (1.6G) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         82%[++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++==========================================>                           ]   3.10G  13.1MB/s    in 82s

2020-06-24 14:57:49 (12.6 MB/s) - Connection closed at byte 3328930017. Retrying.

--2020-06-24 14:57:53--  (try: 5)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 682637920 (651M) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                         85%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++===>                       ]   3.20G  6.88MB/s    in 10s

2020-06-24 14:58:04 (9.83 MB/s) - Connection closed at byte 3433695865. Retrying.

--2020-06-24 14:58:09--  (try: 6)  http://opendata.cern.ch/eos/opendata/cms/MonteCarlo2011/Summer11LegDR/BuToKstarMuMuV2_EtaPtFilter_7TeV-pythia6-evtgen/AODSIM/PU_S13_START53_LV6-v1/00000/000043B3-0E0B-E411-B2FE-0026189438E9.root
Connecting to opendata.cern.ch (opendata.cern.ch)|188.184.93.89|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 4011567937 (3.7G), 577872072 (551M) remaining [application/octet-stream]
Saving to: ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’

000043B3-0E0B-E411-B2FE-0026189438E9.root.1                        100%[+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++======================>]   3.74G  14.6MB/s    in 36s

2020-06-24 14:58:44 (15.5 MB/s) - ‘000043B3-0E0B-E411-B2FE-0026189438E9.root.1’ saved [4011567937/4011567937]
ParthS007 commented 3 years ago

Musing after researching on the Issue:

There are two things we need to address

Every library we use(pycurl/requests/xrootdpyfs) has their own way of handling these two things.

  1. pycurl

    • Error handling:
      • SO Answer(Basic error handling)
      • Using pycurl.error and getinfo for different HTTP Codes
    • Retry:
      • Add a retry loop on WRITEDATA
      • Using pycurl.TIMEOUT and/or pycurl.CONNECTTIMEOUT or something similar(haven't tested yet)
  2. requests

    • Error handling
      • Add general try exception handling
      • Display messages according to the error
    • Retry
      • Add retry loop on requests.get
      • Use HTTPAdapter and retry from requests library - Blog link
  3. xrootdpyfs

    • Error handling
      • Add general try exception handling
      • Display messages according to error
    • Retry

Please let me know if I can break those steps as well!