DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.
2 stars 0 forks source link

Problems at sites/storages after Rucio upgrade? #36

Closed Andrew-McNab-UK closed 1 year ago

Andrew-McNab-UK commented 1 year ago

Several storages which were ok last year have errors mentioning GFAL in the Automated Workflow Tests.

For example, running at CERN, the rucio upload to the LANCASTER RSE fails like this:

https://landscape.fnal.gov/lens/view/job/43729668.0@jobsub03.fnal.gov/

CERN LANCASTER https root://fal-pygrid-30.lancs.ac.uk//dpm/lancs.ac.uk/home/dune/testpro/d6/e1/awt-download-2023-01-13-01.txt 'xrdcp --force --nopbar --verbose root://fal-pygrid-30.lancs.ac.uk//dpm/lancs.ac.uk/home/dune/testpro/d6/e1/awt-download-2023-01-13-01.txt awt-download-2023-01-13-01.txt' returns 0 2023-01-30 14:36:16,123 DEBUG baseclient.py No trace_host passed. Using rucio_host instead 2023-01-30 14:36:16,123 DEBUG baseclient.py No creds passed. Trying to get it from the config file. 2023-01-30 14:36:16,123 DEBUG baseclient.py HTTPS is required, but no ca_cert was passed. Trying to get it from X509_CERT_DIR. 2023-01-30 14:36:16,123 DEBUG baseclient.py HTTPS is required, but no ca_cert was passed and X509_CERT_DIR is not defined. Trying to get it from the config file. 2023-01-30 14:36:16,123 DEBUG baseclient.py No account passed. Trying to get it from the RUCIO_ACCOUNT environment variable or the config file. 2023-01-30 14:36:16,123 DEBUG baseclient.py No VO passed. Trying to get it from environment variable RUCIO_VO. 2023-01-30 14:36:16,123 DEBUG baseclient.py No VO found. Trying to get it from the config file. 2023-01-30 14:36:16,123 DEBUG baseclient.py No VO found. Using default VO. 2023-01-30 14:36:16,123 DEBUG baseclient.py got token from file 2023-01-30 14:36:16,127 DEBUG uploadclient.py Num. of files that upload client is processing: 1 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/rses/?expression=LANCASTER 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden] 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP header: X-Rucio-VO: def 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP header: Connection: Keep-Alive 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/1.29.10 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::--verbose 2023-01-30 14:36:16,128 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro 2023-01-30 14:36:17,286 DEBUG baseclient.py HTTP Response: 200 OK 2023-01-30 14:36:17,904 DEBUG uploadclient.py Input validation done. 2023-01-30 14:36:17,904 INFO Preparing upload for file awt-1675089309-XRpTOQxuXf 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/rses/LANCASTER/attr/ 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden] 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP header: X-Rucio-VO: def 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP header: Connection: Keep-Alive 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/1.29.10 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::--verbose 2023-01-30 14:36:17,904 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro 2023-01-30 14:36:18,071 DEBUG baseclient.py HTTP Response: 200 OK 2023-01-30 14:36:18,071 DEBUG uploadclient.py wan domain is used for the upload 2023-01-30 14:36:18,071 DEBUG uploadclient.py Registering file 2023-01-30 14:36:18,071 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/accounts/dunepro/scopes/ 2023-01-30 14:36:18,072 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden] 2023-01-30 14:36:18,072 DEBUG baseclient.py HTTP header: X-Rucio-VO: def 2023-01-30 14:36:18,072 DEBUG baseclient.py HTTP header: Connection: Keep-Alive 2023-01-30 14:36:18,072 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/1.29.10 2023-01-30 14:36:18,072 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::--verbose 2023-01-30 14:36:18,072 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro 2023-01-30 14:36:18,232 DEBUG baseclient.py HTTP Response: 200 OK 2023-01-30 14:36:18,232 DEBUG uploadclient.py Skipping dataset registration 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP request: GET https://dune-rucio.fnal.gov/dids/testpro/awt-1675089309-XRpTOQxuXf/meta 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden] 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP header: X-Rucio-VO: def 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP header: Connection: Keep-Alive 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/1.29.10 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::--verbose 2023-01-30 14:36:18,233 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP Response: 404 NOT FOUND 2023-01-30 14:36:18,391 DEBUG baseclient.py Response text (length=129): [{"ExceptionClass": "DataIdentifierNotFound", "ExceptionMessage": "Data identifier 'testpro:awt-1675089309-XRpTOQxuXf' not found"}] 2023-01-30 14:36:18,391 DEBUG uploadclient.py File DID does not exist 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP request: POST https://dune-rucio.fnal.gov/replicas 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden] 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP header: X-Rucio-VO: def 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP header: Connection: Keep-Alive 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/1.29.10 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::--verbose 2023-01-30 14:36:18,391 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro 2023-01-30 14:36:18,391 DEBUG baseclient.py Request data (length=266): [{"rse": "LANCASTER", "files": [{"scope": "testpro", "name": "awt-1675089309-XRpTOQxuXf", "bytes": 26, "adler32": "624 ... ty": true}] 2023-01-30 14:36:18,605 DEBUG baseclient.py HTTP Response: 201 CREATED 2023-01-30 14:36:18,605 INFO Successfully added replica in Rucio catalogue at LANCASTER 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP request: POST https://dune-rucio.fnal.gov/rules/ 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP header: X-Rucio-Auth-Token: [hidden] 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP header: X-Rucio-VO: def 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP header: Connection: Keep-Alive 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP header: User-Agent: rucio-clients/1.29.10 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP header: X-Rucio-Script: rucio::--verbose 2023-01-30 14:36:18,606 DEBUG baseclient.py HTTP header: X-Rucio-Account: dunepro 2023-01-30 14:36:18,606 DEBUG baseclient.py Request data (length=445): [{"dids": [{"scope": "testpro", "name": "awt-1675089309-XRpTOQxuXf"}], "copies": 1, "rse_expression": "LANCASTER", "we ... ta": null}] 2023-01-30 14:36:18,875 DEBUG baseclient.py HTTP Response: 201 CREATED 2023-01-30 14:36:18,875 INFO Successfully added replication rule at LANCASTER 2023-01-30 14:36:18,887 DEBUG logging.py gfal.Default: connecting to storage 2023-01-30 14:36:18,924 DEBUG logging.py gfal.Default: checking if file exists None 2023-01-30 14:36:18,925 DEBUG rsemanager.py Checking if https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf exists 2023-01-30 14:36:18,925 DEBUG logging.py gfal.Default: checking if file exists https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf 2023-01-30 14:36:19,371 DEBUG logging.py gfal.Default: closing protocol connection 2023-01-30 14:36:19,372 DEBUG uploadclient.py [{'hostname': 'fal-pygrid-30.lancs.ac.uk', 'scheme': 'https', 'port': 0, 'prefix': '/dpm/lancs.ac.uk/home/dune', 'impl': 'rucio.rse.protocols.gfal.Default', 'domains': {'lan': {'read': 0, 'write': 0, 'delete': 0}, 'wan': {'read': 1, 'write': 1, 'delete': 1, 'third_party_copy_read': 1, 'third_party_copy_write': 1}}, 'extended_attributes': None}] 2023-01-30 14:36:19,372 INFO Trying upload with https to LANCASTER 2023-01-30 14:36:19,372 DEBUG uploadclient.py Processing upload with the domain: wan 2023-01-30 14:36:19,372 DEBUG logging.py gfal.Default: connecting to storage 2023-01-30 14:36:19,375 DEBUG uploadclient.py The PFN created from the LFN: https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf 2023-01-30 14:36:19,375 DEBUG logging.py gfal.Default: checking if file exists https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf 2023-01-30 14:36:19,709 DEBUG logging.py gfal.Default: checking if file exists https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf.rucio.upload 2023-01-30 14:36:19,848 DEBUG utils.py put: Attempt 1 2023-01-30 14:36:19,848 DEBUG logging.py gfal.Default: uploading file from awt-1675089309-XRpTOQxuXf to https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf.rucio.upload error in request of checking file QoS:  is not a valid HTTP or Webdav URL

(process:1479): GLib-WARNING **: 14:36:19.849: GError set over the top of a previous GError or uninitialized memory. This indicates a bug in someone's code. You must ensure an error is NULL before it's set. The overwriting error message was: \x01 is not a valid HTTP or Webdav URL 2023-01-30 14:36:20,160 DEBUG utils.py put: Attempt failed 1 2023-01-30 14:36:20,160 DEBUG utils.py Access to local destination denied. Details: Source file not found. Details: TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 404 : File not found 2023-01-30 14:36:20,160 DEBUG logging.py gfal.Default: uploading file from awt-1675089309-XRpTOQxuXf to https://fal-pygrid-30.lancs.ac.uk/dpm/lancs.ac.uk/home/dune/testpro/7f/b2/awt-1675089309-XRpTOQxuXf.rucio.upload error in request of checking file QoS:  is not a valid HTTP or Webdav URL

(process:1479): GLib-WARNING **: 14:36:20.161: GError set over the top of a previous GError or uninitialized memory. This indicates a bug in someone's code. You must ensure an error is NULL before it's set. The overwriting error message was: \x01 is not a valid HTTP or Webdav URL 2023-01-30 14:36:20,430 WARNING Upload attempt failed 2023-01-30 14:36:20,430 INFO Exception: RSE does not support requested operation. Details: Access to local destination denied. Details: Source file not found. Details: TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 404 : File not found Traceback (most recent call last): File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/rse/protocols/gfal.py", line 462, in __gfal2_copy ret = ctx.filecopy(params, str(src), str(dest)) gfal2.GError: TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 404 : File not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/rse/protocols/gfal.py", line 279, in put status = self.__gfal2_copy(str(source_url), str(target), None, space_token, transfer_timeout=transfer_timeout) File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/rse/protocols/gfal.py", line 470, in __gfal2_copy raise exception.SourceNotFound(error) rucio.common.exception.SourceNotFound: Source file not found. Details: TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 404 : File not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/client/uploadclient.py", line 649, in _upload_item retry(protocol_write.put, base_name, pfn_tmp, source_dir, transfer_timeout=transfer_timeout)(mtries=2, logger=logger) File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/common/utils.py", line 1575, in call return self.func(*self.args, **self.kwargs) File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/rse/protocols/gfal.py", line 285, in put raise exception.DestinationNotAccessible(str(error)) rucio.common.exception.DestinationNotAccessible: Access to local destination denied. Details: Source file not found. Details: TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 404 : File not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/client/uploadclient.py", line 269, in upload sign_service=sign_service) File "/cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/rucio/client/uploadclient.py", line 652, in _upload_item raise RSEOperationNotSupported(str(error)) rucio.common.exception.RSEOperationNotSupported: RSE does not support requested operation. Details: Access to local destination denied. Details: Source file not found. Details: TRANSFER ERROR: Copy failed (streamed). Last attempt: HTTP 404 : File not found /cvmfs/dune.opensciencegrid.org/products/dune/rucio/v1_29_10/NULL/lib/python3.6/site-packages/urllib3/connectionpool.py:1050: InsecureRequestWarning: Unverified HTTPS request is being made to host 'dune-rucio.fnal.gov'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/1.26.x/advanced-usage.html#ssl-warnings InsecureRequestWarning, 2023-01-30 14:36:20,917 ERROR Failed to upload file awt-1675089309-XRpTOQxuXf 2023-01-30 14:36:20,917 ERROR None of the given files have been uploaded. Completed in 4.7950 sec. 'rucio --verbose upload --rse LANCASTER --protocol https --scope testpro --lifetime 86400 --name awt-1675089309-XRpTOQxuXf awt-1675089309-XRpTOQxuXf' returns 79

dougbenjamin commented 1 year ago

Andrew, Which sites now fail under rucio 1.29 vs which sites used to work under rucio 1.26?

StevenCTimm commented 1 year ago

Filed GGUS 160778 to BNL starting with the BNL to BNL transfer

StevenCTimm commented 1 year ago

It may be that part of the reason AWT files are failing to download is that those uploads themselves are failing and the files are not there.

StevenCTimm commented 1 year ago

We have identified two big causes thus far, one or both of the voms dns not being updated, and the problem that for some cases we had to change to davs protocol from https. Neither of these explain the failures at RAL_ECHO or DUNE_CERN_EOS we continue to investigate there. The known VOMS problems and corresponding open GGUS tickets are tracked in #39 . There is one other issue which Andrew has adjusted for, namely doing one more unset of the GFAL config and library paths just before doing the rucio upload. This can be explained by the fact that the rucio v1_29 client as now in CVMFS now comes with its own gfal2 libraries and configuration settings such that Rucio upload can be run outside of singularity if needed.

StevenCTimm commented 1 year ago

Three big sites--(1) DUNE_CERN_EOS. (rucio upload is not making the directory) (2) NIKHEF likewise, (3) SURFSARA both upload and download fails.

StevenCTimm commented 1 year ago

INC3421404 filed at CERN

StevenCTimm commented 1 year ago

SURFSara ticket https://ggus.eu/index.php?mode=ticket_info&ticket_id=160690

StevenCTimm commented 1 year ago

We have individual issues open for the remaining sites, Nikhef, surfsara, and CCIN2P3 Closing this.