dmwm / PHEDEX

CMS data-placement suite
8 stars 18 forks source link

Better logging when FTS fails to submit. #1015

Closed PerilousApricot closed 8 years ago

PerilousApricot commented 8 years ago

I saw an old issue (#405) with the same report, which was reported as fixed. I'm unsure of the etiquette in talking about a previously closed issue, so if you would like for me to continue there, please let me know.

We seem to be getting a whole ton of errors with the detail log Could not submit to FTS . An example transfer log can be found here: https://gist.github.com/PerilousApricot/b60f8d6b229c1daf9ded

A first, minor point - could the perl hash be formatted with Data::Dumper? One long long line is difficult to walk through.

The bigger problem is that it doesn't actually dump the stdout/err from what I can see. Logging in as the phedex user, choosing a filename off the recent errors and manually submitting a glite-transfer-submit using the phedex proxy works. I don't know what's happening with PhEDEx that's different, or even where to start looking since the command output appears to be lost.

Would it be possible to add more verbose logging to the FTS backend? At the very least, the "-v" option might help enlighten things.

nikmagini commented 8 years ago

Hi Andrew,

When the glite-transfer-submit command really exits with an error, the full error message is indeed reported to the DB and displayed in the Transfer Log. See for example here:

https://cmsweb.cern.ch/phedex/prod/Activity::ErrorInfo?tofilter=T2_PL&fromfilter=.*&report_code=.*&xfer_code=.*&to_pfn=B2BC083A-B516-E511-B7E4-0025904C6564&from_pfn=.*&log_detail=Could+not+submit&log_validate=.*&.submit=Update#

Failed to determine the interface version of the service: getInterfaceVersion: SOAP fault: SOAP-ENV:Client - CGSI-gSOAP running on cms-vo.polgrid.pl reports Could NOT load client credentials GSS Major Status: General failure GSS Minor Status Error Chain: globus_gsi_gssapi: Error with GSI credential globus_gsi_gssapi: Error with gss credential handle globus_credential: Error with credential: The proxy credential: /home/phedex/gridcert/proxy.cert with subject: /C=PL/O=GRID/O=ICM/CN=Krzysztof Nawrocki/CN=proxy expired 31661 minutes ago.

What you see is actually an entirely different problem however...

In the case you reported, the glite-transfer-submit actually succeeds but the FTS backend is unable to determine the jobID from parsing the glite-transfer-submit stdout. The FTS backend prints in the "RAWOUTPUT" section what it thinks the stdout is. If you look closely at the log, you will see that the following stdout is printed:

Source: srm://se.hepgrid.uerj.br:8443/srm/v2/server?SFN=/mnt/hadoop/cms/store/data/Run2015D/HcalHPDNoise/MINIAOD/PromptReco-v3/000/256/840/00000/D424D498-8E60-E511-91F2-02163E011B5E.rooa76e3b79-4eba-418d-a6c5-a221c8ef11f4

This is clearly a jobID from the glite-transfer-submit stdout, mangled with part of the output of some glite-transfer-status command. I'm well aware of this long-standing issue with the FileDownload agent; with some painful debugging I was able to track it down to a probable bug in the underlying libraries used for cooperative multitasking, unfortunately we were unable to fix it, I'll continue to look into it:

https://github.com/dmwm/PHEDEX/issues/915

Fortunately it's not a huge problem because PhEDEx will simply reschedule the transfer later, and probably auto-succeed.

Closing this ticket as duplicate.

Thanks for the additional report! Cheers Nicolo'

P.S. I'll file a separate request for formatting nicely the job with Data::Dumper