alja / xrootd

600 jobs on broken hadoop system (origin xrootd.t2.ucsd.edu) #69

Open alja opened 9 years ago

alja commented 9 years ago

10% of jobs failed at startup (XRD_REQUESTTIMEOUT was set to 600). Traffic was as expected from 534 jobs.

[Plots: uclhc-dsk, uclhc-net]

alja commented 9 years ago

1% of jobs need 30 min to exit.

During that delay there are failed reads. I'm missing logs from the xrdfragcp jobs, so I don't know why they did not exit on error.

alja commented 9 years ago

Types of errors

Failures occur on kXR_read and kXR_open. Possible additional info is "operation not permitted" or "operation expired".

There are also "socket timeout" and "resource unavailable" errors, which seem to be recoverable.
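
A minimal sketch of how these error types could be tallied from job logs. The substrings are taken from the wording in this comment, not from the exact client log format, so they are assumptions and would likely need adjusting. The function names `classify` and `summarize` and the two category labels are hypothetical.

```python
# Substrings copied from the comment above; real log wording may differ.
REQUEST_FAILURE_HINTS = ("operation not permitted", "operation expired")
SEEMS_RECOVERABLE_HINTS = ("socket timeout", "resource unavailable")


def classify(message: str) -> str:
    """Return 'seems_recoverable', 'request_failure', or 'unknown'."""
    text = message.lower()
    if any(hint in text for hint in SEEMS_RECOVERABLE_HINTS):
        return "seems_recoverable"
    if any(hint in text for hint in REQUEST_FAILURE_HINTS):
        return "request_failure"
    return "unknown"


def summarize(messages):
    """Count how many messages fall into each category."""
    counts = {"seems_recoverable": 0, "request_failure": 0, "unknown": 0}
    for message in messages:
        counts[classify(message)] += 1
    return counts
```

Running `summarize` over the error lines of each job would show whether the jobs that take 30 min to exit are dominated by one category.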