cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Triton fallback server failure in workflow 10805.31 step 3 #37767

Open · makortel opened this issue 2 years ago

makortel commented 2 years ago

(moving from https://github.com/cms-sw/cmssw/issues/37738#issuecomment-1114455507)

The workflow 10805.31 step 3 fails with

Starting python2 /data/cmsbld/jenkins/workspace/ib-run-relvals/cms-bot/monitor_workflow.py timeout --signal SIGTERM 9000 cmsRun -j JobReport3.xml step3_RAW2DIGI_L1Reco_RECO_RECOSIM_PAT_VALIDATION_DQM.py
sh: nvidia-smi: command not found
%MSG-i ThreadStreamSetup: (NoModuleName) 02-May-2022 01:20:31 CEST pre-events
setting # threads 4
setting # streams 4
%MSG
02-May-2022 01:20:48 CEST Initiating request to open file file:step2.root
02-May-2022 01:20:50 CEST Successfully opened file file:step2.root
2022-05-02 01:21:27.078436: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
----- Begin Fatal Exception 02-May-2022 01:25:37 CEST-----------------------
An exception of category 'FallbackFailed' occurred while
[0] Calling beginJob
Exception Message:
Starting the fallback server failed with exit code 256

The failure has been present since the CMSSW_12_4 IB 2022-04-29-2300.

cmsbuild commented 2 years ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 2 years ago

assign heterogeneous

cmsbuild commented 2 years ago

New categories assigned: heterogeneous

@fwyzard, @makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 2 years ago

@kpedro88 Can you take a look? Is "sh: nvidia-smi: command not found" a smoking gun?

kpedro88 commented 2 years ago

No, that just means that GPUs aren't available and the fallback server will use the CPU. (It comes from https://github.com/cms-sw/cmssw/blob/master/HeterogeneousCore/SonicTriton/python/TritonService_cff.py#L11).
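
For illustration, a probe of roughly this shape (a Python sketch, not the actual contents of TritonService_cff.py; the function name is made up) is what produces that message on GPU-less hosts:

import os

def gpu_available():
    # os.system() goes through the shell, so a missing nvidia-smi binary makes
    # the shell print "sh: nvidia-smi: command not found" on stderr, exactly as
    # seen at the top of the job log; a nonzero status then means "no GPU".
    return os.system("nvidia-smi -L > /dev/null") == 0

use_gpu = gpu_available()  # False on this node, so the fallback server runs on CPU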

kpedro88 commented 2 years ago

I only see this failure occurring in ppc/aarch IBs. In #35328, I disabled the SonicTriton unit tests for non-amd64 architectures, because they require a different container that we do not yet have available.

kpedro88 commented 2 years ago

Ah, those failures are the DRN unit test in 12_3_X (I had the wrong tab open in the IB dashboard). The linked comment does show the workflow failing in the IB matrix tests. @ssrothman can you take a look?

Dr15Jones commented 2 years ago

These RelVals are failing on x86:

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/el8_amd64_gcc10/CMSSW_12_4_X_2022-05-01-2300/pyRelValMatrixLogs/run/10805.31_SingleGammaPt35+2018_photonDRN+SingleGammaPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt35+2018_photonDRN+SingleGammaPt35_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/

ssrothman commented 2 years ago

The workflow runs successfully on the lpc interactive nodes (and, if I recall correctly, it ran without problems in the PR tests). I'm not sure what could be causing it to fail in the IB.

It looks like the failure occurs when spinning up the Triton fallback server, which suggests the problem is in the SONIC service rather than in the photonDRN producer. @kpedro88 what is exit code 256? Why might starting the fallback server fail?

Dr15Jones commented 2 years ago

Exit code 256 is a standard unix exit code for when the executable cannot be found.
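
(If the 256 quoted in the message is the raw wait status from the underlying system call rather than a plain shell exit code, it would correspond to an ordinary exit code of 1. A purely illustrative check in Python, assuming a Unix wait-status encoding:)

import os

raw_status = 256                    # value quoted in the exception message
print(os.WIFEXITED(raw_status))     # True: the child exited normally (not killed by a signal)
print(os.WEXITSTATUS(raw_status))   # 1: the actual exit code of the launcher process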

ssrothman commented 2 years ago

So maybe the IB relval test cannot find the Triton executable/container with which to start up the fallback server?

kpedro88 commented 2 years ago

Yes, I also tried running it on the lpc interactive nodes and was successful. @smuzaffar can you let one of us log into an IB test machine to try it there?

smuzaffar commented 2 years ago

I think the error is sporadic: for some IBs it is 10805.31 and for others it is 10804.31. Both of these workflows ran successfully in today's IB, CMSSW_12_4 2022-05-02-1100.

smuzaffar commented 2 years ago

@kpedro88, try running on the cmsdev nodes.

kpedro88 commented 2 years ago

@Dr15Jones does your change in https://github.com/cms-sw/cmssw/pull/37764/files#diff-8899bb666d79547ed027cd8f543a4debcff63d07dde5211a87ca11527140adf3R269-R273 ensure that the output of the container launching script cmsTriton will be printed when an error occurs? That will probably make it easier to debug this in the future.

We could put this info directly into the exception message. I could also try to print the contents of the server logfile in the exception message.
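
A minimal Python sketch of the pattern being discussed (the real TritonService is C++ and reports through MessageLogger; the command string and function name here are illustrative): capture the launcher output, surface it on failure, and keep the exception message itself short.

import subprocess

def start_fallback_server(cmd="cmsTriton start"):   # command string is illustrative only
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the (short) launcher output for debugging ...
        print("fallback server launcher output:\n" + proc.stdout + proc.stderr)
        # ... but keep the exception message compact, as in the current behavior.
        raise RuntimeError("Starting the fallback server failed with exit code %d"
                           % proc.returncode)
    return proc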

Dr15Jones commented 2 years ago

I funnel the output of the execSys call to LogError so it will be printed. I didn't add it to the exception since I didn't know how big it would be.

kpedro88 commented 2 years ago

The output of the command execution should only be a few lines. The server log will be a lot longer.

kpedro88 commented 2 years ago

In the latest IB (CMSSW_12_4_X_2022-05-03-1100), the workflow failed for slc7_amd64_gcc11 (it succeeded in the production arch, slc7_amd64_gcc10).

Thanks to #37764, we get a little more information:

%MSG-e TritonService:  AfterModDestruction 03-May-2022 15:54:33 CEST pre-events
CMS_TRITON_GRPC_PORT: 8001
INFO:    instance started successfully
timed out waiting for server to start
INFO:    Stopping triton_server_instance_df61b571-1d28-480b-a54e-75df040f0d71 instance of /cvmfs/unpacked.cern.ch/.flat/87/874d87357f8f9b1b153d4d40a005349f5e2824d5cb46a54f24110f0f5d122c72 (PID=30808)
/cvmfs/cms-ib.cern.ch/nweek-02731/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_X_2022-05-03-1100/bin/slc7_amd64_gcc11/cmsTriton: line 275: 31499 Killed                  $DRYRUN singularity run instance://${SERVER} tritonserver $PORTARGS $REPOARGS $VERBOSE &>${REDIR}

It may be that loading the Singularity container over cvmfs is slow on some IB test nodes. Currently, the server start timeout is set to 120 seconds. Should we increase it to e.g. 300 seconds and see if that resolves the sporadic failures?
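
For reference, the wait logic under discussion amounts to something like the following Python sketch (cmsTriton itself is a bash script; the 300 s value is the proposed replacement for the current 120 s limit, and the polling details here are assumptions):

import socket
import time

def wait_for_server(host="localhost", port=8001, timeout=300, poll=5):
    # Poll until the gRPC port (8001, as in the log above) accepts connections
    # or the timeout expires; a slow cvmfs load simply needs a longer deadline.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=poll):
                return True
        except OSError:
            time.sleep(poll)
    return False  # caller then reports "timed out waiting for server to start"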

makortel commented 2 years ago

In CMSSW_12_4_X_2022-05-03-1100 (which includes https://github.com/cms-sw/cmssw/pull/37764), workflow 10804.31 failed on slc7_amd64_gcc11 with

%MSG-e TritonService:  AfterModDestruction 03-May-2022 15:54:33 CEST pre-events
CMS_TRITON_GRPC_PORT: 8001
INFO:    instance started successfully
timed out waiting for server to start
INFO:    Stopping triton_server_instance_df61b571-1d28-480b-a54e-75df040f0d71 instance of /cvmfs/unpacked.cern.ch/.flat/87/874d87357f8f9b1b153d4d40a005349f5e2824d5cb46a54f24110f0f5d122c72 (PID=30808)
/cvmfs/cms-ib.cern.ch/nweek-02731/slc7_amd64_gcc11/cms/cmssw/CMSSW_12_4_X_2022-05-03-1100/bin/slc7_amd64_gcc11/cmsTriton: line 275: 31499 Killed                  $DRYRUN singularity run instance://${SERVER} tritonserver $PORTARGS $REPOARGS $VERBOSE &>${REDIR}

%MSG
----- Begin Fatal Exception 03-May-2022 15:54:33 CEST-----------------------
An exception of category 'FallbackFailed' occurred while
   [0] Calling beginJob
Exception Message:
TritonService: Starting the fallback server failed with exit code 256
----- End Fatal Exception -------------------------------------------------

https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc11/CMSSW_12_4_X_2022-05-03-1100/pyRelValMatrixLogs/run/10804.31_SingleGammaPt10+2018_photonDRN+SingleGammaPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt10+2018_photonDRN+SingleGammaPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/

qliphy commented 2 years ago

CMSSW_12_4 2022-05-08-0000 failure message:

%MSG-e TritonService:  AfterModDestruction 08-May-2022 12:29:48 CEST pre-events
CMS_TRITON_GRPC_PORT: 8001
INFO:    instance started successfully
timed out waiting for server to start
INFO:    Stopping triton_server_instance_80dac23e-f01f-4c10-ae9c-689a8300f3e0 instance of /cvmfs/unpacked.cern.ch/.flat/87/874d87357f8f9b1b153d4d40a005349f5e2824d5cb46a54f24110f0f5d122c72 (PID=14943)
/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-08-0000/bin/slc7_amd64_gcc10/cmsTriton: line 275: 15154 Killed                  $DRYRUN singularity run instance://${SERVER} tritonserver $PORTARGS $REPOARGS $VERBOSE &>${REDIR}

%MSG
----- Begin Fatal Exception 08-May-2022 12:29:48 CEST-----------------------
An exception of category 'FallbackFailed' occurred while
   [0] Calling beginJob
Exception Message:
TritonService: Starting the fallback server failed with exit code 256
----- End Fatal Exception -------------------------------------------------
08-May-2022 12:29:49 CEST Closed file file:step2.root

kpedro88 commented 2 years ago

I made two suggestions earlier in the thread and did not get feedback.

  1. Increase the server start timeout from 120 seconds to 300 seconds
  2. Print the contents of the server logfile in the exception message (or a LogError if having a very long exception message is undesirable, though if so, I would like to understand why.)

Should I try one or both of these?

Dr15Jones commented 2 years ago

Do not do 2. We already saw that the client side can generate long messages (which we are printing using LogError). If you REALLY want the server logs, they should also be handled via LogError.

We do not want long exception messages, as those are passed on to the workflow management system. In addition, long exception messages make it much harder to quickly understand what the problem is.

Dr15Jones commented 2 years ago

In addition, just increasing the timeout may reduce the error rate, but will not solve the underlying problem (which is still not fully understood).

kpedro88 commented 2 years ago

I think printing the server logs is the only way to distinguish a Triton-specific problem from a cvmfs problem (too high load, slow downloading, etc.). I will submit a PR to put the server log into LogError.

kpedro88 commented 2 years ago

see #37896

kpedro88 commented 2 years ago

Now in CMSSW_12_4_X_2022-05-11-1100, the log shows:

%MSG-e TritonService:  AfterModDestruction 11-May-2022 15:59:48 CEST pre-events
CMS_TRITON_GRPC_PORT: 8001
INFO:    instance started successfully
timed out waiting for server to start
INFO:    Stopping triton_server_instance_4216eb40-8ca8-475c-bdad-5551de8d8112 instance of /cvmfs/unpacked.cern.ch/.flat/87/874d87357f8f9b1b153d4d40a005349f5e2824d5cb46a54f24110f0f5d122c72 (PID=21478)
/cvmfs/cms-ib.cern.ch/nweek-02732/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_X_2022-05-11-1100/bin/slc7_amd64_gcc10/cmsTriton: line 275: 21674 Killed                  $DRYRUN singularity run instance://${SERVER} tritonserver $PORTARGS $REPOARGS $VERBOSE &>${REDIR}

%MSG
%MSG-e TritonService:  AfterModDestruction 11-May-2022 15:59:48 CEST pre-events
TritonService: server log log_triton_server_instance_4216eb40-8ca8-475c-bdad-5551de8d8112.log

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 21.06 (build 24198912)

Copyright (c) 2018-2021, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
find: '/usr/lib/ssl/private': Permission denied

%MSG
----- Begin Fatal Exception 11-May-2022 15:59:48 CEST-----------------------
An exception of category 'FallbackFailed' occurred while
   [0] Calling beginJob
Exception Message:
TritonService: Starting the fallback server failed with exit code 256
----- End Fatal Exception -------------------------------------------------

The server log looks like what I saw when I artificially shortened the server startup waiting time to test #37896. This leads me to suspect it is just slow in loading the server image from cvmfs, so I would go back to my initial suggestion of just increasing the wait time.

kpedro88 commented 2 years ago

Any feedback on increasing the wait time?

smuzaffar commented 2 years ago

@kpedro88, no objections to increasing the timeout to 300 s.

kpedro88 commented 2 years ago

see #38048

makortel commented 2 years ago

Copying @Dr15Jones' https://github.com/cms-sw/cmssw/pull/38048#issuecomment-1136062341

This doesn't appear to have solved all the issues. We had a timeout failure in https://cmssdt.cern.ch/SDT/cgi-bin/logreader/slc7_amd64_gcc10/CMSSW_12_5_X_2022-05-24-1100/pyRelValMatrixLogs/run/10804.31_SingleGammaPt10+2018_photonDRN+SingleGammaPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano/step3_SingleGammaPt10+2018_photonDRN+SingleGammaPt10_pythia8_GenSimINPUT+Digi+RecoFakeHLT+HARVESTFakeHLT+ALCA+Nano.log#/187-187

The relevant part of the log appears to be

%MSG-w TritonFailure:  PatPhotonDRNCorrectionProducer:patPhotonsDRN  24-May-2022 13:44:45 CEST Run: 1 Event: 8
An exception of category 'TritonFailure' occurred.
Exception Message:
evaluate(): unable to run and/or get result: Deadline Exceeded

%MSG

----- Begin Fatal Exception 24-May-2022 13:44:45 CEST-----------------------
An exception of category 'SonicCallFailed' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 2 stream: 1
   [1] Running path 'dqmofflineOnPAT_1_step'
   [2] Prefetching for module SingleTopTChannelLeptonDQM_miniAOD/'singleTopElectronMediumDQM_miniAOD'
   [3] Prefetching for module PATElectronSlimmer/'slimmedElectrons'
   [4] Calling method for module PatPhotonDRNCorrectionProducer/'patPhotonsDRN'
Exception Message:
call failed after max 1 tries
----- End Fatal Exception -------------------------------------------------

kpedro88 commented 2 years ago

This is a different error: the server started fine, but then the inference call did not complete in the maximum allotted time.

Right now, the timeout is set to a low value of 10 seconds: https://github.com/cms-sw/cmssw/blob/1bfd68cd1932e3351c7f120db16314c3d9867962/PhysicsTools/PatAlgos/python/slimming/patPhotonDRNCorrector_cfi.py#L12
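
For illustration, a customise function of the following shape would raise it; the parameter path process.patPhotonsDRN.Client.timeout is assumed from the usual SonicTriton client-configuration pattern and is not verified against the linked cfi:

def raisePhotonDRNTimeout(process, newTimeout=60):
    # Hypothetical customise helper: bump the per-call client timeout (seconds)
    # if the patPhotonsDRN producer is present. Both the helper name and the
    # Client.timeout parameter path are assumptions, not verified code.
    if hasattr(process, "patPhotonsDRN"):
        process.patPhotonsDRN.Client.timeout = newTimeout  # a plain int keeps the parameter's existing type
    return process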

@ssrothman can you suggest a better timeout value that will avoid spurious failures?

ssrothman commented 2 years ago

We can certainly increase the timeout, but it's not obvious to me what the largest reasonable time for an inference call should be. I'm honestly not sure why it would ever take even 10 seconds to run an inference call, given that everything happens on the local fallback server and inference should take only tens of milliseconds.

makortel commented 2 years ago

It seems that in the following IB, CMSSW_12_5_X_2022-05-24-2300, the workflow succeeded in all the IB variants tested, so at least the failure has become less frequent.

smuzaffar commented 1 year ago

@kpedro88 @ssrothman we still see this failure in 12.6.X IBs. Could it be that if multiple relvals run in parallel and try to start the Triton inference server, only one will succeed and all the other server instances will fail?

kpedro88 commented 1 year ago

@smuzaffar as I noted above:

This is a different error: the server started fine, but then the inference call did not complete in the maximum allotted time.

(The cmsTriton script tries to avoid conflicting parallel servers by looking for separate ports: https://github.com/cms-sw/cmssw/blob/38405a5b319be8ec094c981d6b45320aa577676a/HeterogeneousCore/SonicTriton/scripts/cmsTriton#L164-L182. This is not a perfect solution, but the problem here is different.)
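
In Python terms (the actual script is bash, linked above; the names and values here are assumptions), the port-selection idea is roughly:

import socket

def port_is_free(port, host="localhost"):
    # A port is considered free if a connection attempt is refused.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        return s.connect_ex((host, port)) != 0

def find_free_port_block(start=8000, count=3, max_tries=50):
    # Illustrative sketch: look for `count` consecutive free ports (Triton uses
    # separate HTTP/gRPC/metrics ports) so that parallel servers on the same
    # node do not collide; the structure and defaults here are assumptions.
    base = start
    for _ in range(max_tries):
        if all(port_is_free(base + i) for i in range(count)):
            return base
        base += count
    raise RuntimeError("no free port block found")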