Closed charleskawczynski closed 1 month ago
This is needed to fix #2530.
Here's a summary of what is passing/failing:
Central

| Job | Result |
|---|---|
| GPU: GPU dry baroclinic wave | qdstrm error 27% |
| GPU: GPU moist Held-Suarez | qdstrm error 16% |
| GPU: GPU moist Held-Suarez cloud diagnostics per stage | qdstrm error 17% |
| :umbrella: GPU: gpu_aquaplanet_dyamond | qdstrm error 27% |
| GPU: Prognostic EDMFX aquaplanet | qdstrm error 55% |

Clima

| Job | Result |
|---|---|
| dry baroclinic wave | rpc returns EmptyMessage |
| moist Held-Suarez | rpc returns EmptyMessage |
| moist Held-Suarez - 4 gpus | multi-rpc returns EmptyMessage |
| dry baroclinic wave - 4 gpus | success |
| gpu_aquaplanet_dyamond - strong scaling - 1 GPU | success |
| gpu_aquaplanet_diagedmf - 1 GPU | success |

Error messages are:
qdstrm
Generating '/tmp/slurm-40915165/nsys-report-4d1c.qdstrm'
[1/1] [====27% ] report.nsys-rep
Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/QdstrmImporter/main.cpp(34): Throw in function {anonymous}::Importer::Importer(const boost::filesystem::path&, const boost::filesystem::path&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::RuntimeException>
std::exception::what: RuntimeException
[QuadDCommon::tag_message*] = Status: AnalysisFailed
Error {
Type: RuntimeError
SubError {
Type: InvalidArgument
Props {
Items {
Type: OriginalExceptionClass
Value: "N5boost10wrapexceptIN11QuadDCommon24InvalidArgumentExceptionEEE"
}
Items {
Type: OriginalFile
Value: "/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/Modules/EventCollection.cpp"
}
Items {
Type: OriginalLine
Value: "1055"
}
Items {
Type: OriginalFunction
Value: "void QuadDAnalysis::EventCollection::CheckOrder(QuadDAnalysis::EventCollectionHelper::EventContainer&, const QuadDAnalysis::ConstEvent&) const"
}
Items {
Type: ErrorText
Value: "Wrong event order has been detected when adding events to the collection:\nnew event ={ StartNs=403098042813 StopNs=403129613160 GlobalId=349883374385042 Event={ TraceProcessEvent=[{ Correlation=139850 EventClass=1 TextId=920 ReturnValue=0 },] } Type=48 }\nlast event ={ StartNs=448052547615 StopNs=448084509068 GlobalId=349883374385042 Event={ TraceProcessEvent=[{ Correlation=209574 EventClass=1 TextId=920 ReturnValue=0 },] } Type=48 }"
}
}
}
}
Generated:
/central/scratch/esm/slurm-buildkite/climaatmos-ci/18394/climaatmos-ci/target_gpu_implicit_baroclinic_wave/output_active/report.qdstrm
and
/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/AgentAPI/Src/SessionImpl.cpp(18): rpc Start(.Agent.StartRequest) returns (.Agent.EmptyMessage);
is canceled because the timeout period is expired
🚨 Error: The command exited with status 1
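For anyone reading the qdstrm `ErrorText` above: the two events share a `GlobalId`, and the "new" event starts roughly 45 seconds before the "last" one already in the collection. Here is a minimal Python sketch that pulls the timing fields out of that message and compares them. The helper name `parse_events` is hypothetical, and the idea that `CheckOrder` expects non-decreasing `StartNs` per source is an assumption based on the function name and message text, not something confirmed from Nsight Systems documentation.

```python
import re

# Timing fields copied from the ErrorText in the importer failure
# (the TraceProcessEvent payload is omitted for brevity).
error_text = (
    "new event ={ StartNs=403098042813 StopNs=403129613160 GlobalId=349883374385042 }\n"
    "last event ={ StartNs=448052547615 StopNs=448084509068 GlobalId=349883374385042 }"
)


def parse_events(text):
    """Return {'new': {...}, 'last': {...}} with integer fields per event line."""
    events = {}
    for label, body in re.findall(r"(new|last) event =\{(.*)\}", text):
        events[label] = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", body)}
    return events


events = parse_events(error_text)
new, last = events["new"], events["last"]

# Assumption: the importer wants each added event to start no earlier
# than the previous event from the same source.
same_source = new["GlobalId"] == last["GlobalId"]
out_of_order = new["StartNs"] < last["StartNs"]
gap_s = (last["StartNs"] - new["StartNs"]) / 1e9

print(f"same GlobalId: {same_source}")
print(f"new event starts before last event: {out_of_order}")
print(f"gap between start times: {gap_s:.2f} s")
```

Running the same comparison on the `ErrorText` from the other failing jobs would show whether the out-of-order event always comes from the same source.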
I think I fixed this at some point (at least, on clima). Is this still an issue?
Yeah, the original failure seems to be fixed, but it does look like one issue remains: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/330#019151a2-dc7f-4525-aba6-b92ea170dd76:
┌ Info: Progress
│ simulation_time = "4 hours, 49 minutes"
│ n_steps_completed = 193
│ wall_time_per_step = "945 milliseconds, 292 microseconds"
│ wall_time_total = "15 minutes, 7 seconds"
│ wall_time_remaining = "12 minutes, 5 seconds"
│ wall_time_spent = "3 minutes, 2 seconds"
│ percent_complete = "20.1%"
│ sypd = 0.261
│ date_now = 2024-08-14T09:56:47.422
└ estimated_finish_date = 2024-08-14T10:08:52.461
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generating '/tmp/slurm-35905/nsys-report-09db.qdstrm'
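As a sanity check on the progress block above, the reported timings are mutually consistent, which suggests the simulation was stepping normally at about 20% when the target application terminated. A small Python sketch using only the numbers from the log; the variable names are illustrative, and the relation "total = spent / fraction complete" is an assumption about how the progress logger extrapolates.

```python
# Values copied from the progress log.
n_steps_completed = 193
wall_time_per_step_s = 0.945292   # "945 milliseconds, 292 microseconds"
fraction_complete = 0.201         # "20.1%" (rounded in the log)

# Time spent so far, from step count and per-step cost.
spent_s = n_steps_completed * wall_time_per_step_s

# Assumed extrapolation: total wall time scales linearly with progress.
total_s = spent_s / fraction_complete
remaining_s = total_s - spent_s


def as_min_sec(seconds):
    """Split a duration in seconds into whole (minutes, seconds)."""
    whole = int(seconds)
    return whole // 60, whole % 60


print("spent     (min, s):", as_min_sec(spent_s))
print("total     (min, s):", as_min_sec(total_s))
print("remaining (min, s):", as_min_sec(remaining_s))
```

These agree with the logged `wall_time_spent`, `wall_time_total`, and `wall_time_remaining` to within rounding of `percent_complete`.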
Should we keep this issue open for this new error? The title is sufficiently general 🤷🏻♂️
Yes, at least this seems to be consistent. It is always with that particular job:
The pipeline is still failing:
Without nsight, the jobs run to completion: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/344
That's a different error: the reports are being generated, but now the job is OOMing. I'm going to close this and open a new issue.
Opened #3375.
We need to fix the broken NVTX reports, both on central and on clima.