CliMA / ClimaAtmos.jl

ClimaAtmos.jl is a library for building atmospheric circulation models that is designed from the outset to leverage data assimilation and machine learning tools. We welcome contributions!
Apache License 2.0

Fix broken NVTX reports #2911

Closed: charleskawczynski closed this issue 1 month ago

charleskawczynski commented 7 months ago

We need to fix the broken NVTX reports, both on central and on clima.

charleskawczynski commented 7 months ago

This is needed to fix #2530.

charleskawczynski commented 6 months ago

Here's a summary of what is passing/failing:

Central
GPU: GPU dry baroclinic wave                           | qdstrm error 27%
GPU: GPU moist Held-Suarez                             | qdstrm error 16%
GPU: GPU moist Held-Suarez cloud diagnostics per stage | qdstrm error 17%
:umbrella: GPU: gpu_aquaplanet_dyamond                 | qdstrm error 27%
GPU: Prognostic EDMFX aquaplanet                       | qdstrm error 55%

Clima
dry baroclinic wave                                    | rpc returns EmptyMessage
moist Held-Suarez                                      | rpc returns EmptyMessage
moist Held-Suarez - 4 gpus                             | multi-rpc returns EmptyMessage
dry baroclinic wave - 4 gpus                           | success
gpu_aquaplanet_dyamond - strong scaling - 1 GPU        | success
gpu_aquaplanet_diagedmf - 1 GPU                        | success

charleskawczynski commented 6 months ago

Error messages are:

qdstrm

Generating '/tmp/slurm-40915165/nsys-report-4d1c.qdstrm'
[1/1] [====27%                     ] report.nsys-rep
Importer error status: Importation failed.
Import Failed with unexpected exception: /dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/QdstrmImporter/main.cpp(34): Throw in function {anonymous}::Importer::Importer(const boost::filesystem::path&, const boost::filesystem::path&)
Dynamic exception type: boost::wrapexcept<QuadDCommon::RuntimeException>
std::exception::what: RuntimeException
[QuadDCommon::tag_message*] = Status: AnalysisFailed
Error {
  Type: RuntimeError
  SubError {
    Type: InvalidArgument
    Props {
      Items {
        Type: OriginalExceptionClass
        Value: "N5boost10wrapexceptIN11QuadDCommon24InvalidArgumentExceptionEEE"
      }
      Items {
        Type: OriginalFile
        Value: "/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Host/Analysis/Modules/EventCollection.cpp"
      }
      Items {
        Type: OriginalLine
        Value: "1055"
      }
      Items {
        Type: OriginalFunction
        Value: "void QuadDAnalysis::EventCollection::CheckOrder(QuadDAnalysis::EventCollectionHelper::EventContainer&, const QuadDAnalysis::ConstEvent&) const"
      }
      Items {
        Type: ErrorText
        Value: "Wrong event order has been detected when adding events to the collection:\nnew event ={ StartNs=403098042813 StopNs=403129613160 GlobalId=349883374385042 Event={ TraceProcessEvent=[{ Correlation=139850 EventClass=1 TextId=920 ReturnValue=0 },] } Type=48 }\nlast event ={ StartNs=448052547615 StopNs=448084509068 GlobalId=349883374385042 Event={ TraceProcessEvent=[{ Correlation=209574 EventClass=1 TextId=920 ReturnValue=0 },] } Type=48 }"
      }
    }
  }
}
Generated:
    /central/scratch/esm/slurm-buildkite/climaatmos-ci/18394/climaatmos-ci/target_gpu_implicit_baroclinic_wave/output_active/report.qdstrm

and

/dvs/p4/build/sw/devtools/Agora/Rel/QuadD_Main/QuadD/Common/AgentAPI/Src/SessionImpl.cpp(18): rpc Start(.Agent.StartRequest) returns (.Agent.EmptyMessage);
 is canceled because the timeout period is expired
🚨 Error: The command exited with status 1
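
For reference, the ordering violation in the qdstrm error can be read off the two events quoted in the ErrorText. The snippet below is a minimal sketch, assuming the importer's `CheckOrder` compares `StartNs` values (the message suggests this but does not state it); `error_text` is a shortened excerpt of the log above and `start_times_ns` is a hypothetical helper, not part of any Nsight tooling.

```python
import re

# Shortened excerpt of the ErrorText above (only the timestamp fields are kept).
error_text = (
    "new event ={ StartNs=403098042813 StopNs=403129613160 } "
    "last event ={ StartNs=448052547615 StopNs=448084509068 }"
)


def start_times_ns(text):
    """Return every StartNs value found in the text, in order of appearance."""
    return [int(value) for value in re.findall(r"StartNs=(\d+)", text)]


# The first match belongs to the new event, the second to the last stored event.
new_start, last_start = start_times_ns(error_text)

# Assumption: events are expected to arrive with non-decreasing StartNs.
out_of_order = new_start < last_start
gap_s = (last_start - new_start) / 1e9

print(f"out of order: {out_of_order}, new event starts {gap_s:.2f} s earlier")
```

Under that assumption, the new event starts roughly 45 seconds before the event already at the end of the collection, which matches the "Wrong event order" text.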

Sbozzolo commented 3 months ago

I think I fixed this at some point (at least, on clima). Is this still an issue?

charleskawczynski commented 3 months ago

Yeah, the original failure seems to be fixed, but it does look like one issue remains: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/330#019151a2-dc7f-4525-aba6-b92ea170dd76:

┌ Info: Progress
│   simulation_time = "4 hours, 49 minutes"
│   n_steps_completed = 193
│   wall_time_per_step = "945 milliseconds, 292 microseconds"
│   wall_time_total = "15 minutes, 7 seconds"
│   wall_time_remaining = "12 minutes, 5 seconds"
│   wall_time_spent = "3 minutes, 2 seconds"
│   percent_complete = "20.1%"
│   sypd = 0.261
│   date_now = 2024-08-14T09:56:47.422
└   estimated_finish_date = 2024-08-14T10:08:52.461
The target application terminated. One or more process it created re-parented.
Waiting for termination of re-parented processes.
Use the `--wait` option to modify this behavior.
Generating '/tmp/slurm-35905/nsys-report-09db.qdstrm'
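
The `--wait` option mentioned in the log controls whether `nsys profile` waits only for the launched process (`primary`) or also for re-parented processes (`all`). The sketch below shows how such a command line could be assembled; it is an illustration only. The helper `nsys_profile_command`, the `driver.jl` script name, and the output and trace settings are assumptions, not the command the CI pipeline actually runs, and nothing in this thread establishes that changing `--wait` resolves the failure.

```python
import shlex


def nsys_profile_command(app_cmd, output="report", wait="primary"):
    """Build an argv list for `nsys profile` (hypothetical helper, assumed settings)."""
    if wait not in ("primary", "all"):
        raise ValueError("wait must be 'primary' or 'all'")
    return [
        "nsys", "profile",
        f"--wait={wait}",        # primary: do not wait for re-parented processes
        f"--output={output}",    # base name of the generated report
        "--trace=nvtx,cuda",     # collect NVTX ranges and CUDA API calls
        *app_cmd,
    ]


# Hypothetical application command; the real driver script is not shown in the thread.
cmd = nsys_profile_command(["julia", "--project", "driver.jl"])
print(shlex.join(cmd))
```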

Should we keep this issue open for this new error? The title is sufficiently general 🤷🏻‍♂️

Sbozzolo commented 3 months ago

Yes, at least this seems to be consistent. It is always with that particular job:

https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/329#01910503-1d89-4183-94b7-8c69e98619a2

Sbozzolo commented 1 month ago

The pipeline is still failing:

https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/349#019272fa-9aad-45fb-b6c0-375ad1481b51

Sbozzolo commented 1 month ago

Without nsight, the jobs run to completion: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/344

charleskawczynski commented 1 month ago

> Without nsight, the jobs run to completion: https://buildkite.com/clima/climaatmos-target-gpu-simulations/builds/344

That's a different error: the reports are now being generated, but the job is running out of memory (OOM). I'm going to close this and open a new issue.

charleskawczynski commented 1 month ago

Opened #3375.