`SimpleMultiG` sediment models will not run on GPU

jagoosw commented 1 year ago

While the kernel for SimpleMultiG sediments will compile on GPU they will not run due to their complexity leading to a PTX error about parameter size being thrown. This may be fixed in the future if LLVM updates its CUDA version support (https://github.com/JuliaGPU/CUDA.jl/issues/2080). We may also be able to rework the model to use less parameter space so please do get in contact if you would like to use the model on GPU.

An error similar to the following will be raised:

Failed to compile PTX code (ptxas exited with code 255)
Invocation arguments: --generate-line-info --verbose --gpu-name sm_75 --output-file /tmp/jl_bELIheLkzO.cubin /tmp/jl_VczMJifKRu.ptx
ptxas /tmp/jl_VczMJifKRu.ptx, line 3215; error   : Entry function '_Z26gpu__calculate_tendencies_16CompilerMetadataI10StaticSizeI6_3__3_E12DynamicCheckvv7NDRangeILi2ES0_I6_1__1_ES0_I8_16__16_EvvEE12SimpleMultiGI7Float6410NamedTupleI24__A___B___C___D___E___F_5TupleIS4_S4_S4_S4_S4_S4_EES5_I24__A___B___C___D___E___F_S6_IS4_S4_S4_S4_S4_S4_EES5_I24__A___B___C___D___E___F_S6_IS4_S4_S4_S4_S4_S4_EES5_I24__A___B___C___D___depth_S6_IS4_S4_5Int64S4_S4_EES5_I52__C_slow___C_fast___N_slow___N_fast___C_ref___N_ref_S6_I11OffsetArrayIS4_Li3E13CuDeviceArrayIS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEEEES5_I10__G____G__S6_IS5_I52__C_slow___C_fast___N_slow___N_fast___C_ref___N_ref_S6_IS8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEEEES5_I52__C_slow___C_fast___N_slow___N_fast___C_ref___N_ref_S6_IS8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEES8_IS4_Li3ES9_IS4_Li3ELi1EEEEEEES9_IS7_Li2ELi1EEE7LOBSTERIS4_3ValI18_true__true__true_ES5_I14__sPOM___bPOM_S6_IS5_I12__u___v___w_S6_I13ConstantFieldIS7_Li3EES12_IS7_Li3EES8_IS4_Li3ES9_IS4_Li3ELi1EEEEES5_I12__u___v___w_S6_IS12_IS7_Li3EES12_IS7_Li3EES8_IS4_Li3ES9_IS4_Li3ELi1EEEEEEEE15RectilinearGridIS4_8PeriodicS14_7BoundedS4_S4_S4_S8_IS4_Li1...
ptxas /tmp/jl_VczMJifKRu.ptx, line 3215; error   : Feature 'Kernel parameter size larger than 4352 bytes' requires PTX ISA .version 8.1 or later
ptxas fatal   : Ptx assembly aborted due to errors
If you think this is a bug, please file an issue and attach /tmp/jl_VczMJifKRu.ptx

Also, note that InstantRemineralisation sediment does work on GPU.

Discussed in https://github.com/OceanBioME/OceanBioME.jl/pull/138.

glwagner commented 1 year ago

While the kernel for SimpleMultiG sediments

where is this kernel? Link to code?

jagoosw commented 1 year ago

This is as it currently is in main: https://github.com/OceanBioME/OceanBioME.jl/blob/ac8419a1ce5a06a82be31588583636b212c09598/src/Boundaries/Sediments/simple_multi_G.jl#L159-L234 Which has another bug in it which is fixed in #138

glwagner commented 1 year ago

Interesting. I'm surprised this won't compile. Likely you can simplify the arguments to the kernel to get it to compile. Parameter space issues don't have to do with kernel complexity directly, but rather than complexity of the input arguments (ie using Field rather than OffsetArray). Reducing the input argument complexity is why, for example, we can't support Field directly in GPU kernels and are forced to work with OffsetArrays instead.

One issue might be that the time-stepper is input directly. You only need the named tuple Gⁿ --- it could be better to use that as an input argument instead.

sediment may also have more info than you need on the GPU, which you can simplify with an appropriate adapt_structure method.

In general, its good practice to define adapt_structure to return the minimal info expected to be useful within a GPU kernel so as much is compilable as possible.

jagoosw commented 1 year ago

This is the fixed version that does some of those things: https://github.com/OceanBioME/OceanBioME.jl/blob/8f133dfe5301e9241eddb3a5e5a74b0c0559142e/src/Boundaries/Sediments/simple_multi_G.jl#L175-L246 I think the adapt structure route may be the solution, e.g. at the moment we're unnecessarily passing the G$^{-1}$ tendency fields for the sediment fields, but otherwise this kernel does need the majority of the info in sediment.

glwagner commented 1 year ago

Also, this line does not belong inside the kernel:

     if isnan(pₐₙₒₓ) 
             error("Sediment anoxia has caused model failure") 
         end

If you want to check for NaNs, you should do that outside the kernel. But note that the NaNChecker can also be configured to check for NaNs in any field. You may want to use that instead.

jagoosw commented 1 year ago

Oh yeah that's a good idea. Perhaps that is the better solution to https://github.com/OceanBioME/OceanBioME.jl/issues/144 as well

jagoosw commented 1 year ago

This is the fixed version that does some of those things:

https://github.com/OceanBioME/OceanBioME.jl/blob/8f133dfe5301e9241eddb3a5e5a74b0c0559142e/src/Boundaries/Sediments/simple_multi_G.jl#L175-L246

I think the adapt structure route may be the solution, e.g. at the moment we're unnecessarily passing the G$^{-1}$ tendency fields for the sediment fields, but otherwise this kernel does need the majority of the info in sediment.

This worked!

jagoosw commented 1 year ago

Hmm actually this reduced the parameter size enough on for the test with a rectilinear grid for non hydrostatic and hydrostatic, but still fails on long/lat grid

jagoosw commented 1 year ago

We're also passing a load of repeated information about the advection schemes, maybe that could be reduced

glwagner commented 1 year ago

You want to reduce parameter size to the minimum needed. If you are near the limit, your code could fail when CUDA updates for example. It will also be difficult to develop the code. You should inspect every input argument and ensure that only the minimum is loaded onto the GPU.

jagoosw commented 1 year ago

We do need all of the entries I think but I think the question is if we need to pass all of model.advection because it is very large. I will try to work out a way to just take the ones we need

glwagner commented 1 year ago

You may need all the entries, but you may be missing adapt_structure methods for them.

You probably want to pass just the tracer advection right? For hydrostatic models, the tracer and momentum schemes are stored separately anyways.

jagoosw commented 1 year ago

For the advection schemes, I think we can reduce it by just passing the advection schemes for the tracers which sink.

I'm not sure where else to reduce the size though. Currently, we have:

sediment, underlying_biogeochemistry, grid, advection_schemes, tracers, timestepper.Gⁿ, tendencies.Gⁿ

I've written adapt structures for sediment so it just has the parameters and fields, underlying_biogeochemistry which always just has parameters and the sinking velocities, advection schemes which is either a scheme or NamedTuple, tracers and both tendency sets which are NamedTuples. All of these things are required because they're used or modified, or used to work out the flux of the sinking tracers.

glwagner commented 1 year ago

Which properties have fields embedded, and can you link to the code for those adapt_structure so we can see them here?

jagoosw commented 1 year ago

The sediment models have fields (and tendencies) which are excluded in the adapt_structure

https://github.com/OceanBioME/OceanBioME.jl/blob/5abb4217ab984f2cd32f2bee8742201c68f7e59b/src/Boundaries/Sediments/simple_multi_G.jl#L149-L163

and

https://github.com/OceanBioME/OceanBioME.jl/blob/5abb4217ab984f2cd32f2bee8742201c68f7e59b/src/Boundaries/Sediments/instant_remineralization.jl#L76-L82

and the models have sinking_velocities which are named tuples of named tuples of fields:

https://github.com/OceanBioME/OceanBioME.jl/blob/5abb4217ab984f2cd32f2bee8742201c68f7e59b/src/Models/AdvectedPopulations/NPZD.jl#L306-L321

and

https://github.com/OceanBioME/OceanBioME.jl/blob/5abb4217ab984f2cd32f2bee8742201c68f7e59b/src/Models/AdvectedPopulations/LOBSTER/LOBSTER.jl#L402-L433

We also have this:

https://github.com/OceanBioME/OceanBioME.jl/blob/5abb4217ab984f2cd32f2bee8742201c68f7e59b/src/Utils/sinking_velocity_fields.jl#L29

which I think might be redundant

jagoosw commented 1 year ago

I guess some of the complexity for SimpleMultiG is also that the _params properties are tuples. Do they need to be adapted?

glwagner commented 1 year ago

I guess some of the complexity for SimpleMultiG is also that the _params properties are tuples. Do they need to be adapted?

perhaps! It's conservative to propagate the adapt to every property.

Note that you also want to write your code to be robust to change in the future. When you assume that you don't need to adapt some property, you implicitly prevent someone from improving / extending your model to properties that do need adaptation. By conservatively adapting everything, you grease the wheels for scientific advancement in the future.

glwagner commented 1 year ago

If you call adapt on a tuple, it will call adapt on every one of it's properties:

https://github.com/JuliaGPU/Adapt.jl/blob/df06bcb6936baa7352b8cc7bf5f08f98f2653f25/src/base.jl#L3

The basic structure of any adapt for a custom struct should follow the same logic. The extra thing that custom structs can do it to completely throw away unneeded properties (ie set them to nothing). Or other bespoke actions.

jagoosw commented 1 year ago

Okay this is a good point, I'll update all the adapts to make sure everything gets adapted.

Looking at the error message here:

a lot of the information in lines 2 to 4 is about those named tuples so I'll have a go running it with them changed to tuples

jagoosw commented 1 year ago

That took 8 bytes off the parameter size so wasn't very successful

jagoosw commented 1 year ago

Vectors worked a bit better taking 56 bytes off. For some reason forcing it to only pass one of the advection schemes (rather than an NamedTuple of them) doesn't save any

jagoosw commented 1 year ago

Our problems are solved: https://github.com/JuliaGPU/CUDA.jl/issues/2080 !!

glwagner commented 1 year ago

Wow, that's huge. Hopefully there isn't a catastrophic loss of performance...

Are you sure the function that failed is the one we are concerned about? It's not clear from the error.

glwagner commented 1 year ago

and the models have sinking_velocities which are named tuples of named tuples of fields:

This nested structure is often the cause for issues. You can try to flatten it, perhaps.

jagoosw commented 1 year ago

Yeah hopefully, I can't find docs about the ISA change that allows this, but presumably someone down the line has tested the performance change.

And yeah its this https://github.com/OceanBioME/OceanBioME.jl/blob/5abb4217ab984f2cd32f2bee8742201c68f7e59b/src/Boundaries/Sediments/simple_multi_G.jl#L175

method (there was more detail somewhere else in the full error that confirmed it to me).

jagoosw commented 1 year ago

This nested structure is often the cause for issues. You can try to flatten it, perhaps.

Thinking about it we never have u or v components of slip velocity so we can reduce the complexity to just named tuple. Will test.

jagoosw commented 1 year ago

So near, with all of the optimisation above and removing the u and v components from sinking velocities its still 8bytes too large.

glwagner commented 1 year ago

Out of curiosity (wondering if something I said was wrong) --- does it matter what the body of the kernel is? For example you could comment everything out except perhaps something trivial.

glwagner commented 1 year ago

As a fallback solution, you might try separating these calculations into multiple kernels.

It might be a good idea anyways to reorganize the code so that it's easily to toggle back and forth. For example, right now the tendency calculation for the different species are intertwined.

jagoosw commented 1 year ago

Out of curiosity (wondering if something I said was wrong) --- does it matter what the body of the kernel is? For example you could comment everything out except perhaps something

It does still fail

jagoosw commented 1 year ago

As a fallback solution, you might try separating these calculations into multiple kernels.

It might be a good idea anyways to reorganize the code so that it's easily to toggle back and forth. For example, right now the tendency calculation for the different species are intertwined.

I could see this being a better idea.

It is now working (by only giving it the tracers and tendencies it needs) but I assume the parameter size is close to the limits,is there a way for me to check?

I think I would rather make restructuring how the sediment tendencies are calculated a different PR since this is getting quite long now?

glwagner commented 1 year ago

Of course, no need to solve the world in one PR.

jagoosw commented 1 year ago

Initial issue closed by #138 and remainder superceeded by #147

OceanBioME / OceanBioME.jl

`SimpleMultiG` sediment models will not run on GPU #143