eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License
64 stars 14 forks source link

Don't use `hipMemcpyDefault` in 2D memcpys due to bugs in HIP #1106

Closed msimberg closed 6 months ago

msimberg commented 7 months ago

2D memcpys involving host buffers with non-zero offsets are broken on HIP 5.6 onwards. Some cases have been fixed in https://github.com/ROCm/clr/commit/56daa6c4891b43ec233e9c63f755e3f7b45842b4 (included in 5.7.0) and further fixes in are in https://github.com/ROCm/clr/commit/d3bfb55d7a934355257a72fab538a0a634b43cad (not released as of this PR).

Where it's straightforward to do so I've simply replaced whip::memcpy_default with the explicit kind. In lacpy we don't know the direction anymore so I've added an additional parameter for the kind that has to be passed by the caller. Finally, I have not changed all instances of whip::memcpy_default (e.g. in 1D memcpys) though I wouldn't oppose that at all as I find being explicit is rarely a bad thing.

msimberg commented 7 months ago

cscs-ci run

msimberg commented 7 months ago

cscs-ci run

rasolca commented 7 months ago

I would prefer not to add the extra parameter to lacpy.

Did you considered to use hipPointerGetAttributes to identify the MemcpyKind directly in lacpy only for buggy versions of rocm and continue using MemcpyDefault when it works?

msimberg commented 7 months ago

I would prefer not to add the extra parameter to lacpy.

Did you considered to use hipPointerGetAttributes to identify the MemcpyKind directly in lacpy only for buggy versions of rocm and continue using MemcpyDefault when it works?

I did not, but that sounds like a good idea! I'll check if we can use that and update to use that instead if that's the case.

msimberg commented 7 months ago

cscs-ci run

msimberg commented 7 months ago

cscs-ci run

msimberg commented 7 months ago

@rasolca I think using hipPointerGetAttributes is working just fine. I was able to run a small set of benchmarks with the latest change that otherwise would fail with master. I added a couple of assertions:

rasolca commented 7 months ago

The enum contains different values depending on the rocm version. https://docs.amd.com/projects/HIP/en/docs-5.6.0/.doxygen/docBin/html/hip__runtime__api_8h.html#aea86e91d3cd65992d787b39b218435a3

For sure hipMemoryTypeUnregistered has to be considered as it represent system allocated memory. (For 5.6 I suppose that it works as for CUDA which returns an error).

msimberg commented 7 months ago

https://docs.amd.com/projects/HIP/en/latest/doxygen/html/group___memory.html#ga7c3e8663feebb7be9fd3a1e5139bcefc seems to indicate that maybe even the newer versions return an error for unregistered memory, but it's not 100% clear from the description. I'll try it out and see what it returns.

msimberg commented 7 months ago

@rasolca indeed 5.6.1 returns hipErrorInvalidValue with a mallocd pointer, 6.0.2 gives hipMemoryTypeUnregistered.

We could:

Perhaps worth discussing in today's meeting.

msimberg commented 7 months ago

cscs-ci run

msimberg commented 7 months ago

cscs-ci run

msimberg commented 7 months ago

cscs-ci run

msimberg commented 7 months ago

Just a minor comment: wondering if this workaround shouldn't be moved to whip. Feel free to ignore this, we can always move it if we decide it's a good idea.

It's a very good point, but at least at this point I'm not confident enough in the workaround to put it in whip. I know it works for our use cases and I think it works for others, but I really haven't tested it extensively enough to put it into whip right now.