cp2k / dbcsr

DBCSR: Distributed Block Compressed Sparse Row matrix library
https://cp2k.github.io/dbcsr/
GNU General Public License v2.0
134 stars 45 forks source link

Fixed #776 #777

Closed hfp closed 2 months ago

hfp commented 3 months ago
hfp commented 3 months ago

Issue #776 was discovered when testing with enabled assertions, i.e., DBCSR's CUDA tests may have assertions removed. Perhaps it is valuable to test with enabled assertions.

alazzaro commented 3 months ago

@hfp for my understanding:

In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not that case... I think we can move the call to a more convenient place...

alazzaro commented 3 months ago

(BTW, trying to recover the Daint-CI output...)

alazzaro commented 3 months ago

CSCS CI seems broken on their side:

+ sbatch --wait --time=0:15:00 --account=g90 --job-name=DBCSR.gnu.build --output=sbatch.jenkins-g90-DBCSR-1116.gnu.build.out .ci/daint.cscs.ch/gnu.build.sh
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
java.nio.file.NoSuchFileException: /users/jenkg90/workspace/g90/DBCSR/sbatch.jenkins-g90-DBCSR-1116.gnu.build.out

but we have budget... Please ignore it for the moment.

mkrack commented 3 months ago

The cp2k regression tests on Piz Daint are also disable, because the project g90 has expired. sbatch returns

project "g90" expired on 2024-03-31

@juerghutter could you have a look?

juerghutter commented 3 months ago

Project g90 is open again (until 2025-03-31).


From: Matthias Krack @.***> Sent: Friday, April 5, 2024 10:13 AM To: cp2k/dbcsr Cc: Jürg Hutter; Mention Subject: Re: [cp2k/dbcsr] Fixed #776 (PR #777)

The cp2k regression tests on Piz Dainthttps://dashboard.cp2k.org/index.html are also disable, because the project g90 has expired. sbatch returns

project "g90" expired on 2024-03-31

@juerghutterhttps://github.com/juerghutter could you have a look?

— Reply to this email directly, view it on GitHubhttps://github.com/cp2k/dbcsr/pull/777#issuecomment-2039219944, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AD2WEURGQJDACA44SJJ3L7TY3ZMMNAVCNFSM6AAAAABFXKHJWWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZZGIYTSOJUGQ. You are receiving this because you were mentioned.Message ID: @.***>

hfp commented 2 months ago

@hfp for my understanding:

* We don't see the issue since we NDEBUG is defined? But we really don't define it in CP2K tests...

Ok, I just assumed this because the issue came up when I removed NDEBUG. I will check grep through CP2K just to make sure there is nothing else.

* We really never call printf on the GPU, do we?

No, we don't. Perhaps someone did so during development and wanted to keep this setting.

* I don't understand the relation between NDEBUG and [ACC_API_CALL](https://github.com/cp2k/dbcsr/blob/6db5b28d236de28e7293f783a3c2cc672d202f6b/src/acc/cuda/acc_cuda.h#L29)

ACK; see above (me neither ;-).

In any case, your change makes sense to me. I think the entire assumption was that the first call to the ACC part was c_dbcsr_acc_set_active_device, assuming we call it only once, which is clearly not that case... I think we can move the call to a more convenient place...

OK, this is good to go in principle. However, I will move the call into the init function.

hfp commented 2 months ago

I think we can move the call to a more convenient place...

What do you suggest? Putting it into acc_init may not be the right thing as it is device specific.

I wonder if the code in question should be removed entirely?

hfp commented 2 months ago

I rebased the PR and if it's green (let's hope for Daint-CI), I will merge it. Removing (or moving) the code in question might be another PR.

alazzaro commented 2 months ago

ACK.