Open jeremylt opened 1 month ago
ToDo: Track down all the CPU memory leaks, add GPU restores, language wrapers
Ok, rebased to work off the branch with the lingering Operator memory fix. Impl for CPU ref is in, rest in progress
CPU blocked and ref in, GPU non-gen next. (would need to rebase in another open PR to do gen)
OCCA will be annoying so I'm saving it for last
Ok, all backends should be covered, including SYCL (not tested)
The ToDo is handling this for Rust
Working on Rust, need to fix error handing though
Ok, should be good to review now
I've pushed up more fixes to jrwrigh/get-restore
, but there are still a lot of failures to do with the assembly. The failing tests are:
Test: t537-operator
Test: t569-operator
Test: t537-operator
Test: t569-operator
Test: t536-operator-f
Test: t533-operator
Test: t506-operator
Test: t534-operator-f
Test: t536-operator
Test: t570-operator
Test: t534-operator
Test: t533-operator-f
Test: t535-operator-f
Test: t535-operator
Test: t538-operator
Test: t506-operator-f
t570-operator
fails with:
/home/jrwrigh/software/libCEED/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp:1086 in CeedSingleOperatorAssembleSetup_Sycl(): Cannot assemble operator without inputs/outputs
where it's num_eval_mode_out = 0
that is triggering the failure.
t504-operator
fails with:
Large Problem, Component 1: Computed Area 1.254524 != True Area 1.0
The rest are correctness failures in assembly.
I'd try limiting the SYCL changes to the absolute minimum required to get this PR merged.
Those changes in that branch look good, so feel free to push them to this branch
@jrwrigh I think I put back the pieces that were accidentally lost
Needed to apply the following diff to compile:
diff --git i/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp w/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
index 9d2ecec6..a784e1e5 100644
--- i/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
+++ w/backends/sycl-ref/ceed-sycl-ref-operator.sycl.cpp
@@ -662,7 +662,7 @@ static inline int CeedOperatorAssembleDiagonalSetup_Sycl(CeedOperator op) {
CeedBasis basis;
CeedCallBackend(CeedOperatorFieldGetElemRestriction(op_fields[i], &elem_rstr));
- CeedCheck(!rstr_in || rstr_in == rstr, ceed, CEED_ERROR_BACKEND,
+ CeedCheck(!rstr_in || rstr_in == elem_rstr, ceed, CEED_ERROR_BACKEND,
"Backend does not implement multi-field non-composite operator diagonal assembly");
if (!rstr_in) CeedCallBackend(CeedElemRestrictionReferenceCopy(elem_rstr, &rstr_in));
CeedCallBackend(CeedOperatorFieldGetBasis(op_fields[i], &basis));
But I'm still seeing correctness failures for:
t535-operator
t536-operator
t537-operator
t533-operator
t533-operator-f
t506-operator
t536-operator-f
t535-operator-f
t506-operator-f
t538-operator
t534-operator-f
t537-operator
t534-operator
All assembly errors though, so you managed to squash some of the issues.
I'll keep plunking away, thanks for the cross-check!
t506
is the odd one, as that's using the same QF for 2 different CeedOperators
Note
make prove -j BACKENDS="/gpu/sycl/ref /gpu/sycl/gen" PROVE_OPTS=-v
will give the details on the failures for just those
Fixed t506, but the others are still failing: sorted this time
t533-operator
t533-operator-f
t534-operator
t534-operator-f
t535-operator
t535-operator-f
t536-operator
t536-operator-f
t537-operator
t537-operator
t537-operator
t538-operator
They fail with all the SYCL backends (ref, shared, and gen). Not too surprising since the assembly always goes through the ref backend, but just another sanity check.
t506
better is good. That narrows my search. Just operator diagonal left to dig into
Its the same routine for all of them, so now you only need to look at /gpu/sycl/ref
Found another difference. Any useful error messages in the test suite output with PROVE_OPTS=-v
or is it just "entry x is different" sort of stuff?
I've been running with make junit
to get the error messages. But yeah, they're just:
[0] Error in assembly: 0.000000 != 0.002963
[1] Error in assembly: 0.000000 != 0.011852
or similar.
Interesting, I wouldn't have expected getting 0s
Yeah, all the errors are in the form Error in assembly: 0.0000 != ....
Huzzah, at long last. I think this is good to add now, and I can spin up the companion PR to MFEM. Mildly disruptive but not a huge deal as it just leaks if not fixed downstream
Fixes #1639
What do we think @jrwrigh. You can see how this does get a bit invasive.
Ok, now the Vector/ElemRestriction/Basis retrieved from the OperatorField needs to be Destroyed per the corresponding interface.