ECP-copa / Cabana

Performance-portable library for particle-based simulations

Merge Cajita #235

Closed sslattery closed 4 years ago

sslattery commented 4 years ago

Adds the Cajita source and attempts a preliminary unification of the build. Cajita now sits in a separate package in the cajita/ directory. Everything sits in its own namespace for now - we can clean things up once the initial merge is complete.

One of the preliminary unification mechanisms is the handling of dependencies. Dependencies are now detected automatically, but a user can mark them as required if desired. We had discussed doing it this way before, but if there are objections we should talk them through. I could see a situation where a dependency is in a system path but you don't want to build against it, for example because it doesn't work. Perhaps the workaround here is an explicit enable/disable option for each dependency.

sslattery commented 4 years ago

@junghans I would appreciate your input on this, as I probably messed some things up. I know the install is not currently working.

junghans commented 4 years ago

@sslattery you don't want to merge the Cajita history?

sslattery commented 4 years ago

I looked into the merge unrelated history option. It seemed to me that it would try to merge both together at the top level directory which would create more of a mess than I wanted with the build. Is this true or can I use that option to merge it in as a subdirectory?

I'm not too attached to the idea of keeping around that old history.

junghans commented 4 years ago

Yes, you will have to move the Cajita source into a subdirectory before the merge.
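As a minimal sketch of that approach (not necessarily the exact steps used for this PR): relocate the files into a subdirectory with a commit in the incoming repository, then merge that branch with `--allow-unrelated-histories`. The repository names `main-repo` and `sub-repo`, the remote name `sub`, and the file contents are hypothetical stand-ins; only the `cajita/` directory name comes from this thread.

```shell
#!/bin/sh
# Sketch: merge an unrelated repository's history into a subdirectory.
# All repository names and file contents below are illustrative assumptions.
set -eu

work=$(mktemp -d)
cd "$work"

git init -q main-repo
git init -q sub-repo

# The receiving repository, with its own top-level build file.
(
  cd main-repo
  git config user.email "dev@example.com"
  git config user.name "Dev"
  echo "top-level build" > CMakeLists.txt
  git add CMakeLists.txt
  git commit -q -m "main: initial commit"
)

# The incoming repository, whose files would collide at the top level.
(
  cd sub-repo
  git config user.email "dev@example.com"
  git config user.name "Dev"
  echo "sub build" > CMakeLists.txt
  git add CMakeLists.txt
  git commit -q -m "sub: initial commit"

  # Step 1: move everything under a subdirectory, as its own commit.
  mkdir cajita
  git mv CMakeLists.txt cajita/
  git commit -q -m "Move sources into cajita/"
)

# Step 2: fetch the incoming history and merge it despite no common ancestor.
cd main-repo
branch=$(git -C ../sub-repo rev-parse --abbrev-ref HEAD)
git remote add sub ../sub-repo
git fetch -q sub
git merge -q --allow-unrelated-histories -m "Merge sub-repo history" "sub/$branch"
```

Because the move happens before the merge, the two trees no longer overlap at the top level, and the incoming commits remain reachable from the merged branch.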

sslattery commented 4 years ago

With the one config change and adding back ArborX on jenkins everything built and passed tests.

I forgot we had pulled ArborX from Jenkins. I'm seeing example build errors on Travis, likely due to the bad install. Did you resolve those as well?

junghans commented 4 years ago

@sslattery now it has all the cajita history with 2c1e6b9 and 0b974e1 being identical contentwise.

sslattery commented 4 years ago

@junghans awesome work - now looks like I have some tests to clean up

sslattery commented 4 years ago

The CUDA build is passing, and I have reproduced the Jenkins HIP errors in Docker, so I am working on that now. No progress on the stack smashing, as my valgrind run came back clean.

rfbird commented 4 years ago

For the stack smashing stuff, I'm pretty sure GCC has flags to try and help you find it. Some documentation here: https://wiki.osdev.org/Stack_Smashing_Protector

Basically:

-fstack-protector: Check for stack smashing in functions with vulnerable objects. This includes functions with buffers larger than 8 bytes or calls to alloca.

-fstack-protector-strong: Like -fstack-protector, but also includes functions with local arrays or references to local frame addresses.

-fstack-protector-all: Check for stack smashing in every function.

Some operating systems have extended their compiler with more relevant options:

-fstack-shuffle: (Found in OpenBSD) Randomize the order of stack variables at compile time. This helps find bugs.

There's also apparently a tool called Mudflap, that can be used to find some stack smashing stuff: "adds runtime error checking for pointers that are typically the cause for many programming errors" (http://www.qnx.com/developers/docs/6.5.0/index.jsp?topic=%2Fcom.qnx.doc.ide.userguide%2Ftopic%2Fdebug_UsingMudflapInIDE_.html)

I'll see if I can re-create this locally with the flags, and let you know.

rfbird commented 4 years ago

I'm having a hard time recreating the stack smashing either locally or on our cluster. I guess the best way forward is to either change Travis versions or add flags to the build there and hope...
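For reference, a sketch of what adding the flags could look like in a generic CMake configure step. This assumes an out-of-source build directory directly below the source tree and uses only the standard `CMAKE_BUILD_TYPE` and `CMAKE_CXX_FLAGS` variables; any project-specific options the CI configuration passes are omitted here.

```shell
# Run from an empty build directory inside the source tree (assumed layout).
# -fstack-protector-all instruments every function; -g keeps symbols for the backtrace.
cmake -DCMAKE_BUILD_TYPE=Debug \
      -DCMAKE_CXX_FLAGS="-fstack-protector-all -g" \
      ..
cmake --build .
ctest --output-on-failure
```

If the canary check fires, the failing test should abort at the return of the function whose frame was overwritten, which narrows down where to look.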

sslattery commented 4 years ago

OK HIP build is working. @dalg24 what do you make of the Jenkins CUDA errors? They have no real info that I can discern and I did not get those errors when I built with the CUDA docker image on my machine.

sslattery commented 4 years ago

retest this please

sslattery commented 4 years ago

OK now BovWriter test is failing on CUDA which leads me to believe there is a problem with that test.

sslattery commented 4 years ago

Was not able to reproduce this error on CADES-condo with P100s. We might need to see if we can get on the test system to check. I did add a fence but that wouldn't explain the OpenMP issue.

sslattery commented 4 years ago

Now getting another CUDA failure on Jenkins - seems random to me

sslattery commented 4 years ago

I'm tempted to give up on the BovWriter, make it experimental, and disable the test or something.

sfogerty commented 4 years ago

I would support making BovWriter experimental. I have tried to reproduce the error and could not.

I think setting this aside is wise - don't want it to hold this PR up. (PS - having trouble calling in to the meeting so contributing any comments where I can)