LLNL / merlin

Machine Learning for HPC Workflows
MIT License
118 stars 26 forks source link

bugfix/celery-chord-error #481

Closed bgunnar5 closed 3 months ago

bgunnar5 commented 3 months ago

Added a patch to Celery's result backend so that ChordErrors stop getting raised and breaking workflows. I've also added a new step return code that's mainly for testing.

bgunnar5 commented 3 months ago

Not sure what's causing python 3.7 to break when it runs the new chord error test... Maybe this is a good time to discuss whether we should drop support for it or not. It is already at EOL and has been for some time now. Thoughts?

lucpeterson commented 3 months ago

Not sure what's causing python 3.7 to break when it runs the new chord error test... Maybe this is a good time to discuss whether we should drop support for it or not. It is already at EOL and has been for some time now. Thoughts?

Yeah it’s hard to tell with the existing logs. Can we reproduce somehow or add verbosity?

bgunnar5 commented 3 months ago

Just downloaded python 3.7 from source now to try to reproduce. The verbosity should already be turned on. Maybe we just need to add more to the code itself

bgunnar5 commented 3 months ago

image Ok now I'm really confused on why the test isn't passing. I just tried with python 3.7 using 2 different sleep times for the test and it's working just fine.

lucpeterson commented 3 months ago

Maybe add a command to the test that prints the tree of files? I wonder if it’s something about the container and maybe running out of space?

bgunnar5 commented 3 months ago

hmmm either the sleep command isn't long enough in the test for Python 3.7 or there's something in the setup that's breaking the test. Perhaps it's something to do with the Celery/Redis versions.

EDIT: Looks like for Python 3.8+ we're using Celery 5.4.0 but for Python 3.7 it's at 5.2.7. I'll try to reproduce locally and see what happens.

EDIT 2: Just checked the virtual environment that I had used previously to test Python 3.7 and it also has Celery 5.2.7 but runs just fine.

bgunnar5 commented 3 months ago

So looks like a higher sleep time is allowing it to progress further in the processing of the workflow for the new test. Not sure what is taking it so long to process the workflow compared to other python versions. I'll probably just modify the sleep time based on the python version for now.

bgunnar5 commented 3 months ago

@lucpeterson this should be ready to go now

lucpeterson commented 3 months ago

I’m wondering what happens if this is run in —local mode. I’m guessing the chord error still happens in that case, since this is a patch just for redis, right?

bgunnar5 commented 3 months ago

I just tried a run in local mode and it went all the way through the workflow

lucpeterson commented 3 months ago

I just tried a run in local mode and it went all the way through the workflow

If you remove the “fix” does the test fail? Eg j in wonder if there’s a test that’s something like

merlin —no-patch run —local chord_error.yaml -> must fail merlin run —local chord_error.yaml —> success

lucpeterson commented 3 months ago

I just tried a run in local mode and it went all the way through the workflow

If you remove the “fix” does the test fail? Eg j in wonder if there’s a test that’s something like

merlin —no-patch run —local chord_error.yaml -> must fail merlin run —local chord_error.yaml —> success

And if —local is a sufficient test we could get rid of the workers and sleep problem and add this to a local test instead of a distributed test

bgunnar5 commented 3 months ago

So it looks like the local run will pass with or without the patch. However, the distributed run does not so I think we have to keep this sleep business embedded in the test for now. I have another branch that's working on updating the test suite so maybe once I work on that more we can use some of those changes to resolve this.