Open eapolinario opened 6 months ago
Hey @eapolinario, fancy seeing you here again! Hope you're doing well despite this tricky bug.
Based on the information you've provided, it seems like you're experiencing a segmentation fault when running the test_end_to_end function from the test_elastic_task.py file. Segmentation faults are usually caused by a program trying to read or write an illegal memory location.
Given that the error occurs in the c10d_rendezvous_backend.py file, the issue may be related to PyTorch's distributed package. This package is used for elastic training, which allows the training process to be resilient to failures and resource changes.
However, without more specific information about the error, it's hard to pinpoint the exact cause. It would be helpful if you could provide more details about the error message, such as the line number where the error occurs and the full stack trace.
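One way to capture the stack trace mentioned above is Python's standard-library faulthandler module, which dumps the Python traceback of every thread when the interpreter receives a fatal signal such as SIGSEGV. The following is a minimal sketch, not part of the Flyte test suite; the helper name enable_crash_tracebacks is hypothetical.

```python
import faulthandler
import sys


def enable_crash_tracebacks(stream=None):
    """Dump the Python traceback of all threads on a fatal signal.

    `stream` must be an open file with a real file descriptor and must
    stay open for as long as faulthandler is enabled. Defaults to stderr.
    (Hypothetical helper name, used here for illustration only.)
    """
    if stream is None:
        stream = sys.stderr
    faulthandler.enable(file=stream, all_threads=True)
    return faulthandler.is_enabled()
```

Calling this helper before the failing test runs, or starting the interpreter with `python -X faulthandler -m pytest ...`, should print the Python frames that were active at the time of the crash. This only shows the Python side of the stack; a native backtrace from the core dump would still need a debugger.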
In the meantime, you might want to check the versions of your dependencies. The test_end_to_end function requires several dependencies to run successfully, including Python 3.11, pytest-8.0.2, pluggy-1.4.0, and others. If you're using Python 3.12.2, there may be compatibility issues.
You could try downgrading your Python version to 3.11 and see if the error still occurs. If it does, please provide more details about the error so we can further investigate the issue.
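To make the version comparison concrete, here is a small sketch that reports the interpreter version and the installed versions of a few packages using only the standard library. The function name report_versions and the default package list are assumptions for illustration; adjust the list to whatever the failing environment actually installs.

```python
import sys
from importlib import metadata


def report_versions(packages=("torch", "pytest", "pluggy")):
    """Return a dict of version strings for Python and each named package.

    A package that is not installed maps to None.
    (Hypothetical helper; the default package list is an assumption.)
    """
    versions = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in both the Python 3.11 and Python 3.12 environments and pasting the output into the issue would show whether the two setups differ in anything other than the interpreter.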
For more information about the test_end_to_end function and its dependencies, you can refer to the .github/workflows/end2end.yml file in the Flyte repository. This file contains the workflow for running end-to-end tests, including the test_end_to_end function.
I hope this helps! If you have any other questions or need further clarification, please don't hesitate to ask.
Describe the bug
As the title says. Tests for the kf-pytorch plugin core dump:
Expected behavior
Tests run on Python 3.12
Additional context to reproduce
No response
Screenshots
No response
Are you sure this issue hasn't been raised already?
Have you read the Code of Conduct?