apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[CI][Python][C++] Support on Power Architecture #43817

Open sandeepgupta12 opened 2 weeks ago

sandeepgupta12 commented 2 weeks ago

Describe the enhancement requested

Description: We need to extend support for apache/arrow to the POWER/PPC64LE architecture.

Background:

• We have forked the apache/arrow repository and have successfully generated and tested wheels for both C++ and Python using a self-hosted CI runner on an OSU PPC64LE machine.
• The forked repository includes the following changes:

  1. Added a job for PPC64LE in .github/workflows/cpp.yaml and .github/workflows/python.yaml, along with corresponding updates to docker-compose.yaml.
  2. Created new Dockerfiles for C++ and Python: ci/docker/ppc64le-cpp.dockerfile and ci/docker/ppc64le-python.dockerfile.
  3. Added build and test scripts for Python: ci/scripts/ppc64le_python_build.sh and ci/scripts/ppc64le_python_test.sh.

• We would like to upstream these changes to enable CI for the ppc64le architecture using a GHA self-hosted runner.
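A job of the kind described above could look roughly like the following sketch of an entry in .github/workflows/cpp.yaml. The job name, runner labels, and script names here are illustrative assumptions, not the exact contents of the fork:

```yaml
# Hypothetical ppc64le job for a GHA self-hosted runner.
jobs:
  ppc64le-cpp:
    name: C++ on ppc64le (self-hosted)
    runs-on: [self-hosted, linux, ppc64le]   # must match the runner's registered labels
    timeout-minutes: 120
    steps:
      - name: Checkout Arrow
        uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Build C++
        run: ci/scripts/ppc64le_cpp_build.sh   # hypothetical script name
      - name: Run tests
        run: ci/scripts/ppc64le_cpp_test.sh    # hypothetical script name
```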

Fork Information: • Forked Repository: https://github.com/sandeepgupta12/arrow

Request:

• Support for PPC64LE: We are seeking support for the PPC64LE architecture in the apache/arrow project.
• Creation of an OSU VM: To facilitate further testing and CI integration, we request the creation of an OSU VM configured for PPC64LE. Details for requesting the VM:
  - URL: https://osuosl.org/services/powerdev/request_hosting/
  - IBM Advocate: gerrit@us.ibm.com

Details: The Open Source Lab (OSL) at Oregon State University (OSU), in partnership with IBM, provides access to IBM Power processor-based servers for developing and testing open source projects. The OSL offers the following clusters:

OpenStack (non-GPU) Cluster:
• Architecture: Power little endian (LE) instances
• Virtualization: Kernel-based Virtual Machine (KVM)
• Access: Via Secure Shell (SSH) and/or through OpenStack's API and GUI interface
• Capabilities: Ideal for functional development and continuous integration (CI) work. It supports a managed Jenkins service hosted on the cluster or as a node incorporated into an external CI/CD pipeline.

Additional Information: • We are prepared to provide any further details or assistance needed to support the PPC64LE architecture. Please let us know if there are any specific requirements or steps needed to move forward with this request.

Component(s)

C++, Python

raulcd commented 2 weeks ago

Thanks for reviving this. Since we moved away from Travis we stopped testing with little endian. I remember @kiszk discussing using osuosl for s390x here: https://github.com/apache/arrow/pull/35374#issuecomment-1541882889

I am concerned about the security implications of managing those boxes. Is this done by OSL? Are the VMs ephemeral or are they long-living? Do we have to ask ASF infra (@assignUser)?

kiszk commented 2 weeks ago

Yes, I talked about OSL. But I recently changed my mind toward using a GHA self-hosted runner after I saw this article: https://community.ibm.com/community/user/powerdeveloper/blogs/gerrit-huizenga/2024/03/06/github-actions-runner-for-ibm-power-and-linuxone

assignUser commented 2 weeks ago

I agree with @raulcd: we cannot support any non-ephemeral VM runners for security reasons; they are much too big a risk in a public repo. This has been used to compromise major open-source repos before: https://www.legitsecurity.com/blog/github-pytorch-and-more-organizations-found-vulnerable-to-self-hosted-runner-attacks

I'd be happy to add Power runners if they are ephemeral (-> the VM gets destroyed after each job), which is what we currently have for ARM runners using k8s: https://github.com/voltrondata-labs/gha-controller-infra

kiszk commented 2 weeks ago

@raulcd @assignUser Thank you for sharing useful information.

As far as I know, this self-hosted runner framework for ppc64le and s390x uses ephemeral VMs.

assignUser commented 2 weeks ago

@kiszk No, I don't think it is; the "ephemeral" there refers to the image and how it needs to be built with the runner token to work, at least that's how I read it.

The line where it starts the runner doesn't have any mechanism to kill the container and start a new one for each job (as would be required for ephemeral runners). That is what the controller is for: it starts a new container/runner for each job and removes the old one.
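The control flow described above can be sketched in shell-style pseudocode. This is not the voltrondata-labs implementation, just the pattern it implements; `runner-image`, `REPO_URL`, and `REG_TOKEN` are placeholders:

```sh
# Ephemeral-runner controller loop (illustrative sketch only).
while true; do
  # Fresh container for every job: nothing from a previous job survives.
  cid=$(docker run -d runner-image sleep infinity)

  # --ephemeral registers the runner for exactly one job; GitHub
  # automatically deregisters it afterwards.
  docker exec "$cid" ./config.sh --unattended --ephemeral \
      --url "$REPO_URL" --token "$REG_TOKEN"

  # run.sh exits once the single job has completed.
  docker exec "$cid" ./run.sh

  # Destroy the container before picking up the next job.
  docker rm -f "$cid"
done
```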

anup-kodlekere commented 1 week ago

@assignUser Hi! If ephemerality is the concern, then we can set the config parameters to launch ephemeral LXD containers; that wouldn't be an issue. You would still need to follow the instructions in https://github.com/anup-kodlekere/gaplib; the only thing that changes is how the containers are deployed and managed. However, we haven't tested this use-case before and would need to run some tests to ensure functional correctness. A simple systemd service running a Python/Bash script would act as the controller in this case, launching a clean LXD build environment (within the same VM host) for each new job.
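The per-job launch step such a controller would perform can be sketched as follows. The image and container names, `REPO_URL`, and `TOKEN` are placeholder assumptions; `lxc launch --ephemeral` marks the container for deletion as soon as it stops:

```sh
# Launching one ephemeral LXD container per job (illustrative sketch only).
lxc launch runner-image job-runner --ephemeral
lxc exec job-runner -- ./config.sh --unattended --ephemeral \
    --url "$REPO_URL" --token "$TOKEN"
lxc exec job-runner -- ./run.sh    # exits after a single job
lxc stop job-runner                # ephemeral: container is removed on stop
```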

kiszk commented 4 days ago

@anup-kodlekere Thanks, great to hear that. Once the updated instructions are prepared, I could try them for arrow on s390x as a test.