yohell closed 1 week ago
Report: Development of a Guest OS Base Image for SDS 2.0 with GPU Support
In response to the request for a guest OS base image for GPU-enabled VMs for AI research on SDS 2.0, I have researched and developed a first version of the base image, aiming for stability, usability, and security. The image is tailored for AI researchers who require root access on private VMs. Our recommended base OS is the Ubuntu Cloud Image in its 18.04, 20.04, and 22.04 LTS versions.
What are the other platforms doing?
Cloud platforms such as AWS (Amazon Web Services), Azure, and GCE (Google Compute Engine) provide a wide range of base OS images that users can use to launch virtual machines (VMs). These images are suitable for various applications, including general-purpose computing and specialized tasks like deep learning. Here's a summary of OS images you can find on these platforms:
Cloud providers offer specialized images that come with data science and machine learning tools pre-installed, including deep learning frameworks. These include:
AWS Deep Learning AMIs (Amazon Machine Images): These come with the latest versions of deep learning frameworks like TensorFlow, PyTorch, MXNet, and others, pre-installed on Ubuntu or Amazon Linux.
Azure Data Science Virtual Machines (DSVM): Azure provides VM images specifically tailored for data science and machine learning, including tools and environments like PyTorch, TensorFlow, and others.
Google Cloud AI Platform Deep Learning VM Image: This includes a comprehensive suite of machine learning and data science libraries, along with frameworks such as TensorFlow, PyTorch, Keras, and others.
NVIDIA GPU Cloud (NGC): Although not a cloud service provider itself, NVIDIA offers a range of Docker images that are optimized for GPUs and can be used on the above cloud providers. These images include NVIDIA's CUDA toolkit, deep learning frameworks, and HPC applications.
1. Base OS Selection Strategy:
Aspect | Detail |
---|---|
OS Choice | Ubuntu LTS (18.04, 20.04, 22.04) |
Considerations | Long-term support, stability, compatibility with AI tools |
Ubuntu was chosen for its widespread use in the AI research community, its compatibility with most machine learning frameworks, and its long-term support, which is crucial for a stable research environment.
2. Customization Approaches:
Approach 1: Pre-Installed Package Image
Pro | Con |
---|---|
Immediate readiness | Limited flexibility for customized setups |
Consistent environment across users | Regular updates required to the image |
Controlled versions of pre-installed software | Larger image sizes |
Approach 2: On-Demand Installation Image
Pro | Con |
---|---|
Customizable and scalable | Longer VM initialization time |
Latest software versions available | Additional complexity during deployment |
Smaller base image sizes | Dependency on external package repositories |
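As an illustration of Approach 2, the sketch below generates cloud-init user-data that installs packages on first boot. It is a minimal example, not the configuration used for the image: the package names and the `pip3` command are placeholders, and it relies on cloud-init accepting a JSON body after the `#cloud-config` marker (JSON being parseable as YAML).

```python
import json


def build_user_data(packages, run_commands):
    """Return cloud-init user-data that installs packages on first boot."""
    config = {
        # Refresh the package index before installing anything.
        "package_update": True,
        # Distribution packages to install (placeholder names in the demo below).
        "packages": list(packages),
        # Shell commands run once, late in the first boot.
        "runcmd": list(run_commands),
    }
    # The first line must be the "#cloud-config" marker; the body is JSON,
    # which is assumed here to be accepted by cloud-init's YAML parser.
    return "#cloud-config\n" + json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    print(build_user_data(
        packages=["python3-pip", "python3-venv"],
        run_commands=["pip3 install numpy pandas"],
    ))
```

The generated text can be passed as user-data when the VM is created; the trade-off noted in the table still applies, since every install step here lengthens first boot and depends on the package repositories being reachable.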
3. Testing Tools and Requirements:
Tool | Purpose |
---|---|
KVM & QEMU | For creating and running virtual machines |
Cloud-init | To apply user-data for VM customization |
Guestfs-tools | For modifying and inspecting VM disks |
Libvirt | To manage virtual machines and networks |
OpenStack | For deploying in a cloud environment |
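For local testing with KVM and QEMU, a boot command along the following lines can be assembled. This is a sketch under assumptions: the file names `base-image.qcow2` and `seed.img` are hypothetical, and the seed disk is assumed to be a cloud-init NoCloud image prepared separately.

```python
import shlex


def qemu_boot_command(image_path, seed_path, memory_mb=4096, cpus=2):
    """Build an argv list that boots a cloud image under QEMU with KVM."""
    return [
        "qemu-system-x86_64",
        "-enable-kvm",                 # use hardware virtualization via KVM
        "-m", str(memory_mb),          # guest memory in MiB
        "-smp", str(cpus),             # number of virtual CPUs
        # Root disk: the guest OS base image under test.
        "-drive", f"file={image_path},if=virtio,format=qcow2",
        # Second disk: cloud-init seed carrying user-data and meta-data.
        "-drive", f"file={seed_path},if=virtio,format=raw",
        "-nographic",                  # serial console on the terminal
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(qemu_boot_command("base-image.qcow2", "seed.img")))
```

Building the command as a list keeps it easy to pass to `subprocess.run` later without shell quoting issues.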
GPU Passthrough Testing:
Requirement | Challenge |
---|---|
Host Compatibility | Ensuring the NVIDIA driver works across all supported host OS versions |
VM Performance | Verifying the VM harnesses the full potential of the GPU for AI/ML tasks |
Stability | Consistent GPU behavior during prolonged operations |
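A first smoke test for GPU passthrough is to check, from inside the guest, that the NVIDIA driver reports the device. The sketch below wraps an `nvidia-smi` query; the helper names are illustrative, and it only confirms visibility, not performance or long-run stability.

```python
import shutil
import subprocess

# nvidia-smi query returning one CSV line per GPU, without a header row.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_version,memory.total",
    "--format=csv,noheader",
]


def parse_gpu_report(text):
    """Parse the CSV output of QUERY into a list of dicts, one per GPU."""
    gpus = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split from the right so a comma inside the GPU name is tolerated.
        name, driver, memory = [f.strip() for f in line.rsplit(",", 2)]
        gpus.append(
            {"name": name, "driver_version": driver, "memory_total": memory}
        )
    return gpus


def visible_gpus():
    """Return the GPUs the guest can see, or [] if the driver is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_report(result.stdout)


if __name__ == "__main__":
    print(visible_gpus() or "No NVIDIA GPU visible in this VM")
```

An empty result on a VM that should have a GPU attached points to either a passthrough problem on the host or a driver problem in the guest, which narrows down where to look next.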
4. Maintenance and Update Practices:
Current Scripts Maintenance
Scripting | Description |
---|---|
Bash Scripts | Used for basic automation of tasks including setup and updates |
CLI Tools | KVM, QEMU, and OpenStack command line interfaces for managing VM operations |
Planned Improvement Using Advanced Tools
Tool | Benefit |
---|---|
Packer | For creating identical machine images for multiple platforms |
Ansible | For automating software provisioning, configuration management, and application deployment |
5. Lessons Learned and Compatibility Challenges:
During our testing, we learned that NVIDIA drivers included in a VM image may not be compatible across all hosts due to differences in the kernel or system libraries. For example, a VM image whose NVIDIA drivers work on an Ubuntu 22.04 host may fail on an Ubuntu 18.04 host due to driver incompatibility.
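One way to keep track of this is to enumerate every host and image combination that needs a GPU test. The sketch below is illustrative only: the image releases come from the recommendation above, while the host list (in particular 20.04) is an assumption and should be replaced with the hosts actually deployed.

```python
from itertools import product

# Image releases recommended in this report.
IMAGE_RELEASES = ["18.04", "20.04", "22.04"]
# Host releases: assumed for illustration; adjust to the real host fleet.
HOST_RELEASES = ["18.04", "20.04", "22.04"]


def build_test_matrix(hosts, images):
    """Return one test case per (host release, image release) pair."""
    return [{"host": h, "image": i} for h, i in product(hosts, images)]


if __name__ == "__main__":
    for case in build_test_matrix(HOST_RELEASES, IMAGE_RELEASES):
        print(f"host Ubuntu {case['host']} -> image Ubuntu {case['image']}")
```

Each entry can then be marked as passed or failed after booting the image on that host and running a GPU check.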
The first version of the guest OS base image focuses on providing a ready-to-deploy AI research environment with GPU support. A balance between out-of-the-box usability and customizability is essential. Continuous maintenance and regular updates are crucial to keep up with the latest AI frameworks and drivers.
Recommendations:
Category | Package | Description |
---|---|---|
Data Manipulation | NumPy | Array and matrix processing. |
Data Manipulation | Pandas | Data structures and analysis tools. |
Data Manipulation | Dask | Scalable analytics in Python. |
Machine Learning | Scikit-learn | Data mining and data analysis. |
Machine Learning | TensorFlow | End-to-end open-source machine learning platform. |
Machine Learning | Keras | High-level neural networks API. |
Machine Learning | PyTorch | Machine learning library for tensor computation and neural networks. |
Machine Learning | XGBoost | Gradient boosting framework. |
Machine Learning | LightGBM | Gradient boosting framework that uses tree-based learning. |
Machine Learning | CatBoost | Library for gradient boosting on decision trees. |
Statistical Modeling | Statsmodels | Estimation of statistical models. |
Statistical Modeling | SciPy | Scientific and technical computing. |
Natural Language Processing | NLTK | Toolkit for natural language processing. |
Natural Language Processing | spaCy | Library for advanced natural language processing. |
Natural Language Processing | gensim | Library for unsupervised topic modeling and natural language processing. |
Visualization | Matplotlib | 2D plotting library. |
Visualization | Seaborn | Statistical data visualization based on Matplotlib. |
Visualization | Plotly | Interactive graphing library. |
Visualization | Bokeh | Interactive visualization library. |
Visualization | ggplot (for Python) | Declarative data visualization based on ggplot2 for R. |
Data Importing/Wrangling | Beautiful Soup | Library for pulling data out of HTML and XML files (web scraping). |
Data Importing/Wrangling | requests | HTTP library for making requests. |
Data Importing/Wrangling | Scrapy | Web crawling and scraping framework. |
Development Tools | Jupyter Notebook/Lab | Interactive computing environment. |
Development Tools | IPython | Advanced interactive Python shell. |
Data Storage/Big Data | SQLAlchemy | SQL toolkit and object-relational mapper for Python. |
Data Storage/Big Data | PyMongo | MongoDB driver for Python. |
Data Storage/Big Data | pyspark | Apache Spark Python API. |
Data Storage/Big Data | h5py | Interface to the HDF5 binary data format. |
Deep Learning | FastAI | Simplifies training fast and accurate neural nets. |
Time Series Analysis | Prophet | Forecasting time series data. |
Time Series Analysis | tsfresh | Automatic extraction of features from time series. |
Optimization and Operations Research | PuLP | Linear programming library. |
Data Quality/Cleaning | Great Expectations | Tool for data quality and validation. |
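Whichever subset of these packages ends up pre-installed, the image can be checked by testing whether each one is importable. The sketch below covers only a few entries from the table; the mapping from display name to import name is maintained by hand and is the part most likely to need adjusting.

```python
import importlib.util

# Display name -> top-level import name (a partial, hand-maintained mapping).
IMPORT_NAMES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Scikit-learn": "sklearn",
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Beautiful Soup": "bs4",
    "Matplotlib": "matplotlib",
}


def missing_packages(import_names):
    """Return the display names whose module cannot be found, sorted."""
    return sorted(
        label
        for label, module in import_names.items()
        # find_spec locates a top-level module without importing it.
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages(IMPORT_NAMES)
    print("Missing:", ", ".join(missing) if missing else "none")
```

This only confirms that a package is installed in the active Python environment; it does not verify versions or that GPU support in TensorFlow or PyTorch is working.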
Medical image analysis is a specialized field involving the examination of images from various medical imaging techniques such as X-rays, MRI, CT scans, ultrasound, and more. It requires specialized tools and libraries that can handle the unique formats and analysis needs associated with medical imagery.
These tools feature a range of functionalities required for medical image analysis, such as reading medical images in various formats, processing images, visualizing complex medical data, and performing complex tasks like segmentation, registration, and feature extraction. They are widely used by researchers and professionals in the field of medical imaging and computational radiology.
Tool/Library | Language | Description |
---|---|---|
MONAI (Medical Open Network for AI) | Python | Framework built on PyTorch for deep learning in healthcare imaging, part of the PyTorch Ecosystem. |
Cytomine-python-client | Python | Python client for interacting with Cytomine, an open-source platform for managing and analyzing large-scale biomedical imaging data. |
Tool/Library | Language | Description |
---|---|---|
ITK (Insight Segmentation and Registration Toolkit) | C++ (with bindings for Python) | Advanced toolkit for segmentation and registration. |
VTK (The Visualization Toolkit) | C++ (with bindings for Python) | Software system for 3D computer graphics, image processing, and visualization. |
SimpleITK | C++ (with bindings for Python) | Simplified layer built on top of ITK to facilitate its use in rapid prototyping, education, and research. |
NIfTI | Multiple | Data format for storing neuroimaging data. |
NiBabel | Python | Provides read/write access to NIfTI and other neuroimaging file formats. |
PyDicom | Python | Pure Python package to work with DICOM files. |
MedPy | Python | Library for medical image processing in Python with a focus on image classification. |
MITK (Medical Imaging Interaction Toolkit) | C++ | Framework for developing interactive medical image processing software. |
3D Slicer | C++ (with Python scripting) | Platform for medical image informatics, image processing, and three-dimensional visualization. |
Additionally, some of these tools (like ITK, VTK, and 3D Slicer) can be integrated with AI and machine learning libraries (such as TensorFlow and PyTorch) to develop advanced applications for medical diagnosis, treatment planning, and surgical simulation. MONAI, in particular, is designed to facilitate the application of deep learning to medical research and clinical applications.
We need to develop (and maintain) a good base image for SDS 2.0 guest VMs. It will be used for GPU-enabled VMs that AI research experts will be able to access privately and have root access on.
This card is about creating a first version of this.
This is a "big card" suitable for tackling over summer.
Please feel free to break it down as needed into eg:
etc
The context is that SDS 2.0 will provide several different services and add-on services for processing sensitive data, one of which will be very similar to the current DGX-2 service. This will probably be one of the first services deployed on SDS 2.0, and probably the easiest to onboard current DGX-2 users to.