NBISweden / aida-data-hub

AIDA Data Hub Scrum team board

Demo: First version guest OS base image for SDS 2.0 #578

Open yohell opened 2 months ago

yohell commented 2 months ago

We need to develop (and maintain) a good base image for SDS 2.0 guest VMs. It will be used for GPU-enabled VMs that AI research experts will be able to access in private and be root on.

This card is about creating a first version of this.

This is a "big card" suitable for tackling over summer.

Please feel free to break it down as needed into eg:

etc

The context is that SDS 2.0 will provide several different services and add-on services for processing sensitive data, one of which will be very similar to the current DGX-2 service. This will probably be one of the first services deployed on SDS 2.0, and probably the easiest to onboard current DGX-2 users to.

minha12 commented 2 weeks ago

Report: Development of a Guest OS Base Image for SDS 2.0 with GPU Support

In response to the request for a guest OS base image for GPU-enabled AI research VMs on SDS 2.0, I have researched and developed a first-version base image designed for stability, usability, and security. The image is tailored for AI researchers who require root access on private VMs. Our recommended base OS is the Ubuntu Cloud Image in its 18.04, 20.04, and 22.04 LTS versions.


I. OS base image

What are the other platforms doing?

Cloud platforms such as AWS (Amazon Web Services), Azure, and GCE (Google Compute Engine) provide a wide range of base OS images that users can use to launch virtual machines (VMs). These images are suitable for various applications, including general-purpose computing and specialized tasks like deep learning. Here's a summary of OS images you can find on these platforms:

General-Purpose Base OS Images:

  1. Ubuntu (18.04, 20.04, 22.04 LTS versions are common).
  2. Red Hat Enterprise Linux (RHEL).
  3. CentOS (though CentOS 8 reached end-of-life, CentOS Stream is available).
  4. Amazon Linux (Amazon Linux 2 is commonly used on AWS).
  5. Debian (various stable versions).
  6. Fedora.
  7. SUSE Linux Enterprise Server (SLES).
  8. Windows Server (various versions, like 2019, 2022).
  9. Oracle Linux.

Base OS Images for Deep Learning and Machine Learning:

Cloud providers offer specialized images that come with data science and machine learning tools pre-installed, including deep learning frameworks. These include:

  1. AWS Deep Learning AMIs (Amazon Machine Images): These come with the latest versions of deep learning frameworks like TensorFlow, PyTorch, MXNet, and others, pre-installed on Ubuntu or Amazon Linux.

  2. Azure Machine Learning VMs: Azure provides VM images specifically tailored for machine learning, including tools and environments like PyTorch, TensorFlow, and others.

  3. Google Cloud AI Platform Deep Learning VM Image: This includes a comprehensive suite of machine learning and data science libraries, along with frameworks such as TensorFlow, PyTorch, Keras, and others.

  4. NVIDIA GPU Cloud (NGC): Although not a cloud service provider itself, NVIDIA offers a range of Docker images that are optimized for GPUs and can be used on the above cloud providers. These images include NVIDIA's CUDA toolkit, deep learning frameworks, and HPC applications.


1. Base OS Selection Strategy:

| Aspect | Detail |
| --- | --- |
| OS Choice | Ubuntu LTS (18.04, 20.04, 22.04) |
| Considerations | Long-term support, stability, compatibility with AI tools |

Ubuntu was chosen for its widespread use in the AI research community, compatibility with most machine learning frameworks, and long-term support which is crucial for a stable research environment.


2. Customization Approaches:

Approach 1: Pre-Installed Package Image

| Pro | Con |
| --- | --- |
| Immediate readiness | Limited flexibility for customized setups |
| Consistent environment across users | Regular updates required to the image |
| Controlled versions of pre-installed software | Larger image sizes |
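As a rough illustration of Approach 1, the sketch below assembles a `virt-customize` command (from the guestfs-tools listed under the testing tools) that bakes packages into an image file. The helper name `build_virt_customize_cmd`, the image filename, and the package list are assumptions made for this example, not the scripts actually used for the report.

```python
import shlex


def build_virt_customize_cmd(image_path, packages, update=True):
    """Return the argv list for a virt-customize run that pre-installs packages."""
    if not packages:
        raise ValueError("at least one package is required")
    cmd = ["virt-customize", "-a", image_path]
    if update:
        # Bring already-installed packages up to date before adding new ones.
        cmd.append("--update")
    # virt-customize takes a comma-separated package list.
    cmd += ["--install", ",".join(packages)]
    return cmd


if __name__ == "__main__":
    # Hypothetical image name and package selection, for illustration only.
    cmd = build_virt_customize_cmd(
        "ubuntu-22.04-base.img", ["python3-pip", "build-essential"]
    )
    print(shlex.join(cmd))
```

The function only builds the command; running it (for example with `subprocess.run(cmd, check=True)`) requires guestfs-tools on the build machine and write access to the image.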

Approach 2: On-Demand Installation Image

| Pro | Con |
| --- | --- |
| Customizable and scalable | Longer VM initialization time |
| Latest software versions available | Additional complexity during deployment |
| Smaller base image sizes | Dependency on external package repositories |
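Approach 2 could be driven by cloud-init user-data that installs packages on first boot. The following is a minimal sketch under that assumption: it emits a `#cloud-config` document using the standard `package_update`, `packages`, and `runcmd` keys. It relies on JSON being a subset of YAML, so no YAML library is needed. The function name, package list, and the `/opt/ai-env` path are hypothetical.

```python
import json


def build_user_data(packages, run_commands=()):
    """Return cloud-init user-data that installs packages at first boot."""
    config = {
        "package_update": True,      # refresh the package index first
        "packages": list(packages),  # distro packages to install
    }
    if run_commands:
        # Each command is an argv list that cloud-init executes directly.
        config["runcmd"] = [list(cmd) for cmd in run_commands]
    # The first line must be the "#cloud-config" marker.
    return "#cloud-config\n" + json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    # Hypothetical package selection and virtualenv path, for illustration only.
    print(
        build_user_data(
            ["python3-venv", "python3-pip"],
            [["python3", "-m", "venv", "/opt/ai-env"]],
        )
    )
```

This keeps the base image small, at the cost noted in the table: every new VM depends on the package repositories being reachable at boot time.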

3. Testing Tools and Requirements:

| Tool | Purpose |
| --- | --- |
| KVM & QEMU | For creating and running virtual machines |
| Cloud-init | To apply user-data for VM customization |
| Guestfs-tools | For modifying and inspecting VM disks |
| Libvirt | To manage virtual machines and networks |
| OpenStack | For deploying in a cloud environment |

GPU Passthrough Testing:

| Requirement | Challenge |
| --- | --- |
| Host Compatibility | Ensuring the NVIDIA driver works across all supported host OS versions |
| VM Performance | Verifying the VM harnesses the full potential of the GPU for AI/ML tasks |
| Stability | Consistent GPU behavior during prolonged operations |
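One simple smoke test for GPU passthrough is to ask `nvidia-smi` inside the guest which devices and driver version it sees. The sketch below is an assumption about how such a check might look, not the test procedure used for this report. It queries the `name`, `driver_version`, and `memory.total` fields in CSV form and parses them into dictionaries.

```python
import csv
import io
import subprocess

QUERY_FIELDS = ["name", "driver_version", "memory.total"]


def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in reader
        if row  # skip blank lines
    ]


def query_gpus():
    """Run nvidia-smi in the guest and return one dict per visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # Missing binary or a driver that fails to talk to the device.
        print(f"No usable GPU visible in this VM: {exc}")
    else:
        for gpu in gpus:
            print(gpu)
```

A check like this only confirms that the device and driver are visible. Performance and long-running stability, the other two rows in the table, would need separate benchmark and soak tests.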

4. Maintenance and Update Practices:

Current Scripts Maintenance

| Scripting | Description |
| --- | --- |
| Bash Scripts | Used for basic automation of tasks including setup and updates |
| CLI Tools | KVM, QEMU, and OpenStack command line interfaces for managing VM operations |

Planned Improvement Using Advanced Tools

| Tool | Benefit |
| --- | --- |
| Packer | For creating identical machine images for multiple platforms |
| Ansible | For automating software provisioning, configuration management, and application deployment |

5. Lessons Learned and Compatibility Challenges:

During our testing, we learned that NVIDIA drivers included in a VM image may not be compatible across all hosts due to differences in the kernel or system libraries. For example, a VM image whose NVIDIA drivers work on an Ubuntu 22.04 host may fail on an Ubuntu 18.04 host due to driver incompatibility.


The first version of the guest OS base image focuses on providing a ready-to-deploy AI research environment with GPU support. A balance between out-of-the-box usability and customizability is essential. Continuous maintenance and regular updates are crucial to keep up with the latest AI frameworks and drivers.

Recommendations:


II. Data Science and Machine Learning Tools

| Category | Package | Description |
| --- | --- | --- |
| Data Manipulation | NumPy | Array and matrix processing tool. |
| | Pandas | Data structures and analysis tools. |
| | Dask | Scalable analytics in Python. |
| Machine Learning | Scikit-learn | Data mining and data analysis. |
| | TensorFlow | End-to-end open-source machine learning platform. |
| | Keras | High-level neural networks API. |
| | PyTorch | Machine learning library for tensor computation and neural networks. |
| | XGBoost | Gradient boosting framework. |
| | LightGBM | Gradient boosting framework that uses tree-based learning. |
| | CatBoost | Gradient boosting on decision trees library. |
| Statistical Modeling | Statsmodels | Estimation of statistical models. |
| | SciPy | Scientific and technical computing. |
| Natural Language Processing | NLTK | Toolkit for natural language processing. |
| | spaCy | Library for advanced natural language processing. |
| | gensim | Library for unsupervised topic modeling and natural language processing. |
| Visualization | Matplotlib | 2D plotting library. |
| | Seaborn | Statistical data visualization based on Matplotlib. |
| | Plotly | Interactive graphing library. |
| | Bokeh | Interactive visualization library. |
| | ggplot (for Python) | Declarative data visualization based on ggplot2 for R. |
| Data Importing/Wrangling | Beautiful Soup | Library for pulling data out of HTML and XML files (web scraping). |
| | requests | HTTP library for making requests. |
| | Scrapy | Web crawling & scraping framework. |
| Development Tools | Jupyter Notebook/Lab | Interactive computing environment. |
| | IPython | Advanced interactive Python shell. |
| Data Storage/Big Data | SQLAlchemy | SQL toolkit and object-relational mapping for Python. |
| | PyMongo | MongoDB driver for Python. |
| | pyspark | Apache Spark Python API. |
| | h5py | Interface to HDF5 binary data format. |
| Deep Learning | FastAI | Simplifies training fast and accurate neural nets. |
| Time Series Analysis | Prophet | Forecasting time series data. |
| | tsfresh | Automatic extraction of features from time series. |
| Optimization and Operations Research | PuLP | Linear programming library. |
| Data Quality/Cleaning | Great Expectations | Tool for data quality and validation. |
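If a subset of these packages ends up pre-installed in the image, a short check can confirm that they are importable after a build. The sketch below is illustrative only: the mapping covers a handful of the packages above, and which ones the image should actually ship is an open choice. It uses `importlib.util.find_spec`, which looks a module up without importing it.

```python
import importlib.util

# Display name -> top-level import name, for a sample of the table above.
IMPORT_NAMES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Scikit-learn": "sklearn",
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Matplotlib": "matplotlib",
    "Beautiful Soup": "bs4",
    "h5py": "h5py",
}


def missing_packages(import_names):
    """Return the display names whose module cannot be found, sorted."""
    return sorted(
        label
        for label, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )


if __name__ == "__main__":
    missing = missing_packages(IMPORT_NAMES)
    if missing:
        print("Missing from this image:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

Note that the import name often differs from the package name (`sklearn` for Scikit-learn, `bs4` for Beautiful Soup), which is why the check uses an explicit mapping.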

III. Tools Specific for Medical Imaging

Medical image analysis is a specialized field involving the examination of images from various medical imaging techniques such as X-rays, MRI, CT scans, ultrasound, and more. It requires specialized tools and libraries that can handle the unique formats and analysis needs associated with medical imagery.

These tools feature a range of functionalities required for medical image analysis, such as reading medical images in various formats, processing images, visualizing complex medical data, and performing complex tasks like segmentation, registration, and feature extraction. They are widely used by researchers and professionals in the field of medical imaging and computational radiology.

| Tool/Library | Language | Description |
| --- | --- | --- |
| MONAI (Medical Open Network for AI) | Python | Framework built on PyTorch for deep learning in healthcare imaging, part of the PyTorch Ecosystem. |
| Cytomine-python-client | Python | Python client for interacting with Cytomine, an open-source platform for managing and analyzing large-scale biomedical imaging data. |

| Tool/Library | Language | Description |
| --- | --- | --- |
| ITK (Insight Segmentation and Registration Toolkit) | C++ (with bindings for Python) | Advanced toolkit for segmentation and registration. |
| VTK (The Visualization Toolkit) | C++ (with bindings for Python) | Software system for 3D computer graphics, image processing, and visualization. |
| SimpleITK | C++ (with bindings for Python) | Simplified layer built on top of ITK to facilitate its use in rapid prototyping, education, and research. |
| NIfTI | Multiple | Data format for storing neuroimaging data. |
| NiBabel | Python | Provides read/write access to NIfTI and other neuroimaging file formats. |
| PyDicom | Python | Pure Python package to work with DICOM files. |
| MedPy | Python | Library for medical image processing in Python with a focus on image classification. |
| MITK (Medical Imaging Interaction Toolkit) | C++ | Framework for developing interactive medical image processing software. |
| 3D Slicer | C++ (with Python scripting) | Platform for medical image informatics, image processing, and three-dimensional visualization. |

Additionally, some of these tools (like ITK, VTK, and 3D Slicer) can be integrated with AI and machine learning libraries (such as TensorFlow and PyTorch) to develop advanced applications for medical diagnosis, treatment planning, and surgical simulation. MONAI, in particular, is designed to facilitate the application of deep learning to medical research and clinical applications.
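To make the "unique formats" point concrete, the sketch below identifies two of the formats mentioned above by their on-disk signatures, using only the standard library. It assumes an uncompressed file: DICOM files carry the marker `DICM` after a 128-byte preamble, and NIfTI-1 files have a 348-byte header that starts with the header size and ends with a magic string. The function name is hypothetical, and real pipelines would normally rely on PyDicom or NiBabel instead.

```python
import struct

NIFTI1_HEADER_SIZE = 348


def sniff_medical_format(path):
    """Return 'dicom', 'nifti1', or 'unknown' based on file signatures."""
    with open(path, "rb") as fh:
        head = fh.read(NIFTI1_HEADER_SIZE)

    # DICOM: 128-byte preamble followed by the ASCII marker "DICM".
    if head[128:132] == b"DICM":
        return "dicom"

    # NIfTI-1: first int32 is the header size (348, either byte order),
    # and the last four header bytes hold the magic string.
    if len(head) == NIFTI1_HEADER_SIZE and head[344:348] in (b"n+1\x00", b"ni1\x00"):
        for byte_order in ("<", ">"):
            (sizeof_hdr,) = struct.unpack(byte_order + "i", head[:4])
            if sizeof_hdr == NIFTI1_HEADER_SIZE:
                return "nifti1"

    return "unknown"
```

Gzip-compressed NIfTI files (`.nii.gz`) are not handled here; they would need to be opened with `gzip.open` first.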