Demo: First version guest OS base image for SDS 2.0

Report: Development of a Guest OS Base Image for SDS 2.0 with GPU Support

To provide a guest OS base image for GPU-enabled VMs used in AI research on SDS 2.0, I researched and developed a first-version base image that prioritizes stability, usability, and security. The image is tailored for AI researchers who require root access on private VMs. The recommended base OS is the Ubuntu Cloud Image in its 18.04, 20.04, and 22.04 LTS versions.

I. OS base image

What are the other platforms doing?

Cloud platforms such as AWS (Amazon Web Services), Azure, and GCE (Google Compute Engine) provide a wide range of base OS images that users can use to launch virtual machines (VMs). These images are suitable for various applications, including general-purpose computing and specialized tasks like deep learning. Here's a summary of OS images you can find on these platforms:

General-Purpose Base OS Images:

Ubuntu (18.04, 20.04, 22.04 LTS versions are common).
Red Hat Enterprise Linux (RHEL).
CentOS (though CentOS 8 reached end-of-life, CentOS Stream is available).
Amazon Linux (Amazon Linux 2 is commonly used on AWS).
Debian (various stable versions).
Fedora.
SUSE Linux Enterprise Server (SLES).
Windows Server (various versions, like 2019, 2022).
Oracle Linux.

Base OS Images for Deep Learning and Machine Learning:

Cloud providers offer specialized images that come with data science and machine learning tools pre-installed, including deep learning frameworks. These include:

AWS Deep Learning AMIs (Amazon Machine Images): These come with the latest versions of deep learning frameworks like TensorFlow, PyTorch, MXNet, and others, pre-installed on Ubuntu or Amazon Linux.
Azure Data Science Virtual Machines (DSVM): Azure provides VM images specifically tailored for data science and machine learning, with tools and frameworks such as PyTorch and TensorFlow pre-installed.
Google Cloud AI Platform Deep Learning VM Image: This includes a comprehensive suite of machine learning and data science libraries, along with frameworks such as TensorFlow, PyTorch, Keras, and others.
NVIDIA GPU Cloud (NGC): Although not a cloud service provider itself, NVIDIA offers a range of Docker images that are optimized for GPUs and can be used on the above cloud providers. These images include NVIDIA's CUDA toolkit, deep learning frameworks, and HPC applications.

1. Base OS Selection Strategy:

Aspect	Detail
OS Choice	Ubuntu LTS (18.04, 20.04, 22.04)
Considerations	Long-term support, stability, compatibility with AI tools

Ubuntu was chosen for its widespread use in the AI research community, its compatibility with most machine learning frameworks, and its long-term support, which is crucial for a stable research environment.

2. Customization Approaches:

Approach 1: Pre-Installed Package Image

Pro	Con
Immediate readiness	Limited flexibility for customized setups
Consistent environment across users	Regular updates required to the image
Controlled versions of pre-installed software	Larger image sizes

Approach 2: On-Demand Installation Image

Pro	Con
Customizable and scalable	Longer VM initialization time
Latest software versions available	Additional complexity during deployment
Smaller base image sizes	Dependency on external package repositories
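
As a minimal sketch of Approach 2, the on-demand packages could be declared in cloud-init user-data that is generated per VM. The function name `build_user_data` and the package names are illustrative assumptions, not part of the current image. The body is emitted as JSON, which is a subset of YAML, to avoid a third-party YAML dependency; `package_update`, `packages`, and `runcmd` are standard cloud-config keys.

```python
import json


def build_user_data(packages, run_commands=()):
    """Return a cloud-config document as text (hypothetical helper)."""
    config = {
        "package_update": True,      # refresh the package index on first boot
        "packages": list(packages),  # packages installed on first boot
    }
    if run_commands:
        # Each command is an argv-style list executed late in the first boot.
        config["runcmd"] = [list(cmd) for cmd in run_commands]
    return "#cloud-config\n" + json.dumps(config, indent=2) + "\n"


if __name__ == "__main__":
    # Example package names only; the real list depends on the use case.
    print(build_user_data(["python3-pip", "python3-venv"],
                          [["pip3", "install", "numpy"]]))
```

This keeps the base image small at the cost of a longer first boot and a dependency on the package repositories, which matches the trade-offs listed above.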

3. Testing Tools and Requirements:

Tool	Purpose
KVM & QEMU	For creating and running virtual machines
Cloud-init	To apply user-data for VM customization
Guestfs-tools	For modifying and inspecting VM disks
Libvirt	To manage virtual machines and networks
OpenStack	For deploying in a cloud environment
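
The way these tools fit together in a local test loop can be sketched as a set of command builders. This is an illustration under assumptions: the file names are hypothetical, and `cloud-localds` comes from the cloud-image-utils package, which is not listed in the table above. The commands are only assembled and printed, not executed.

```python
import shlex


def overlay_cmd(base_image, overlay):
    """qemu-img: create a copy-on-write overlay so the base image stays untouched."""
    return ["qemu-img", "create", "-f", "qcow2", "-F", "qcow2",
            "-b", base_image, overlay]


def customize_cmd(image, packages):
    """virt-customize (guestfs-tools): install packages into the disk offline."""
    return ["virt-customize", "-a", image, "--install", ",".join(packages)]


def seed_cmd(seed_image, user_data):
    """cloud-localds: build a NoCloud seed disk that carries the user-data."""
    return ["cloud-localds", seed_image, user_data]


if __name__ == "__main__":
    # Hypothetical file names, chosen for illustration only.
    steps = [
        overlay_cmd("ubuntu-base.img", "test-overlay.qcow2"),
        customize_cmd("test-overlay.qcow2", ["python3-pip", "git"]),
        seed_cmd("seed.img", "user-data.yaml"),
    ]
    for step in steps:
        print(shlex.join(step))
```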

GPU Passthrough Testing:

Requirement	Challenge
Host Compatibility	Ensuring the NVIDIA driver works across all supported host OS versions
VM Performance	Verifying the VM harnesses the full potential of the GPU for AI/ML tasks
Stability	Consistent GPU behavior during prolonged operations
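
A first check inside the guest could be a small script that asks `nvidia-smi` which GPUs and driver version the VM actually sees. The sketch below is an assumption about how such a check might look, not the test procedure used for this report. It separates parsing from execution so the parser can be exercised without a GPU.

```python
import csv
import subprocess

# Query fields and CSV output format supported by nvidia-smi.
QUERY = ["nvidia-smi",
         "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"]


def parse_gpu_report(text):
    """Turn nvidia-smi CSV output into a list of dicts, one per GPU."""
    rows = csv.reader(text.strip().splitlines(), skipinitialspace=True)
    return [
        {"name": r[0], "driver_version": r[1], "memory_total": r[2]}
        for r in rows if r
    ]


def query_gpus():
    """Run nvidia-smi in the guest and parse its report."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_report(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        # No driver installed, or the passed-through device is not usable.
        print(f"GPU check failed: {exc}")
```

Running the same script on a schedule during a long training job would be one way to probe the stability requirement, since a dropped device makes the query fail.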

4. Maintenance and Update Practices:

Current Scripts Maintenance

Scripting	Description
Bash Scripts	Used for basic automation of tasks including setup and updates
CLI Tools	KVM, QEMU, and OpenStack command line interfaces for managing VM operations

Planned Improvement Using Advanced Tools

Tool	Benefit
Packer	For creating identical machine images for multiple platforms
Ansible	For automating software provisioning, configuration management, and application deployment
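
To make the planned Packer and Ansible workflow more concrete, the sketch below generates a legacy JSON Packer template that boots an existing cloud image with the QEMU builder and hands it to an Ansible playbook. All file names and the checksum are placeholders, and the template is an assumption about how the build might be structured. A working build would also need a cloud-init seed that provides SSH credentials, which is omitted here.

```python
import json


def packer_template(image_url, image_checksum, playbook):
    """Build a Packer JSON template as a dict (hypothetical helper)."""
    return {
        "builders": [{
            "type": "qemu",
            "iso_url": image_url,            # cloud image used as the source disk
            "iso_checksum": image_checksum,
            "disk_image": True,              # source is a bootable image, not an ISO
            "format": "qcow2",
            "accelerator": "kvm",
            "headless": True,
            "ssh_username": "ubuntu",
        }],
        "provisioners": [{
            "type": "ansible",
            "playbook_file": playbook,       # playbook that installs the AI stack
        }],
    }


if __name__ == "__main__":
    template = packer_template(
        "ubuntu-22.04-cloudimg.img",         # placeholder path
        "sha256:<checksum of the image>",    # placeholder, must be filled in
        "playbooks/base-image.yml",          # placeholder playbook
    )
    print(json.dumps(template, indent=2))
```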

5. Lessons Learned and Compatibility Challenges:

During testing, we learned that NVIDIA drivers baked into a VM image may not be compatible across all hosts because of differences in kernel versions or system libraries. For example, a VM image whose NVIDIA drivers work on an Ubuntu 22.04 host may fail on an Ubuntu 18.04 host.

The first version of the guest OS base image focuses on providing a ready-to-deploy AI research environment with GPU support. A balance between out-of-the-box usability and customizability is essential. Continuous maintenance and regular updates are crucial to keep up with the latest AI frameworks and drivers.

Recommendations:

Rigorously document and standardize both customization approaches so that the choice between them can be made per use case.
Develop and employ automation scripts, potentially adopting infrastructure as code practices.
Diligently test GPU pass-through on VMs, focusing on compatibility across different host OS versions.
Plan for the introduction of Packer and Ansible for advanced and efficient image building and maintenance.

II. Data Science and Machine Learning Tools

Category	Package	Description
Data Manipulation	NumPy	Array and matrix processing library.
	Pandas	Data structures and analysis tools.
	Dask	Scalable analytics in Python.
Machine Learning	Scikit-learn	Data mining and data analysis.
	TensorFlow	End-to-end open-source machine learning platform.
	Keras	High-level neural networks API.
	PyTorch	Machine learning library for tensor computation and neural networks.
	XGBoost	Gradient boosting framework.
	LightGBM	Gradient boosting framework that uses tree-based learning.
	CatBoost	Gradient boosting on decision trees library.
Statistical Modeling	Statsmodels	Estimation of statistical models.
	SciPy	Scientific and technical computing.
Natural Language Processing	NLTK	Toolkit for natural language processing.
	spaCy	Library for advanced natural language processing.
	gensim	Library for unsupervised topic modeling and natural language processing.
Visualization	Matplotlib	2D plotting library.
	Seaborn	Statistical data visualization based on Matplotlib.
	Plotly	Interactive graphing library.
	Bokeh	Interactive visualization library.
	ggplot (for Python)	Declarative data visualization based on ggplot2 for R.
Data Importing/Wrangling	Beautiful Soup	Library for pulling data out of HTML and XML files (web scraping).
	requests	HTTP library for making requests.
	Scrapy	Web crawling & scraping framework.
Development Tools	Jupyter Notebook/Lab	Interactive computing environment.
	IPython	Advanced interactive Python shell.
Data Storage/Big Data	SQLAlchemy	SQL toolkit and object-relational mapping for Python.
	PyMongo	MongoDB driver for Python.
	pyspark	Apache Spark Python API.
	h5py	Interface to HDF5 binary data format.
Deep Learning	FastAI	Simplifies training fast and accurate neural nets.
Time Series Analysis	Prophet	Forecasting time series data.
	tsfresh	Automatic extraction of features from time series.
Optimization and Operations Research	PuLP	Linear programming library.
Data Quality/Cleaning	Great Expectations	Tool for data quality and validation.
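
If the pre-installed approach is used, the image can be smoke-tested by checking that each package in the table is importable. The sketch below covers only a subset of the table. The mapping from display name to import name is the part that most often trips people up (for example, Scikit-learn is imported as `sklearn`). The helper name `check_packages` is an assumption for illustration.

```python
import importlib.util

# Display name -> top-level import name (subset of the table above).
IMPORT_NAMES = {
    "NumPy": "numpy",
    "Pandas": "pandas",
    "Scikit-learn": "sklearn",
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Beautiful Soup": "bs4",
    "Great Expectations": "great_expectations",
}


def check_packages(import_names):
    """Return {display name: True/False} without importing the packages."""
    return {
        label: importlib.util.find_spec(module) is not None
        for label, module in import_names.items()
    }


if __name__ == "__main__":
    for label, present in check_packages(IMPORT_NAMES).items():
        print(f"{label:20s} {'ok' if present else 'MISSING'}")
```

Using `find_spec` rather than a real import keeps the check fast, although it only confirms that a package is installed, not that it loads correctly against the GPU driver.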

III. Tools Specific for Medical Imaging

Medical image analysis is a specialized field involving the examination of images from various medical imaging techniques such as X-rays, MRI, CT scans, ultrasound, and more. It requires specialized tools and libraries that can handle the unique formats and analysis needs associated with medical imagery.

The tools listed below cover the functionality that medical image analysis requires: reading images in various formats, processing them, visualizing complex medical data, and performing tasks such as segmentation, registration, and feature extraction. They are widely used by researchers and professionals in medical imaging and computational radiology.

Tool/Library	Language	Description
MONAI (Medical Open Network for AI)	Python	Framework built on PyTorch for deep learning in healthcare imaging, part of the PyTorch Ecosystem.
Cytomine-python-client	Python	Python client for interacting with Cytomine, an open-source platform for managing and analyzing large-scale biomedical imaging data.

Tool/Library	Language	Description
ITK (Insight Segmentation and Registration Toolkit)	C++ (with bindings for Python)	Advanced toolkit for segmentation and registration.
VTK (The Visualization Toolkit)	C++ (with bindings for Python)	Software system for 3D computer graphics, image processing, and visualization.
SimpleITK	C++ (with bindings for Python)	Simplified layer built on top of ITK to facilitate its use in rapid prototyping, education, and research.
NIfTI	Multiple	Data format for storing neuroimaging data.
NiBabel	Python	Provides read/write access to NIfTI and other neuroimaging file formats.
PyDicom	Python	Pure Python package to work with DICOM files.
MedPy	Python	Library for medical image processing in Python with a focus on image classification.
MITK (Medical Imaging Interaction Toolkit)	C++	Framework for developing interactive medical image processing software.
3D Slicer	C++ (with Python scripting)	Platform for medical image informatics, image processing, and three-dimensional visualization.

Additionally, some of these tools (like ITK, VTK, and 3D Slicer) can be integrated with AI and machine learning libraries (such as TensorFlow and PyTorch) to develop advanced applications for medical diagnosis, treatment planning, and surgical simulation. MONAI, in particular, is designed to facilitate the application of deep learning to medical research and clinical applications.
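
To illustrate what "handling the unique formats" means in practice, the sketch below reads a few fields from a NIfTI-1 header using only the standard library. It is a simplified illustration, assuming an uncompressed header passed in as bytes. In real work NiBabel or SimpleITK would be used instead, and the function name `read_nifti1_header` is hypothetical.

```python
import struct

NIFTI1_HEADER_SIZE = 348  # fixed header size defined by the NIfTI-1 format


def read_nifti1_header(raw):
    """Extract shape, datatype code, and bits per voxel from a NIfTI-1 header."""
    if len(raw) < NIFTI1_HEADER_SIZE:
        raise ValueError("a NIfTI-1 header needs at least 348 bytes")
    # sizeof_hdr (offset 0) must equal 348; trying both byte orders
    # reveals the endianness the file was written with.
    for endian in ("<", ">"):
        (sizeof_hdr,) = struct.unpack_from(endian + "i", raw, 0)
        if sizeof_hdr == NIFTI1_HEADER_SIZE:
            break
    else:
        raise ValueError("not a NIfTI-1 header")
    magic = raw[344:348]
    if magic not in (b"n+1\x00", b"ni1\x00"):
        raise ValueError("unexpected magic string")
    dim = struct.unpack_from(endian + "8h", raw, 40)   # dim[0] = number of axes
    datatype, bitpix = struct.unpack_from(endian + "2h", raw, 70)
    ndim = dim[0]
    if not 1 <= ndim <= 7:
        raise ValueError("invalid number of dimensions")
    return {
        "shape": dim[1:1 + ndim],
        "datatype": datatype,               # e.g. 16 is 32-bit float
        "bitpix": bitpix,
        "single_file": magic == b"n+1\x00",  # header and data in one .nii file
    }


if __name__ == "__main__":
    # Synthetic header for demonstration; no real scan data is involved.
    hdr = bytearray(NIFTI1_HEADER_SIZE)
    struct.pack_into("<i", hdr, 0, NIFTI1_HEADER_SIZE)
    struct.pack_into("<8h", hdr, 40, 3, 64, 64, 32, 1, 1, 1, 1)
    struct.pack_into("<2h", hdr, 70, 16, 32)
    hdr[344:348] = b"n+1\x00"
    print(read_nifti1_header(bytes(hdr)))
```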
