CIROH-UA / NGIAB-CloudInfra

NextGen In A Box: NextGen Generation Water Modeling Framework for Community Release (Docker version)
https://docs.ciroh.org/docs/products/nextgeninaboxDocker/

GPU Cluster - Wukong and Pantarhei #110

Closed by arpita0911patel 6 months ago

arpita0911patel commented 6 months ago

Moving ticket from https://github.com/AlabamaWaterInstitute/CloudInfra/issues/34

  1. Requester Information: Chaopeng Shen, cxs1024@psu.edu

  2. Project Information:

Phase 1:
A. Improving the integration of ML with physically-based hydrologic and routing modeling via large-scale parameter and structure learning schemes

Prospective phase 2 projects:
B. Developing and benchmarking data assimilation methods on a standardized testbed
C. Pathways to using multi-model mosaics for operational hydrologic prediction
D. CIROH: ML-based Flexible Flood Inundation Mapping and Intercomparison Framework

Currently, three students can use the system. At this time next year, perhaps 4-5 students from PSU could be using these resources if they are available. We of course also have local resources, so it's not all on the UA side.

Provide a brief description of the project and its goals. This can help the infrastructure team understand the context and purpose of the requested resources.

All projects involve training large-scale machine learning models on GPUs and CPUs. Three projects (A, B, C) have similar data flow characteristics (high-throughput GPU jobs). Project D's requirements will differ, as it will more likely involve solving PDEs on the GPU.

  3. Resource Requirements: Specify the compute, storage, and network resources needed for the project. Be as specific as possible about the number of resources required, and any specific configurations or capabilities needed. This information will help the infrastructure team determine the appropriate resources to allocate.

A CPU core:GPU ratio >= 4:1 is preferred so that we maximize GPU efficiency (the CPU feeds data and handles framework overhead). A100s are the preferred GPU (80 GB or 40 GB; 80 GB is preferred, but they are pricey). I describe the growth stages in a box below.

Storage: This is a little hard to predict now; we can start by considering a standard compute node with a 2 TB system drive and 30 TB of storage.

If not too many people are using the GPUs, a queue-less system is preferred. I understand the need for a queue as the number of users grows.

The above is based on my group's use.
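
The sizing arithmetic implied by the 4:1 ratio can be sketched as follows. This is a minimal illustration only: the function names, the default ratio, and the 4- and 8-GPU node sizes used in the example are assumptions taken from the figures discussed in this request, not a hardware specification.

```python
def min_cpu_cores(num_gpus: int, cores_per_gpu: int = 4) -> int:
    """Minimum CPU cores for a node, given a cores-per-GPU ratio.

    The default of 4 reflects the ">= 4:1" preference stated above;
    it is a lower bound, not a recommendation for a specific machine.
    """
    if num_gpus < 0 or cores_per_gpu < 0:
        raise ValueError("counts must be non-negative")
    return num_gpus * cores_per_gpu


def total_gpu_memory_gb(num_gpus: int, mem_per_gpu_gb: int = 80) -> int:
    """Aggregate GPU memory on a node (A100s come in 40 GB and 80 GB variants)."""
    if num_gpus < 0 or mem_per_gpu_gb < 0:
        raise ValueError("values must be non-negative")
    return num_gpus * mem_per_gpu_gb


if __name__ == "__main__":
    # Hypothetical node sizes: a 4-GPU starter node and an 8-GPU node.
    for gpus in (4, 8):
        print(
            f"{gpus} GPUs: at least {min_cpu_cores(gpus)} CPU cores, "
            f"{total_gpu_memory_gb(gpus, 40)}-{total_gpu_memory_gb(gpus, 80)} GB GPU memory"
        )
```

Under these assumptions, a 4-GPU node would need at least 16 CPU cores and an 8-GPU node at least 32.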

Options:

EC2
S3 – public, private, requester pay, bucket name suggestion?
EBS (Amazon Elastic Block Store)
EFS
RDS
VPC (Virtual Private Cloud)
DynamoDB
ECS
EKS (Kubernetes Cluster)
Lambda
Others: please list

I am providing a quote I downloaded from Lambda. They are a bit pricey, and there are cheaper vendors. I also don't mean that we need to get these exact specifications; I am just using it as a starting point for discussion. https://pennstateoffice365-my.sharepoint.com/:b:/g/personal/cxs1024_psu_edu/Eb0Tf_tu6r1CsKxq2oP6Zz4BGVyszjz0OTUCyOCLopF8vg?e=RQPzGR

  4. Timeline: Indicate the expected timeline for the project and when the resources will be needed. This information can help the infrastructure team plan and allocate resources accordingly.

The point is that we can grow the system gradually and adjust to demand, but the system needs to be future-proof (primarily with respect to networking and GPUs).

In the very near term (0-3 months), our jobs will run either independently on each GPU or (more likely) as distributed data parallel jobs, so networking is not critical. A 4-A100 system could help significantly, and we would appreciate NVLink between the GPUs to allow some parallelism. If the 4 A100s can be made available to us, that would already go a long way. I suggest starting there, seeing how we adapt to it, and then seeing what the needs are down the road.

In the mid-term (3 months to 1 year), there is a decent likelihood that we can make use of an 8-GPU node, but let's wait for confirmation. As new people join the project, there will also be more demand simply to run more jobs.

In the long term (>1 year), I see a future where we implement model parallelism (different parts of the model reside on different GPUs) across a few dozen GPUs for large jobs. This would require fast interconnects: (i) NVLink between GPUs and (ii) a faster network connection (InfiniBand). The system can be grown gradually: we can add nodes as demand rises, but the networking components need to be in place.

This is an emerging need and future demand is somewhat uncertain, so I recommend a "low-regret" approach that is future-proof without investing too much at once. An ideal future-proof system would include 8-GPU nodes that are NVLink- and InfiniBand-ready, so that nodes can be connected later. The cost of NVLink and InfiniBand is not excessive, so some planning ahead will allow the system to grow.

  5. Security and Compliance Requirements: If there are any specific security or compliance requirements for the project, these should be clearly stated in the request. This will help ensure that the necessary security measures are in place for the project.

N/A

  6. Budget: Include any budget constraints or requirements for the project. This will help the infrastructure team select the most cost-effective solutions for the project.

I did not include computing costs in my budget.

  7. Approval: Indicate the necessary approval processes or sign-offs required for the request.

TrupeshKumarPatel commented 6 months ago

The Wukong cluster is ready to use. Access is provided to each user through their affiliated email address. Accounts created for this project:

  1. Dr. Chaopeng Shen
  2. Tadd Bindas

arpita0911patel commented 6 months ago

Thank you, Trupesh. Closing this ticket, as Wukong is configured and functional and access has been granted.