aws-samples
/
amazon-eks-machine-learning-with-terraform-and-kubeflow
Distributed training using Kubeflow on Amazon EKS
Apache License 2.0
78
stars
43
forks
issues
nemo-megatron container needs fixed version for transformers and datasets
#102
ajayvohra2005
closed
3 months ago
0
neuronx-nemo-megatron examples have runtime error saving checkpoint with save_xser=True
#101
ajayvohra2005
closed
3 months ago
0
Machine learning data process chart needs support for creating inline scripts
#100
ajayvohra2005
closed
4 months ago
0
Add Helm chart for databricks-dolly-15k dataset
#99
ajayvohra2005
closed
4 months ago
0
Need Helm chart for Hugging Face model snapshot download
#98
ajayvohra2005
closed
4 months ago
0
Neuronx distributed Llama2 examples do not load latest checkpoint if it exists
#97
ajayvohra2005
closed
4 months ago
0
Red pajama dataset download link is defunct
#96
ajayvohra2005
closed
4 months ago
0
neuronx-distributed examples save up to the last 10 checkpoints, which consumes too much disk space
#95
ajayvohra2005
closed
4 months ago
0
neuronx-nemo-megatron examples need checkpointing enabled
#94
ajayvohra2005
closed
4 months ago
0
Neuronx distributed Llama2 7B PyTorch Lightning example has fatal error during checkpointing
#93
ajayvohra2005
opened
4 months ago
0
Can I use the Kiali dashboard?
#92
HanHoRang31
opened
5 months ago
1
Creating FSx for Lustre Data Repository Association: BadRequest: Amazon FSx is unable to validate access to the S3 bucket.
#91
HanHoRang31
closed
5 months ago
12
How do I set MAXPOD in EKS?
#90
HanHoRang31
opened
5 months ago
6
Torch distributed RuntimeError: Socket Timeout
#89
ajayvohra2005
opened
5 months ago
1
Machine learning charts for training need to support dynamic EBS volume
#88
ajayvohra2005
closed
5 months ago
0
Data process machine learning charts need to support dynamic EBS volume
#87
ajayvohra2005
closed
5 months ago
0
Need EBS CSI driver storage class with volumeBindingMode WaitForFirstConsumer for EBS volume type gp3
#86
ajayvohra2005
closed
5 months ago
0
Katib UI is not detecting auth request header
#85
ajayvohra2005
closed
5 months ago
0
Allow FSx for Lustre file-system storage capacity to be configurable via Terraform variable
#84
ajayvohra2005
closed
5 months ago
1
Need a way to have Karpenter create single AZ GPU clusters when using EFA
#83
ajayvohra2005
closed
5 months ago
0
The manifest file eks-cluster/utils/attach-pvc.yaml should attach to both efs and fsx pvcs
#82
ajayvohra2005
closed
5 months ago
0
Git clone directory created by machine learning charts is not getting cleaned in case of failure
#81
ajayvohra2005
closed
5 months ago
0
Some training jobs require VPC CIDR Ingress in EKS cluster managed security group
#80
ajayvohra2005
closed
5 months ago
0
Helm chart pipeline step does not complete when the job completes
#79
ajayvohra2005
closed
5 months ago
1
Helm charts pipeline does not show output of helm install command
#78
ajayvohra2005
closed
6 months ago
0
Trainium clusters need to be in a single subnet for EFA collective communications
#77
ajayvohra2005
closed
6 months ago
0
Helm chart kfp component does not need to include default values file
#76
ajayvohra2005
closed
6 months ago
0
In machine-learning charts, pre_script needs to execute after git clone
#75
ajayvohra2005
closed
6 months ago
1
MaskRCNN related helm charts need to be relocated
#74
ajayvohra2005
closed
6 months ago
0
FSx for Lustre automatic export to S3 is not configured correctly
#73
ajayvohra2005
closed
6 months ago
0
Need to refactor kubeflow platform charts into a single sub-folder
#72
ajayvohra2005
closed
6 months ago
1
Need to add script for configuring S3 backend for Terraform state
#71
ajayvohra2005
closed
6 months ago
0
EFA plugin helm chart install values are incorrect
#70
ajayvohra2005
closed
6 months ago
0
build-ecr-images.sh script fails due to AWS login failure
#69
ajayvohra2005
closed
6 months ago
0
Need to refactor top-level container and container-optimized folders under a new top-level containers folder
#68
ajayvohra2005
closed
6 months ago
1
Remove unused files
#67
ajayvohra2005
closed
6 months ago
0
Need to add support for kubeflow components used in training
#66
ajayvohra2005
closed
7 months ago
0
Helm chart pv-fsx template YAML files do not explicitly reference storage class name
#65
ajayvohra2005
closed
7 months ago
0
EFS and FSx for Lustre PVC attach pods get stuck in Terminating state
#64
ajayvohra2005
closed
7 months ago
0
The version of aws-ia/eks-blueprints-addons/aws used is not fixed
#63
ajayvohra2005
closed
7 months ago
0
Karpenter module is being used without a version which is breaking the module
#62
ajayvohra2005
closed
7 months ago
0
Need to refactor Mask RCNN tutorial to separate training and testing Helm charts
#61
ajayvohra2005
closed
7 months ago
0
Refactor kubectl_manifest into helm charts
#60
ajayvohra2005
closed
7 months ago
0
Required file is in legacy folder
#59
ajayvohra2005
closed
8 months ago
0
Need to refactor karpenter and kubeflow components into separate helm charts
#58
ajayvohra2005
closed
8 months ago
0
Need to refactor terraform script into separate files
#57
ajayvohra2005
closed
8 months ago
0
Replace provisioner local-exec with kubectl_manifest or helm_release
#56
ajayvohra2005
closed
8 months ago
0
Add new terraform variables for system node group instance types and instance volume size
#55
ajayvohra2005
closed
8 months ago
0
Use Karpenter to manage accelerator nodes
#54
ajayvohra2005
closed
8 months ago
0
Need to add support for EC2 trn1 and inf2 instance types
#53
ajayvohra2005
closed
8 months ago
0