IBM/autopilot
A tool to detect infrastructure issues on cloud native AI systems
Apache License 2.0
16 stars
13 forks
issues
Install refactor and ping-iperf bugfix
#54
cmisale
closed
3 days ago
1
Update HEALTH_CHECKS.md
#53
eburhansjah
closed
2 weeks ago
0
[Feature] TODO list for 2.0 release
#52
cmisale
opened
4 weeks ago
0
Update Version and README to Fix Current Release
#51
cmisale
closed
4 weeks ago
0
Fix issue #49: Updated command in Step 2 of Install section in README
#50
Anish701
closed
1 month ago
0
[Bug] Error in Step 2 of Install section in README
#49
Anish701
closed
1 month ago
1
[Bug] Install fails with --namespace helm install parameter if namespace doesn't exist
#48
cmisale
opened
1 month ago
2
[Feature] Refactor Network Workload Module
#47
Vezio
opened
1 month ago
0
[Feature] Enhance Error Handling in Network Tests
#46
Vezio
opened
1 month ago
0
[Templates] Adds Git Issue and Pull Request Templates
#45
Vezio
closed
1 month ago
0
Major update to alert manager
#44
cmisale
closed
1 month ago
0
Network Performance and Validation Test
#43
Vezio
closed
1 month ago
0
Adds TESTING node label during invasive checks
#42
jimcadden
closed
1 month ago
0
New node label for reserving nodes
#41
cmisale
closed
1 month ago
1
[Network] Performance Workload and Stress Testing
#40
Vezio
closed
1 month ago
0
Add probes for liveness and readiness
#39
cmisale
closed
2 months ago
0
Node label add and ping gauge update
#38
cmisale
closed
2 months ago
0
bump up release minor version
#37
cmisale
closed
3 months ago
0
Node Labeling for General GPU Health
#36
cmisale
closed
3 months ago
0
Enabling GPU-less Autopilot
#35
cmisale
closed
3 months ago
0
Prepare for release v1.7.0
#34
cmisale
closed
4 months ago
0
Ping and PVC patch
#33
cmisale
closed
4 months ago
0
bugfix: ping test
#32
cmisale
closed
4 months ago
0
PVC create-delete health check
#31
cmisale
closed
5 months ago
0
Fix a typo.
#30
egallen
closed
5 months ago
0
[DO NOT MERGE] NCCL test support
#29
jimcadden
opened
5 months ago
0
Entrypoint for smoke tests
#28
cmisale
closed
2 months ago
1
Fix to CPU model parsing
#27
cmisale
closed
6 months ago
1
string parse bugfix
#26
cmisale
closed
6 months ago
2
Chart release patch
#25
cmisale
closed
6 months ago
1
[DO NOT MERGE] Resource functions and handler for NCCL tests
#24
jimcadden
closed
5 months ago
1
Bump golang.org/x/net from 0.19.0 to 0.23.0 in /autopilot-daemon
#23
dependabot[bot]
closed
6 months ago
0
add CPU & GPU data columns to Prometheus
#22
remolina
closed
6 months ago
8
Minor updates to Github Actions workflows
#21
cmisale
closed
6 months ago
0
killall5 feature for nvidia-smi failures
#20
cmisale
closed
6 months ago
0
export metrics produced by dcgm pods
#19
cmisale
closed
6 months ago
0
patch dcgm json parsing
#18
cmisale
closed
7 months ago
0
updates to helm for new release
#17
cmisale
closed
7 months ago
0
integration with previous PR and some python3.8 adjustments
#16
cmisale
closed
7 months ago
0
optionally observe only a subset of dcgm test results based on env variable
#15
lasch
closed
7 months ago
0
Bump google.golang.org/protobuf from 1.31.0 to 1.33.0 in /autopilot-daemon
#14
dependabot[bot]
closed
7 months ago
0
Step missing from the Helm instructions?
#13
jimcadden
closed
7 months ago
2
Bump google.golang.org/protobuf from 1.30.0 to 1.33.0 in /autopilot-daemon
#12
dependabot[bot]
closed
7 months ago
0
Is the `all` target in `runAllTestsLocal` accurate and/or necessary?
#11
jimcadden
closed
6 months ago
3
Configure periodic checks via env variable
#10
jimcadden
closed
7 months ago
1
Make periodic checks configurable
#9
jimcadden
closed
7 months ago
11
First Invasive check through external K8s Job - dcgmi
#8
cmisale
closed
7 months ago
1
CPU & GPU data columns in the Prometheus
#7
jimcadden
closed
6 months ago
14
Module update and cleanup
#6
cmisale
closed
8 months ago
0
Operator deploy
#5
jimcadden
opened
8 months ago
0