NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Restore data after disk impact #188

Closed quasar5935 closed 2 weeks ago

quasar5935 commented 1 month ago

Is there an existing issue for this?

Describe the bug

The validate and rebalance procedures do not restore EC blocks.

Expected Behavior

The validate procedure should restore the missing ec and mt blocks.

Current Behavior

I simulated data loss on a disk. The validate procedure does not restore the ec and mt files.

Steps To Reproduce

  1. Create a new bucket: `ais create ais://kpTest2`

  2. Check bucket properties

    ais bucket props show ais://kpTest2
    PROPERTY             VALUE
    access               GET,HEAD-OBJECT,PUT,APPEND,DELETE-OBJECT,MOVE-OBJECT,PROMOTE,UPDATE-OBJECT,HEAD-BUCKET,LIST-OBJECTS,PATCH,SET-BUCKET-ACL,LIST-BUCKETS,SHOW-CLUSTER,CREATE-BUCKET,DESTROY-BUCKET,MOVE-BUCKET,ADMIN
    backend_bck.name         
    backend_bck.provider         
    checksum.enable_read_range   false
    checksum.type            md5
    checksum.validate_cold_get   true
    checksum.validate_obj_move   false
    checksum.validate_warm_get   false
    created              2024-09-15T20:01:03+03:00
    ec.bundle_multiplier         0
    ec.compression           never
    ec.data_slices           2
    ec.disk_only             false
    ec.enabled           true
    ec.objsize_limit         262144
    ec.parity_slices         1
    features             none
    lru.capacity_upd_time        10m
    lru.dont_evict_time      2h0m
    lru.enabled          false
    mirror.burst_buffer      1024
    mirror.copies            2
    mirror.enabled           false
    present              yes
    provider             ais
    versioning.enabled       true
    versioning.synchronize       false
    versioning.validate_warm_get     false
    write_policy.data        immediate
    write_policy.md          immediate
  3. Create a file: `dd if=/dev/urandom of=big1 bs=2062144 count=1`

  4. Upload the file: `ais object put big1 ais://kpTest2/`

  5. Check file properties

    ais object show --all ais://kpTest2/big1
    PROPERTY     VALUE
    atime        15 Sep 24 20:11 MSK
    checksum     md5[0f0f666688326c94...]
    copies       1 [/ais/sda]
    custom       -
    ec       2:1 (gen 16078549426820157)[encoded]
    location     t[ais-dev-target-3]:mp[/ais/sda, fs=/dev/drbd1300, "/dev/drbd1300"]
    name         ais://kpTest2/big1
    size         1.97MiB
    version      1
  6. Find the ec slice on another target

    
    [root@ais-dev-target-2 kpTest2]# ls -la /ais/sda/@ais/kpTest2/%*/
    /ais/sda/@ais/kpTest2/%ds/:
    total 0
    drwxr-x--- 2 root root  6 Sep 15 20:01 .
    drwxr-x--- 8 root root 72 Sep 15 20:01 ..

    /ais/sda/@ais/kpTest2/%dw/:
    total 0
    drwxr-x--- 2 root root  6 Sep 15 20:01 .
    drwxr-x--- 8 root root 72 Sep 15 20:01 ..

    /ais/sda/@ais/kpTest2/%ec/:
    total 10804
    drwxr-x--- 2 root root       36 Sep 15 20:11 .
    drwxr-x--- 8 root root       72 Sep 15 20:01 ..
    -rw-r----- 1 root root  1031072 Sep 15 20:11 big1
    -rw-r----- 1 root root 10031072 Sep 15 20:11 illionora1

    /ais/sda/@ais/kpTest2/%mt/:
    total 12
    drwxr-x--- 2 root root  57 Sep 15 20:11 .
    drwxr-x--- 8 root root  72 Sep 15 20:01 ..
    -rw-r----- 1 root root 198 Sep 15 20:11 4ggillionora1
    -rw-r----- 1 root root 231 Sep 15 20:11 big1
    -rw-r----- 1 root root 231 Sep 15 20:11 illionora1

    /ais/sda/@ais/kpTest2/%ob/:
    total 4194304
    drwxr-x--- 2 root root         27 Sep 15 20:11 .
    drwxr-x--- 8 root root         72 Sep 15 20:01 ..
    -rw-r----- 1 root root 4294967296 Sep 15 20:11 4ggillionora1

    /ais/sda/@ais/kpTest2/%wk/:
    total 0
    drwxr-x--- 2 root root  6 Sep 15 20:11 .
    drwxr-x--- 8 root root 72 Sep 15 20:01 ..


  7. Delete the ec and mt files of big1 from the target above

    [root@ais-dev-target-2 kpTest2]# rm -f /ais/sda/@ais/kpTest2/%ec/big1 && rm -f /ais/sda/@ais/kpTest2/%mt/big1

  8. Run validate

    [root@ais-dev-admin-9k6lf downloads]# ais storage validate ais://kpTest2
    BUCKET          OBJECTS  MISPLACED  MISSING COPIES
    ais://kpTest2   3        0          0


  9. Check ec and mt: neither exists for file big1

    [root@ais-dev-target-2 kpTest2]# ls -la /ais/sda/@ais/kpTest2/%*/
    /ais/sda/@ais/kpTest2/%ds/:
    total 0
    drwxr-x--- 2 root root  6 Sep 15 20:01 .
    drwxr-x--- 8 root root 72 Sep 15 20:01 ..

    /ais/sda/@ais/kpTest2/%dw/:
    total 0
    drwxr-x--- 2 root root  6 Sep 15 20:01 .
    drwxr-x--- 8 root root 72 Sep 15 20:01 ..

    /ais/sda/@ais/kpTest2/%ec/:
    total 10804
    drwxr-x--- 2 root root       36 Sep 15 20:11 .
    drwxr-x--- 8 root root       72 Sep 15 20:01 ..
    -rw-r----- 1 root root 10031072 Sep 15 20:11 illionora1

    /ais/sda/@ais/kpTest2/%mt/:
    total 12
    drwxr-x--- 2 root root  57 Sep 15 20:11 .
    drwxr-x--- 8 root root  72 Sep 15 20:01 ..
    -rw-r----- 1 root root 198 Sep 15 20:11 4ggillionora1
    -rw-r----- 1 root root 231 Sep 15 20:11 illionora1

    /ais/sda/@ais/kpTest2/%ob/:
    total 4194304
    drwxr-x--- 2 root root         27 Sep 15 20:11 .
    drwxr-x--- 8 root root         72 Sep 15 20:01 ..
    -rw-r----- 1 root root 4294967296 Sep 15 20:11 4ggillionora1

    /ais/sda/@ais/kpTest2/%wk/:
    total 0
    drwxr-x--- 2 root root  6 Sep 15 20:11 .
    drwxr-x--- 8 root root 72 Sep 15 20:01 ..


  10. Try rebalance

    ais cluster rebalance start
    Started global rebalance. To monitor the progress, run 'ais show rebalance'

  11. Rebalance finishes, but the ec and mt files of big1 do not reappear.

    [root@ais-dev-admin-9k6lf ois]# ais show rebalance --all
    REB ID  NODE              OBJECTS RECV  SIZE RECV  OBJECTS SENT  SIZE SENT  START     END       STATE
    g41     ais-dev-target-0  0             0B         4             38.27MiB   21:55:40  21:56:06  Finished
    g41     ais-dev-target-1  1             0B         0             0B         21:55:40  21:56:09  Finished
    g41     ais-dev-target-2  2             144B       0             0B         21:55:40  21:56:08  Finished
    g41     ais-dev-target-3  2             19.13MiB   3             0B         21:55:40  21:56:09  Finished
    g41     ais-dev-target-4  1             0B         2             288B       21:55:40  21:56:09  Finished

    g40     ais-dev-target-1  0             0B         0             0B         21:09:53  21:13:36  Finished

    g38     ais-dev-target-0  0             0B         4             38.27MiB   20:49:18  20:49:44  Finished
    g38     ais-dev-target-2  2             144B       0             0B         20:49:18  20:49:46  Finished
    g38     ais-dev-target-3  2             19.13MiB   3             0B         20:49:18  20:49:47  Finished
    g38     ais-dev-target-4  1             0B         2             288B       20:49:18  20:49:47  Finished

    g37     ais-dev-target-0  2             19.13MiB   4             19.13MiB   20:47:25  20:47:54  Finished
    g37     ais-dev-target-2  3             144B       0             0B         20:47:25  20:47:54  Finished
    g37     ais-dev-target-3  0             0B         4             38.27MiB   20:47:25  20:47:52  Finished
    g37     ais-dev-target-4  1             9.57MiB    4             288B       20:47:25  20:47:54  Finished

    g36     ais-dev-target-3  0             0B         0             0B         20:40:38  20:44:20  Finished

    g35     ais-dev-target-0  0             0B         0             0B         13:03:25  13:03:51  Finished
    g35     ais-dev-target-2  1             144B       0             0B         13:03:25  13:03:53  Finished
    g35     ais-dev-target-4  0             0B         2             288B       13:03:25  13:03:52  Finished

    g34     ais-dev-target-0  0             0B         0             0B         13:02:07  13:02:33  Finished
    g34     ais-dev-target-2  1             144B       0             0B         13:02:07  13:02:36  Finished
    g34     ais-dev-target-4  0             0B         2             288B       13:02:07  13:02:34  Finished

    g33     ais-dev-target-0  0             0B         0             0B         13:00:03  13:00:29  Finished
    g33     ais-dev-target-2  0             0B         2             288B       13:00:03  13:00:30  Finished
    g33     ais-dev-target-4  1             144B       0             0B         13:00:03  13:00:32  Finished

    g31     ais-dev-target-0  0             0B         0             0B         00:28:45  00:29:12  Finished
    g31     ais-dev-target-2  0             0B         0             0B         00:28:45  00:29:12  Finished
    g31     ais-dev-target-4  0             0B         0             0B         00:28:45  00:29:12  Finished

    g27     ais-dev-target-4  0             0B         0             0B         00:24:00  00:24:26  Finished
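
The manual `ls` checks in the steps above could be scripted roughly as follows. This is only a sketch: `check_ec_files` is a hypothetical helper, not part of the `ais` CLI, and it assumes the on-disk layout seen in the listings (`%ec/<object>` for the slice, `%mt/<object>` for its metadata). The size check further assumes that a slice holds the object size divided by `ec.data_slices`, rounded up, which matches the observation here (2062144 / 2 = 1031072 bytes) but is not confirmed for the general case.

```shell
# Hypothetical helper (not part of the ais CLI): report whether the EC slice
# and its metadata file exist for one object under a bucket directory.
# Usage: check_ec_files <bucket_dir> <object> [object_size_bytes] [data_slices]
check_ec_files() {
    local bucket_dir="$1" obj="$2" obj_size="${3:-}" data_slices="${4:-}"
    local slice="$bucket_dir/%ec/$obj"
    local meta="$bucket_dir/%mt/$obj"
    local rc=0 actual expected

    if [ -f "$slice" ]; then
        # Arithmetic expansion strips any padding that wc may print.
        actual=$(( $(wc -c < "$slice") ))
        echo "slice present: $slice ($actual bytes)"
        if [ -n "$obj_size" ] && [ -n "$data_slices" ]; then
            # Assumption: slice size = ceil(object size / data slices).
            expected=$(( (obj_size + data_slices - 1) / data_slices ))
            if [ "$actual" -ne "$expected" ]; then
                echo "slice size mismatch: expected $expected bytes"
                rc=1
            fi
        fi
    else
        echo "slice MISSING: $slice"
        rc=1
    fi

    if [ -f "$meta" ]; then
        echo "metadata present: $meta"
    else
        echo "metadata MISSING: $meta"
        rc=1
    fi

    # Non-zero exit status if anything is missing or has an unexpected size.
    return "$rc"
}
```

On the target from this report it would be called as `check_ec_files /ais/sda/@ais/kpTest2 big1 2062144 2`, before and after running validate or rebalance; a non-zero exit status means the slice or its metadata is still missing.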



### Possible Solution

_No response_

### Additional Information/Context

_No response_

### AIStore build/version

3.23

### Environment details (OS name and version, etc.)

alma9.2, k8s

alex-aizman commented 1 month ago

Short answer: this is a known problem. Or, rather, a piece of missing functionality. Haven't had time to fix it yet.

Unrelated notes and tips:

alex-aizman commented 3 weeks ago

related: d2feb38cfda221a10abaec90453e5eea6c8b591c

alex-aizman commented 2 weeks ago

in progress: b4666d34154a9b75fedd3acd02228890dceccbb9

alex-aizman commented 2 weeks ago

fixed by be7185ac101b40db9034c8850858ff08f14ba4e1 sequence; closing