NVIDIA / aistore

AIStore: scalable storage for AI applications
https://aistore.nvidia.com
MIT License

Extreme delays while reading objects for the first time #171

Closed satyatumati closed 6 months ago

satyatumati commented 6 months ago

Is there an existing issue for this?

Describe the bug

As the benchmark statistics show, first-time reads of larger objects through AIStore are very slow, while writes are faster. Reads of cached objects are also fast (~5 ms), so this is unlikely to be a network issue. The metrics are averages over 100 operations. Are there any AIStore metrics we can monitor to understand this? We are using version 0.96 of ais-k8s.

Using the below for read and write.

bio = BytesIO()
client.bucket(bucket, provider="aws").object(path).get(bio)

bio = BytesIO(tensors.tobytes())
bio.seek(0)
client.bucket(bucket, provider="s3").object(path).put_content(bio)
[Two screenshots of the benchmark statistics, taken 2024-02-13; images not reproduced]

Expected Behavior

Read times < 1.5x

Current Behavior

Slow reads

Steps To Reproduce

Read a large object from an S3-backed bucket for the first time.

Possible Solution

No response

Additional Information/Context

No response

AIStore build, Python SDK version or Docker image and build tag? (latest, v3.22, ...)

0.96 of the ais-k8s

Environment details (OS name and version, etc.)

k8s

alex-aizman commented 6 months ago
  1. The first GET (we call it a "cold GET") is a bit tricky. Use ais show performance to watch counters and latencies; see its help for details. There is enough information there to differentiate between (a) reading the object from the remote backend and storing it locally and (b) the entire end-to-end GET.
  2. https://github.com/NVIDIA/aistore/blob/main/ais/tgtfcold.go#L87
  3. There's a totally new development called "blob downloader", intended for objects of 1GB and greater. It will be part of the upcoming v3.22.
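
To see the cold-versus-warm difference from the client side, one option is to time the first and second read of each object separately instead of averaging all reads together. The following is a minimal sketch, not part of the AIStore SDK: `time_reads` and `FakeStore` are hypothetical names, and `FakeStore` only simulates a slow first read so that the example runs without a cluster. Against a real cluster, `fetch` would wrap whatever GET call the application already uses.

```python
import time
from statistics import mean
from typing import Callable, Dict, List


def time_reads(fetch: Callable[[str], bytes], names: List[str]) -> Dict[str, float]:
    """Time the first (cold) and second (warm) read of each object, in seconds."""
    cold: List[float] = []
    warm: List[float] = []
    for name in names:
        # First pass over a name lands in `cold`, second pass in `warm`.
        for samples in (cold, warm):
            start = time.perf_counter()
            fetch(name)
            samples.append(time.perf_counter() - start)
    return {"cold_avg_s": mean(cold), "warm_avg_s": mean(warm)}


class FakeStore:
    """Hypothetical stand-in for a cluster: the first read of a name is slow."""

    def __init__(self, cold_delay_s: float = 0.02) -> None:
        self._cache: Dict[str, bytes] = {}
        self._cold_delay_s = cold_delay_s

    def get(self, name: str) -> bytes:
        if name not in self._cache:
            # Simulates fetching the object from the remote backend.
            time.sleep(self._cold_delay_s)
            self._cache[name] = b"x" * 1024
        return self._cache[name]


store = FakeStore()
result = time_reads(store.get, [f"obj-{i}" for i in range(5)])
print(result)
```

This only works as a cold-read measurement if each object name has not been read through the cluster before; otherwise both samples are warm.
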
satyatumati commented 6 months ago

Thanks @alex-aizman. We are running 3.12, which doesn't have the performance command; however, we see this in the stats:

satyat@vm-satyat:~$ ./ais cluster show stats
PROPERTY             VALUE
proxy.get.n          320575
proxy.get.ns             0
proxy.kalive.ns          1m30.949093416s
proxy.lst.ns             0
proxy.put.n          133853
proxy.up.ns.time         9d
target.append.ns         0
target.disk.sdb.avg.rsize    0
target.disk.sdb.avg.wsize    0
target.disk.sdb.read.bps     0
target.disk.sdb.util         0
target.disk.sdb.write.bps    0
target.dl.ns             0
target.dsort.creation.req.ns     0
target.dsort.creation.resp.ns    0
target.err.get.n         4619742
target.err.head.n        6564402
target.err.post.n        138
target.err.put.n         11112070
target.get.bps           0
target.get.cold.n        4562620
target.get.cold.size         6.14TiB
target.get.n             15682743
target.get.ns            9h29m54.645966205s
target.get.redir.ns      13h7m3.028661392s
target.kalive.ns         2m38.09997303s
target.lst.n             16979
target.lst.ns            1m23.570495732s
target.put.n             6548666
target.put.ns            6h23m32.128035379s
target.put.redir.ns      16h48m18.378593527s
target.streams.in.obj.n      148843
target.streams.in.obj.size   6.72GiB
target.streams.out.obj.n     166207
target.streams.out.obj.size  7.51GiB
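
A few rough averages can be derived from these counters. The sketch below assumes that each `.ns` value is a cumulative total over the operations counted by the matching `.n` value; the thread does not confirm this, so treat the results as indicative only. The helper names `parse_duration` and `parse_size` are hypothetical and handle only the units that appear in the selected rows.

```python
import re

# Values copied from the stats output above.
stats = {
    "target.get.n": "15682743",
    "target.get.ns": "9h29m54.645966205s",
    "target.get.cold.n": "4562620",
    "target.get.cold.size": "6.14TiB",
}

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0}
_SIZE_BYTES = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_duration(text: str) -> float:
    """Convert a duration such as '9h29m54.6s' to seconds (h, m, s units only)."""
    if re.fullmatch(r"(?:\d+(?:\.\d+)?[hms])+", text) is None:
        raise ValueError(f"unsupported duration: {text!r}")
    parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", text)
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)


def parse_size(text: str) -> float:
    """Convert a size such as '6.14TiB' to bytes (binary units only)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(KiB|MiB|GiB|TiB)", text)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return float(match.group(1)) * _SIZE_BYTES[match.group(2)]


get_n = int(stats["target.get.n"])
cold_n = int(stats["target.get.cold.n"])

# Assumption: target.get.ns is the summed latency of all counted GETs.
avg_get_ms = parse_duration(stats["target.get.ns"]) / get_n * 1e3
cold_fraction = cold_n / get_n
avg_cold_mib = parse_size(stats["target.get.cold.size"]) / cold_n / 2**20

print(f"average GET latency: {avg_get_ms:.2f} ms")
print(f"cold GET fraction:   {cold_fraction:.1%}")
print(f"average cold object: {avg_cold_mib:.2f} MiB")
```

An aggregate average like this mixes cold and warm GETs of all object sizes, so it cannot by itself isolate the latency of cold reads of large objects.
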
alex-aizman commented 6 months ago

Well, that doesn't help. In any event, 3.12 is too old and would normally require a paid-for support option, which is outside the scope here: we talk development, the main branch, and the previous release.

Closing.