-
**Describe the bug**
We'd like to use FSDP when doing LoRA fine-tuning with large LLMs. In our experiments, we noticed that the training loss differs substantially between runs with FSDP and runs without it. With FSDP, …
zpx01 updated 6 months ago
-
I've been trying to think through backing up a cluster as a whole, and am curious whether this is something the tool already accounts for that I'm just missing. We have a cluster of three ClickHouse server…
-
Hi, thank you for your work! I'm currently trying to implement distributed training with InstructGPT and MiniGPT-4, using Hugging Face's accelerate library. However, I've encountered several problems …
-
In order for an ndarray/dataframe system to interact with a variety of frameworks in a distributed environment (such as clusters of workstations), a stable description of the distribution characteristi…
-
Let's use this issue to gather instructions on how to profile one's CPU/NVMe setup.
(@tjruwase and I have been editing this post)
You need to do this on every new CPU/NVMe setup in order to confi…
-
**Related doc issues:** https://github.com/tarantool/doc/issues?q=is%3Aissue+is%3Aopen+config+label%3A3.0+label%3Aconfig+label%3Avshard
**Product:** Tarantool
**Since:** 3.0
**Root document:** ht…
-
### Feature Description
## Watermarking
Provide an option for specifying a time interval that enables sending periodic per-shard binlog events that indicate that all binlogs for that shard up to …
-
## Description
We found that our backup create_remote upload always fails, and the following log message is shown:
```
error metrics.ExecuteWithMetrics(create_remote) return error: b.getTablesForUploadDiffRemot…
-
pkg/ccl/logictestccl/tests/5node/5node_test.TestCCLLogic_partitioning_hash_sharded_index_query_plan [failed](https://tanzanite.cluster.engflow.com/invocations/default/cb776ceb-d010-496c-abe7-fb6410e6e…
-
In the [context of discussions](https://discord.com/channels/1110799176264056863/1166349141102837882/1166670331151388784) around named sharding and peer management for the Waku network, it is proposed…