-
Hi,
I am trying to run the example script provided for the llama model for inference only. Since the repository is going through a migration and a lot of changes, I went back and installed the stable `v0.2…
-
Apologies if this has already been requested, or is clearly impossible for some reason. My Dask knowledge isn't super deep.
I know that OSErrors, which can occur due to a disk being full, are handl…
-
At some point the tipping point will be passed and talented open source hackers will be able to solve costly problems for orgs without having to take employment there. Something is going to reduce the…
-
I used commit 05ea5a6929ccdf28870591b56aac81795203fc23 to install Julia 1.0.0 and TensorFlow in a Docker container. Then I tried running the "Logistic Regression" example in the resulting terminal.
…
-
The `__hash__` of a `WorkerState` object is just its address: https://github.com/dask/distributed/blob/33fc50ca9817216bb4105b68f5e0859ebfb80fdb/distributed/scheduler.py#L480
As is the equality chec…
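To illustrate the behaviour being described, here is a minimal sketch. It assumes that "address" means the worker's network address string, and it uses a hypothetical `WorkerLike` class (with a made-up `nthreads` field) rather than the real `WorkerState` code. It shows that two distinct objects sharing an address compare equal and collapse to one entry in a set:

```python
class WorkerLike:
    """Hypothetical stand-in for illustration; not the real WorkerState."""

    def __init__(self, address, nthreads):
        self.address = address
        self.nthreads = nthreads

    def __hash__(self):
        # Assumption: the hash is derived from the address string only.
        return hash(self.address)

    def __eq__(self, other):
        # Assumption: equality also looks at the address only.
        return type(other) is type(self) and self.address == other.address


old = WorkerLike("tcp://10.0.0.1:1234", nthreads=2)
new = WorkerLike("tcp://10.0.0.1:1234", nthreads=8)  # e.g. a restarted worker

same = old == new          # distinct objects, but they compare equal
seen = {old}
new_in_seen = new in seen  # the new object is treated as already present
```

Under these assumptions, any dict or set keyed on such objects cannot tell the two instances apart, even though their other attributes differ.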
-
Dear Authors,
Thanks for the amazing work!
When I run:
`torchrun --nproc_per_node=4 --master_port 4321 train.py gpus=[0] num_workers=4 name=BP_KITTI net=PMP data=KITTI lr=1e-3 train_batch_size=2 te…
-
### Description
DataFusion (https://arrow.apache.org/datafusion/) is a modular query engine developed as a subproject of Apache Arrow.
It is written in Rust (providing very high performance), h…
-
@mppf
What should the pattern be for sorting distributed arrays? One way that I can think of is to sort the portion of the array corresponding to each locale's local subdomains and then perform a …
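One possible reading of this pattern, sketched in Python rather than Chapel: each chunk stands in for one locale's local portion of the array. The final k-way merge is an assumption, since the original sentence is cut off, and `sort_chunks_then_merge` is a hypothetical helper name.

```python
import heapq


def sort_chunks_then_merge(chunks):
    """Sort each chunk independently, then merge the sorted runs."""
    # Step 1: every "locale" sorts its own local portion.
    sorted_chunks = [sorted(chunk) for chunk in chunks]
    # Step 2 (assumed): a k-way merge combines the sorted runs.
    return list(heapq.merge(*sorted_chunks))


# Three chunks standing in for three locales' local subdomains.
chunks = [[9, 2, 7], [4, 1], [8, 3, 6, 5]]
result = sort_chunks_then_merge(chunks)
```

In a real distributed setting the merge step needs communication between locales, which this single-process sketch does not model.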
-
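After the program finishes one epoch, it hangs during the second epoch of training and then fails with a timeout error.
Where is this problem likely to come from?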
[2024-05-09 01:12:34 accelerate.tracking]: Successfully logged to TensorBoard
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective…
-
Hello,
While training stage to network, I'm seeing the following error.
Is anyone seeing the same error?
Traceback (most recent call last):
File "./tools/train.py", line 256, in
main…