I tried to use torchsnapshot to save checkpoints of the model in the torchrec implementation. I made the following changes in dlrm_main.py for this purpose:
    for batched_iterator in batched(iterator, n):
        for it in itertools.count(start_it):
            try:
                if is_rank_zero and print_lr:
                    for i, g in enumerate(pipeline._optimizer.param_groups):
                        print(f"lr: {it} {i} {g['lr']:.6f}")
                pipeline.progress(batched_iterator)
                lr_scheduler.step()
                if is_rank_zero:
                    pbar.update(1)
                snapshot = torchsnapshot.Snapshot.take(
                    path="embedding_shards",
                    app_state=app_state,
                    replicated=["**"],
                )
            except StopIteration:
                if is_rank_zero:
                    print("Total number of iterations:", it)
                start_it = it
                break
I did get some weights saved in the embedding_shards directory; however, I am not sure how to interpret the saved directory.
In the embedding_shards directory, I see two directories: batched and sharded.
batched has 8 files (the names are UUIDs) with a total size of 196 GB.
I am not sure how to interpret the saved directory or how to read all the embedding tables from the saved output. Is there a way to gather all the weights on the CPU and dump them, or to extract the embedding tables and sharded layer weights, gather those, and then dump the tables on the host?
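To make the "gather, then dump" idea concrete, here is a minimal sketch of the gathering step alone. It assumes each shard is a contiguous block of rows identified by a (row offset, column offset) pair and that the table is split row-wise only. The function name `assemble_table` and the toy in-memory shards are hypothetical; loading real shard data from the snapshot is not shown.

```python
from typing import Dict, List, Tuple

Row = List[float]


def assemble_table(shards: Dict[Tuple[int, int], List[Row]]) -> List[Row]:
    """Concatenate row-wise shards into one table, ordered by row offset.

    `shards` maps (row_offset, col_offset) to the rows held by that shard.
    Assumption: row-wise sharding only, so every col_offset must be 0.
    """
    table: List[Row] = []
    for (row_offset, col_offset), rows in sorted(shards.items()):
        if col_offset != 0:
            raise ValueError("column-wise shards are not handled in this sketch")
        # Each shard must start exactly where the previous one ended.
        if row_offset != len(table):
            raise ValueError(f"gap or overlap at row offset {row_offset}")
        table.extend(rows)
    return table


# Toy data (hypothetical): a 3-row, 2-column table split into two shards,
# given out of order on purpose.
toy_shards = {
    (2, 0): [[0.5, 0.6]],
    (0, 0): [[0.1, 0.2], [0.3, 0.4]],
}
full_table = assemble_table(toy_shards)
print(len(full_table), "rows assembled")
```

The offset check is there so that a missing or duplicated shard fails loudly instead of silently producing a table with shifted rows.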
Apologies, can you file the issue here instead? https://github.com/mlcommons/training/issues. We'll respond there. The code here differs slightly from the code in the mlcommons repo.
The sharded directory has the following files, with a total size of 98 GB. Every file name has the form

model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_<N>.weight_<A>_0

for example model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_0_0. The values of <A> for each table are:

t_cat_0, t_cat_9, t_cat_19, t_cat_20, t_cat_21 (40 files each, the same values for all five tables):
0, 1048576, 2097152, 3145728, 4194304,
5000000, 6048576, 7097152, 8145728, 9194304,
10000000, 11048576, 12097152, 13145728, 14194304,
15000000, 16048576, 17097152, 18145728, 19194304,
20000000, 21048576, 22097152, 23145728, 24194304,
25000000, 26048576, 27097152, 28145728, 29194304,
30000000, 31048576, 32097152, 33145728, 34194304,
35000000, 36048576, 37097152, 38145728, 39194304

t_cat_10 (3 files): 0, 1048576, 2097152

t_cat_11 (1 file): 0

t_cat_22 (1 file): 0
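One possible reading of the shard file names listed above is that the two trailing numbers are the row offset and column offset of each shard within its table. That reading is an assumption, not something confirmed in this thread. Under it, the 40 offsets of the large tables are what you would get from a 40,000,000-row table split evenly across 8 ranks, with each rank's slice written in chunks of at most 1,048,576 rows. The sketch below parses the names and regenerates that offset pattern. The helper names and the values 40,000,000, 8, and 1,048,576 are inferred from the listing for illustration only.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Matches the tail of names such as "...embedding_bags.t_cat_0.weight_5000000_0".
SHARD_NAME = re.compile(r"embedding_bags\.(t_cat_\d+)\.weight_(\d+)_(\d+)$")


def parse_shard_name(name: str) -> Tuple[str, int, int]:
    """Split a shard file name into (table, first number, second number).

    Assumption: the two numbers are the shard's row and column offsets.
    """
    match = SHARD_NAME.search(name)
    if match is None:
        raise ValueError(f"unexpected shard file name: {name}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def group_by_table(names: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group the parsed offsets by table and sort them numerically."""
    tables: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for name in names:
        table, row_offset, col_offset = parse_shard_name(name)
        tables[table].append((row_offset, col_offset))
    return {table: sorted(offsets) for table, offsets in tables.items()}


def expected_row_offsets(num_rows: int, num_ranks: int, max_chunk_rows: int) -> List[int]:
    """Row offsets for an even row-wise split, written in bounded chunks."""
    rows_per_rank = num_rows // num_ranks  # assumes num_rows divides evenly
    offsets: List[int] = []
    for rank in range(num_ranks):
        start = rank * rows_per_rank
        offsets.extend(range(start, start + rows_per_rank, max_chunk_rows))
    return offsets


prefix = "model.sparse_arch.embedding_bag_collection.embedding_bags."
sample_names = [
    prefix + "t_cat_10.weight_1048576_0",
    prefix + "t_cat_10.weight_0_0",
    prefix + "t_cat_10.weight_2097152_0",
    prefix + "t_cat_11.weight_0_0",
]
grouped = group_by_table(sample_names)
offsets_40m = expected_row_offsets(40_000_000, 8, 1_048_576)
print(grouped["t_cat_10"])
print(len(offsets_40m), offsets_40m[:6])
```

If the regenerated offsets match the numbers in the file names for a table, that supports the row-offset reading for that table; it does not say anything about how the files in the batched directory are laid out.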