facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)
MIT License

saving checkpoints using Torchsnapshot #346

Closed mailvijayasingh closed 11 months ago

mailvijayasingh commented 1 year ago

I tried to use torchsnapshot to save checkpoints of the model in the torchrec implementation. I made the following changes to dlrm_main.py for this purpose:

for batched_iterator in batched(iterator, n):
    for it in itertools.count(start_it):
        try:
            if is_rank_zero and print_lr:
                for i, g in enumerate(pipeline._optimizer.param_groups):
                    print(f"lr: {it} {i} {g['lr']:.6f}")
            pipeline.progress(batched_iterator)
            lr_scheduler.step()
            if is_rank_zero:
                pbar.update(1)
            snapshot = torchsnapshot.Snapshot.take(
                path="embedding_shards",
                app_state=app_state,
                replicated=["**"],
            )
        except StopIteration:
            if is_rank_zero:
                print("Total number of iterations:", it)
            start_it = it
            break

I did get some weights saved in the embedding_shards directory; however, I am not sure how to interpret its contents.

In the embedding_shards directory I see two subdirectories, batched and sharded. batched has 8 files (the names are UUIDs) with a total size of 196 GB.

sharded has the following files, with a total size of 98 GB:

model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_29194304_0
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_0.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_10.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_11.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_1048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_3145728_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_19.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_16048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_37097152_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_20.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_21048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_26048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_7097152_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_21.weight_9194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_22.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_0_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_10000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_1048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_11048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_12097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_13145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_14194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_15000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_16048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_17097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_18145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_19194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_20000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_2097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_21048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_22097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_23145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_24194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_25000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_26048576_0 
model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_27097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_28145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_29194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_30000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_31048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_3145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_32097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_33145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_34194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_35000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_36048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_37097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_38145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_39194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_4194304_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_5000000_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_6048576_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_7097152_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_8145728_0 model.sparse_arch.embedding_bag_collection.embedding_bags.t_cat_9.weight_9194304_0
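
One way to get an overview of a listing like this is to group the file names by tensor. The sketch below assumes that each name has the form `<tensor path>_<row offset>_<column offset>`, which is inferred from the names above and not confirmed against the torchsnapshot source; `group_shard_files` is a hypothetical helper, not part of any library.

```python
import re
from collections import defaultdict

# Assumed naming pattern (inferred from the listing, not from torchsnapshot docs):
#   "<tensor path>_<row offset>_<column offset>"
SHARD_NAME = re.compile(r"^(?P<tensor>.+)_(?P<row>\d+)_(?P<col>\d+)$")


def group_shard_files(names):
    """Map each tensor path to its sorted list of (row_offset, col_offset)."""
    groups = defaultdict(list)
    for name in names:
        match = SHARD_NAME.match(name)
        if match is None:
            continue  # skip names that do not fit the assumed pattern
        groups[match["tensor"]].append((int(match["row"]), int(match["col"])))
    return {tensor: sorted(offsets) for tensor, offsets in groups.items()}


prefix = "model.sparse_arch.embedding_bag_collection.embedding_bags."
example = [
    prefix + "t_cat_10.weight_0_0",
    prefix + "t_cat_10.weight_2097152_0",
    prefix + "t_cat_10.weight_1048576_0",
    prefix + "t_cat_11.weight_0_0",
]
for tensor, offsets in group_shard_files(example).items():
    print(tensor, offsets)
```

Under that assumption, every file above has column offset 0, so the tables appear to be split by rows only, and each file would hold a contiguous block of rows starting at the given row offset.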

I am not sure how to interpret the saved directory or how to read all the embedding tables from the output shown above. Is there a way to gather the sharded embedding weights on the CPU and then dump the complete embedding tables on the host?
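
To make the "gather, then dump" idea concrete, here is a toy sketch of stitching row-wise chunks back into one table on the host. It assumes the chunks have already been loaded by some means and are keyed by their starting row offset; `stitch_rows` and the tiny data are hypothetical, and plain Python lists stand in for tensors (with real tensors the concatenation would typically be something like `torch.cat` along dimension 0).

```python
def stitch_rows(chunks, total_rows):
    """Concatenate row-wise chunks, given as {row_offset: list_of_rows}.

    Raises ValueError if the chunks leave a gap, overlap, or do not add up
    to total_rows.
    """
    table = []
    for offset in sorted(chunks):
        if offset != len(table):
            raise ValueError(
                f"chunk starts at row {offset}, but {len(table)} rows are filled"
            )
        table.extend(chunks[offset])
    if len(table) != total_rows:
        raise ValueError(f"expected {total_rows} rows, got {len(table)}")
    return table


# Toy stand-in for one embedding table with 3 rows of dimension 2,
# split into two chunks that arrive out of order.
toy_chunks = {
    2: [[0.2, 0.2]],
    0: [[0.0, 0.0], [0.1, 0.1]],
}
full_table = stitch_rows(toy_chunks, total_rows=3)
print(full_table)
```

The offset check is there because a missing or duplicated chunk would otherwise silently shift every later row. For loading the data itself, torchsnapshot's `Snapshot.read_object` may be worth a look; the exact entry paths depend on the keys used in `app_state`, which are not shown here.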

erichan1 commented 1 year ago

Apologies, can you file the issue at https://github.com/mlcommons/training/issues instead? We'll respond there. The code here differs slightly from the code in the mlcommons repo.