facebookresearch / dlrm
An implementation of a deep learning recommendation model (DLRM)
MIT License · 3.71k stars · 825 forks
Issues (newest first)
#339  Remove syncing logic from train pipeline and use single pipeline in DLRM (joshuadeng, closed 1 year ago, 10 comments)
#338  Get Criteo Kaggle dataset working with TorchRec-based DLRM (samiwilf, closed 1 year ago, 6 comments)
#337  Align Adagrad's eps parameter for embeddings and dense layers (janekl, closed 1 year ago, 1 comment)
#336  Update torchrec_dlrm/README.md with instructions to replicate MLPerf DLRM v1 settings; add DLRM v1 preprocessing script previously deleted unintentionally (samiwilf, closed 1 year ago, 4 comments)
#335  Change README file to include link to blog post (hjmshi, closed 1 year ago, 1 comment)
#334  Are there model weights available for DLRM v2? (mailvijayasingh, closed 1 year ago, 4 comments)
#333  [Q] Handling Boolean features (avnish-wynk, closed 1 year ago, 4 comments)
#332  Accuracy discrepancy between TorchRec and PyTorch DLRMs (allenfengjr, closed 1 year ago, 3 comments)
#331  Update torchrec_dlrm README (samiwilf, closed 1 year ago, 1 comment)
#330  Let OSS/Docker use 0.11.0 while 0.10.3 is used internally; code is compatible with both, so this should work (samiwilf, closed 1 year ago, 2 comments)
#329  Add MLPerf logging + amendments (janekl, closed 1 year ago, 1 comment)
#328  Change drop_last algorithm so it's cleaner and doesn't require a last batch of size up to 2*batch_size (samiwilf, closed 1 year ago, 3 comments)
#327  Use torchmetrics==0.10.3 because it was stable and worked; torcheval has some issues (samiwilf, closed 1 year ago, 1 comment)
#326  Loss is way too high when applying QR embedding with add operation (YoungsukKim12, closed 1 year ago, 9 comments)
#325  Change dataset traversal so all ranks start from consecutive batches at beginning of dataset (#930) (samiwilf, closed 1 year ago, 10 comments)
#324  Compute AUROC across ranks correctly (janekl, closed 1 year ago, 1 comment)
#323  Opt dlrm into black for auto-formatting (colin2328, closed 1 year ago, 3 comments)
#322  Apply lintrunner & edit docstrings (janekl, closed 1 year ago, 1 comment)
#321  Update README to include TorchRec tutorial; add comment linking to FBGEMM fused Adagrad call with explanation (colin2328, closed 1 year ago, 5 comments)
#320  Update comments in multi_hot_criteo.py (samiwilf, closed 1 year ago, 1 comment)
#319  Docs + a few edits (janekl, closed 1 year ago, 1 comment)
#318  Finalize Dockerfile & requirements.txt (janekl, closed 1 year ago, 1 comment)
#317  Requirements pin fix (janekl, closed 1 year ago, 1 comment)
#316  Pin dlrmv2 to torchrec (and fbgemm) v0.3.2 (colin2328, closed 1 year ago, 1 comment)
#315  Add in_backward_optimizer_filter to work with in_backward_optimizers (#892) (colin2328, closed 1 year ago, 1 comment)
#314  Compute AUROC using torcheval (janekl, closed 1 year ago, 1 comment)
#313  Add __len__ method to RestartableMap (bugfix) (janekl, closed 1 year ago, 2 comments)
#312  Change --drop_last to --drop_last_training_batch, applied only to the… (samiwilf, closed 1 year ago, 5 comments)
#311  Can't install TorchRec on GCP (zzh1024, closed 1 year ago, 2 comments)
#310  Add support for dropping last non-full batch (samiwilf, closed 1 year ago, 3 comments)
#309  Make PipelinedForward syncing transparent to caller (samiwilf, closed 1 year ago, 1 comment)
#308  Fix train/val/test for model using multiple train pipelines (joshuadeng, closed 1 year ago, 1 comment)
#307  Make pg, topology, and sharders optional to the planner (colin2328, closed 1 year ago, 1 comment)
#306  Add support for materializing and reading materialized 1tb criteo mul… (samiwilf, closed 1 year ago, 6 comments)
#305  AUROC calculation with the latest torchmetrics==0.11.0 (janekl, closed 1 year ago, 3 comments)
#304  Add support for materializing and reading materialized 1tb criteo multi-hot dataset (samiwilf, closed 1 year ago, 2 comments)
#303  Decouple train/val/test code by using separate pipelines for each; remove --change_lr since --lr_scheduler can perform the same behavior (samiwilf, closed 1 year ago, 6 comments)
#302  Add --print_sharding_plan option to torchrec_dlrm/dlrm_main.py (samiwilf, closed 1 year ago, 2 comments)
#301  Make DLRM symbolically traceable with FX, and fix Python version check (vkuzo, closed 1 year ago, 2 comments)
#300  Make pg, topology, and sharders optional to the planner (colin2328, closed 1 year ago, 7 comments)
#299  Flag for enabling TF32 mode for A100 (janekl, closed 1 year ago, 2 comments)
#298  Remove variable batch size from EBC init in DLRM (joshuadeng, closed 1 year ago, 1 comment)
#297  How to do asynchronous distributed training with DLRM? (PavithranRick, closed 1 year ago, 2 comments)
#296  Add support for in-memory Criteo training-set shuffle; add supporting unit tests (samiwilf, closed 1 year ago, 4 comments)
#295  Add MIT license to ai_codesign/dlrm (colin2328, closed 1 year ago, 3 comments)
#294  Use vanilla Adagrad instead of row-wise Adagrad for reference implementation (colin2328, closed 1 year ago, 5 comments)
#293  mini-batch-size and num-batches in relation to global sample number (JasonFantl, closed 1 year ago, 3 comments)
#292  distributed_launch doesn't work with Terabyte dataset (gakolhe, closed 1 year ago, 2 comments)
#291  Decouple train/val/test code by using separate pipelines for each; remove --change_lr since --lr_scheduler can perform the same behavior (samiwilf, closed 1 year ago, 11 comments)
#290  Remove dependence on torch.distributed.algorithms.join; instead, size batches so that all ranks always have the same num_batches, increasing batch sizes by 1 sample when necessary to keep num_batches equal across ranks (samiwilf, closed 1 year ago, 6 comments)