e11824496 / ExpDesign_WS22


Paper-Selection #1

Closed. e11824496 closed this issue 1 year ago.

e11824496 commented 1 year ago

SPLADE

Paper title: SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking (PDF)
Authors: Thibault Formal, Benjamin Piwowarski, and Stéphane Clinchant
Conference: SIGIR
Year: 2021
DOI: https://doi.org/10.1145/3404835.3463098
Code: https://github.com/naver/splade
Estimated Difficulty of Reproducibility: ?
Chosen Paper: ?
Group No.: 17

UltraGCN

Paper title: UltraGCN: Ultra Simplification of Graph Convolutional Networks for Recommendation (PDF)
Authors: Mao, K., Zhu, J., Xiao, X., Lu, B., Wang, Z., and He, X.
Conference: CIKM
Year: 2021
DOI: https://doi.org/10.1145/3459637.3482291
Code: https://github.com/xue-pai/UltraGCN
Estimated Difficulty of Reproducibility: ?
Chosen Paper: ?
Group No.: 17

Random Sampling Plus Fake Data

Paper title: Random Sampling Plus Fake Data: Multidimensional Frequency Estimates With Local Differential Privacy (PDF)
Authors: Héber H. Arcolezi, Jean-François Couchot, Bechara Al Bouna, and Xiaokui Xiao
Conference: CIKM
Year: 2021
DOI: https://doi.org/10.1145/3459637.3482467
Code: https://github.com/hharcolezi/ldp-protocols-mobility-cdrs
Estimated Difficulty of Reproducibility: ?
Chosen Paper: ?
Group No.: 17

e11824496 commented 1 year ago

SPLADE

Depending on what level of reproducibility we aim for, this might be hard. The team behind the paper published some fine-tuned models on Hugging Face (not for the original SPLADE, but for a newer version published at SIGIR '22). If we use one of these models and only run the evaluation, we might be able to reproduce the results exactly, since the model is already trained. Fine-tuning a pre-trained model ourselves might not be that easy: the authors trained theirs on 4 GPUs, and I don't have access to such computing power. Reproducing it on a single GPU might skew the hyper-parameters, as mentioned in their GitHub repo, because the batch size has to be adjusted during training. They do provide a config setting for mono-GPU training (for the new version). The whole project seems quite complicated at a brief look, but they provide a good amount of resources and information in their repo, so it might be quite doable. They also provide their exact training data.
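For the evaluation-only route, here is a minimal sketch of how one might load a published SPLADE checkpoint from Hugging Face and compute sparse representations. The checkpoint name is an assumption based on their repo (one of the newer SIGIR '22-era models), and this is my own illustration, not the authors' evaluation code:

```python
# Sketch: evaluation-only use of a published SPLADE checkpoint.
# Assumption: the checkpoint name below; this is not the authors' code.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "naver/splade-cocondenser-ensembledistil"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

def splade_rep(text: str) -> torch.Tensor:
    """Vocabulary-sized sparse representation: max_i log(1 + relu(logit_ij))."""
    tokens = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**tokens).logits            # (1, seq_len, vocab_size)
    mask = tokens["attention_mask"].unsqueeze(-1)  # zero out padding positions
    weights = torch.log1p(torch.relu(logits)) * mask
    return weights.max(dim=1).values.squeeze(0)    # (vocab_size,)

query = splade_rep("what is sparse lexical retrieval")
doc = splade_rep("SPLADE learns sparse expansions of queries and documents.")
print(float(torch.dot(query, doc)))  # relevance score: dot product of expansions
```

Since both query and document end up as sparse vocabulary-space vectors, ranking reduces to dot products, which is what makes SPLADE usable for first-stage retrieval with an inverted index.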

PS: I have worked with BERT and contextualized embeddings before, but never in an IR context.

UltraGCN

The GitHub repo seems quite straightforward: a single main file plus config files for the various datasets. The authors trained on an RTX 2080, which we might not have, but at least it's not an overly powerful card, so we should be able to reproduce a similar result on a different system in reasonable time. I didn't look into the details of the paper itself, but it seems quite doable.

On a different note: apparently they don't perform a train-validation-test split, only train-test, which might result in overfitting, since the best model is selected on the test set. A sketch of the split we would add is below.
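A minimal sketch (my own illustration, not code from the UltraGCN repo; the split fractions are placeholder choices) of holding out a validation set so that model selection never touches the test set:

```python
# Sketch: train/validation/test split for (user, item) interactions.
# Not from the UltraGCN repo; fractions below are placeholder assumptions.
import numpy as np

def split_interactions(interactions, val_frac=0.1, test_frac=0.2, seed=42):
    """Shuffle interaction pairs, then split: tune on val, report on test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(interactions))
    n_test = int(len(idx) * test_frac)
    n_val = int(len(idx) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    pick = lambda ids: [interactions[i] for i in ids]
    return pick(train_idx), pick(val_idx), pick(test_idx)

# Example: 8 interactions split three ways; best epoch would be chosen on
# the validation metric, and the test set evaluated only once at the end.
interactions = [(0, 3), (0, 7), (1, 2), (1, 5), (2, 7), (2, 9), (3, 1), (3, 4)]
train, val, test = split_interactions(interactions)
print(len(train), len(val), len(test))
```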

PS: I've barely worked with graph convolutional networks, so it's hard for me to estimate the difficulty.

Random Sampling Plus Fake Data

I don't understand this paper yet, so take everything I say here with a big grain of salt. It seems like an interesting topic, but it's hard to estimate how reproducing it would turn out. The GitHub page provides some Jupyter notebooks, and if running those is all we need to reproduce the results, it might be easy. For background, there is a sketch of the basic LDP building block below.
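For context, the paper builds on local differential privacy (LDP) frequency estimation. Here is a minimal sketch of Generalized Randomized Response (GRR), one of the standard LDP protocols in this area; this is my own illustration of the building block, not code from the authors' notebooks:

```python
# Sketch: Generalized Randomized Response (GRR) for LDP frequency estimation.
# My own illustration; not code from the paper's repository.
import numpy as np

rng = np.random.default_rng(0)

def grr_report(true_value: int, k: int, eps: float) -> int:
    """Report the true value with prob p, else a uniformly random other value."""
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p:
        return true_value
    others = [v for v in range(k) if v != true_value]
    return int(rng.choice(others))

def grr_estimate(reports, k: int, eps: float) -> np.ndarray:
    """Unbiased frequency estimate: (observed_freq - q) / (p - q)."""
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = 1.0 / (np.exp(eps) + k - 1)
    freqs = np.bincount(reports, minlength=k) / len(reports)
    return (freqs - q) / (p - q)

# Simulate: 100k users, a single attribute with 5 values, eps = 1.
data = rng.integers(0, 5, size=100_000)
reports = [grr_report(v, k=5, eps=1.0) for v in data]
print(grr_estimate(reports, k=5, eps=1.0))  # ~0.2 per value (true frequencies)
```

As I understand it, the paper's twist is the multidimensional setting: each user samples one attribute to report truthfully (via a protocol like GRR) and submits fake data for the others, but that part is exactly what I still need to read up on.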

Please chime in if I missed something or you think otherwise.

neuro-data commented 1 year ago

Thanks for the awesome summary! I looked at the repos and have a similar impression: RS+FD looks quite accessible, and UltraGCN also seems well-organized and not too complex.

Disclaimer: I've barely worked with any of these topics.

Tandemonium commented 1 year ago

Yeah, nice summary, thanks!

SPLADE

That's quite complicated code, and there's a lot of advanced NLP involved that would already be difficult just to understand. Also, given that they used such heavy hardware, I can't imagine how long we would need to reproduce anything.

Difficulty: hardest

UltraGCN

The code looks nice and is probably understandable, and I think it's interesting. Everything needed for reproducing seems to be there and well described. I don't have such a good GPU either, but that just makes it more interesting to compare results.

Difficulty: doable

Random Sampling

The code is pretty short and looks quite simple (a bit too simple). The theory, however, is pretty complicated; I didn't really understand it either, and I had never heard of this topic before. To me, this just doesn't seem like a wise choice.

I would go with UltraGCN: the code should be understandable, reproducing it should be rather easy, and I find it interesting.