aks2203 / easy-to-hard-data

Pytorch Datasets for Easy-To-Hard
MIT License

Top level README should factor in batch size when giving indices? #3

Closed rozim closed 3 years ago

rozim commented 3 years ago

Hi, I'm just starting to read your paper and review the code, and I'm wondering if the top level README.md is a bit off.

For example, it says: "We sorted the chess puzzles provided by Lichess, and the first 600,000 easiest puzzles make up an easy training set. Testing can be done with any subset of puzzles with higher indices. The default test set uses indices 600,000 to 700,000."

But the chess test script uses a batch size of 3000, so it seems that the last sentence should be "The default test set uses indices from 1.8M to 2.1M". ref: https://github.com/aks2203/easy-to-hard/blob/main/chess/launch/test_models.sh

I may of course misunderstand things, but I would expect that the data set, in isolation, is not aware of the batch size.

thx, Dave

aks2203 commented 3 years ago

Hi Dave,

The indices (600,000 and 700,000) are independent of batch size. The training data is then divided into batches. So if there are 600,000 puzzles in the training set and the batch size is set to 3,000, then there are 600,000/3,000 = 200 batches of 3,000 puzzles each.
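A minimal sketch of that arithmetic, assuming the index range quoted from the README and the batch size of 3,000 mentioned in this thread. The names below are illustrative only and are not taken from the repository:

```python
# Index range of the default chess test set, as quoted from the README.
TEST_START, TEST_END = 600_000, 700_000
TRAIN_SIZE = 600_000  # the 600,000 easiest puzzles
BATCH_SIZE = 3_000    # batch size mentioned in this thread (assumed here)


def num_batches(num_puzzles: int, batch_size: int) -> int:
    """Batches needed to cover num_puzzles; the last batch may be smaller."""
    return (num_puzzles + batch_size - 1) // batch_size


# The indices count puzzles, so they do not change with the batch size.
train_batches = num_batches(TRAIN_SIZE, BATCH_SIZE)
test_batches = num_batches(TEST_END - TEST_START, BATCH_SIZE)
print(train_batches, test_batches)
```

Changing BATCH_SIZE changes only the number of batches, not which puzzles belong to the training or test range.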

Does that clear things up? Am I misunderstanding your question?

Avi

rozim commented 3 years ago

Thanks - probably my earlier message was not as clear as it could have been - since this repository is about the data, I was expecting to see statistics on the data, independent of how the models use the data, so for chess I'm interested in how many positions there are.

BTW - make_chess.py (from here: https://github.com/aks2203/easy-to-hard-data/blob/main/make_chess.py) refers to deepthinking_lichess.csv but I don't see where that file is downloaded from, and it undoubtedly has the information.

I guess I could figure out how to parse the *.pth files in chess_data/ however I'm a Tensorflow person, not a Torch user :)

thx, Dave

aks2203 commented 3 years ago

Hi Dave,

Thanks for pointing out that a link to the CSV file is missing. It has been added to the README and for convenience, here is the link.

Can you be more specific about what you mean by "how many positions there are"? Would you like to know the number of possible chess boards or the number of puzzles in this dataset?

rozim commented 3 years ago

Oh, I was interested in the number of puzzles in the dataset, I guess the answer is 1,505,097 from:

wc -l deepthinking_lichess.csv
1505097 deepthinking_lichess.csv
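For readers without `wc`, a rough Python equivalent is sketched below. It counts newline characters, as `wc -l` does. Whether deepthinking_lichess.csv starts with a header row is not stated in this thread; if it does, the number of puzzles would be one less than the line count. The helper name `count_newlines` is made up for this example.

```python
from pathlib import Path


def count_newlines(path, chunk_size=1 << 20):
    """Count newline bytes in a file, reading it in chunks (like `wc -l`)."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += chunk.count(b"\n")
    return total


# Assumes the CSV has been downloaded into the current directory.
csv_path = Path("deepthinking_lichess.csv")
if csv_path.exists():
    print(count_newlines(csv_path), csv_path.name)
```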

FYI, in the paper at https://arxiv.org/pdf/2108.06011.pdf, the URL in reference [2] may have changed (or the '#' was stripped out in formatting). The paper has https://database.lichess.org/puzzles, but this works: https://database.lichess.org/#puzzles

thx, Dave

aks2203 commented 3 years ago

Thank you, the link is being fixed in the paper. Are there any other questions? Can I go ahead and close this issue?

rozim commented 3 years ago

Yes, sounds good, please close. Thanks for the explanations.

On Sat, Sep 25, 2021, 8:09 AM Avi Schwarzschild @.***> wrote:

Thank you, the link is being fixed in the paper. Are there any other questions? Can I go ahead and close this issue?
