Kaggle / kaggle-api

Official Kaggle API
Apache License 2.0
6.01k stars 1.06k forks

Cannot list datasets beyond page 500 #553

Closed (lbhm closed 3 months ago)

lbhm commented 3 months ago

In the context of a research project, I would like to gather metadata about all datasets that match certain keywords (e.g., age). Conceptually, this should be no problem with the Kaggle API by iterating through the result pages.

However, I noticed that there seems to be an undocumented limit on the number of result pages for a search query. For example, the query age [1] has about 16K results according to the web UI. Nevertheless, every result page beyond page 500 is empty, which caps the retrievable results at roughly 10,000.

Thus, my question: Is there any way to acquire dataset metadata for queries with more than 10,000 results?

[1] https://www.kaggle.com/datasets?search=age
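
For readers who want to reproduce the behavior described above, here is a minimal sketch of iterating through result pages until one comes back empty. The helper names `iter_search_results` and `kaggle_fetcher` are hypothetical, and the default of 500 pages only mirrors the limit observed in this issue; it is not a documented constant. The sketch assumes the official `kaggle` Python package and its `KaggleApi.dataset_list(search=..., page=...)` method.

```python
from typing import Callable, Iterator, List


def iter_search_results(
    fetch_page: Callable[[int], List],
    max_pages: int = 500,  # assumption: mirrors the limit observed in this issue
) -> Iterator:
    """Yield results page by page, stopping at the first empty page."""
    for page in range(1, max_pages + 1):
        results = fetch_page(page)
        if not results:
            break  # an empty page means no more retrievable results
        yield from results


def kaggle_fetcher(query: str) -> Callable[[int], List]:
    """Build a page fetcher backed by the official `kaggle` package."""
    # Imported lazily so iter_search_results can be used without credentials.
    from kaggle.api.kaggle_api_extended import KaggleApi

    api = KaggleApi()
    api.authenticate()
    return lambda page: api.dataset_list(search=query, page=page)
```

With valid credentials, `iter_search_results(kaggle_fetcher("age"))` would walk the pages for the query in question. Passing the fetcher as a function keeps the paging logic testable without network access.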

jplotts commented 3 months ago

Hi @lbhm - it's true that there are limits on how far back you can search via the API. We may look into revising those in the future, but there's no definitive timeline at this point. One workaround could be using the Meta Kaggle dataset. Admittedly, this would only allow you to query titles, so it's a bit less than you'd get from kaggle.com, but perhaps it's sufficient for your purpose.
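
As a rough illustration of the title-only workaround, the sketch below filters rows of a CSV file by keyword. The function names, the file layout, and the column name `Title` are all assumptions about the Meta Kaggle export, not something confirmed in this thread; adjust `title_field` to whatever the actual file uses.

```python
import csv
import re
from typing import Dict, Iterable, Iterator, List


def match_titles(
    rows: Iterable[Dict[str, str]],
    keyword: str,
    title_field: str = "Title",  # assumption: name of the title column
) -> Iterator[Dict[str, str]]:
    """Yield rows whose title contains the keyword as a whole word."""
    # Word boundaries keep "age" from matching "image" or "average".
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for row in rows:
        if pattern.search(row.get(title_field) or ""):
            yield row


def search_csv(path: str, keyword: str, title_field: str = "Title") -> List[Dict[str, str]]:
    """Apply match_titles to a CSV file on disk (path is hypothetical)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(match_titles(csv.DictReader(handle), keyword, title_field))
```

Whole-word matching is a design choice here; the web search may match differently, so result counts are unlikely to agree exactly.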

lbhm commented 3 months ago

Thank you for the quick response @jplotts.

Unfortunately, titles and IDs alone are not quite enough for us. In this research project, we are investigating dataset search techniques that go beyond basic keyword search, so I am especially interested in column statistics (e.g., value ranges, number of unique values, and number of null values).

Please allow me two follow-up questions:

jplotts commented 3 months ago

Hi @lbhm -

  1. I agree that the column statistics would be nice, but unfortunately, that's not on our roadmap right now.
  2. We have rate limits on our API endpoints to prevent abuse. We don't track the total number of downloads a user makes, so you can download an arbitrary amount as long as you spread the requests out. I can't provide specific rate limit numbers (they may change anyway), but we believe they are sufficient for most reasonable use cases.
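
One way to "spread the requests out" is a fixed pause between calls plus exponential backoff on failures. The sketch below is only an illustration: `throttled_map` is a hypothetical helper, and the default interval and retry count are placeholder assumptions, since no actual rate limit numbers are given in this thread.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def throttled_map(
    func: Callable,
    items: Iterable,
    min_interval: float = 2.0,  # assumption: placeholder pause in seconds
    max_retries: int = 3,       # assumption: placeholder retry budget
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple]:
    """Call func(item) for each item, pausing between calls and backing off on errors."""
    for item in items:
        delay = min_interval
        for attempt in range(max_retries + 1):
            try:
                result = func(item)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retry budget exhausted, surface the error
                delay *= 2  # double the wait before each retry
                sleep(delay)
        yield item, result
        sleep(min_interval)  # fixed pause before the next item
```

With the official `kaggle` package, `func` could wrap a call such as `api.dataset_download_files(ref, path=..., unzip=True)`. The `sleep` parameter exists only so the pacing can be tested without actually waiting.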

lbhm commented 3 months ago

Alright, I'll download the datasets at an appropriate rate and recreate the column statistics myself then.
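
A minimal sketch of what recreating such column statistics could look like for a downloaded CSV file, using only the standard library. The function names and the definition of "null" (an empty or whitespace-only cell) are assumptions made for illustration; real datasets may need type inference and other null markers.

```python
import csv
from typing import Dict, Iterable


def column_stats(rows: Iterable[Dict[str, str]]) -> Dict[str, dict]:
    """Compute null count, distinct count, and numeric range per column."""
    stats: Dict[str, dict] = {}
    for row in rows:
        for name, raw in row.items():
            if name is None:
                continue  # skip overflow cells from malformed rows
            col = stats.setdefault(
                name, {"nulls": 0, "values": set(), "min": None, "max": None}
            )
            value = (raw or "").strip()
            if value == "":
                col["nulls"] += 1  # assumption: empty cell counts as null
                continue
            col["values"].add(value)
            try:
                number = float(value)
            except ValueError:
                continue  # non-numeric cells do not affect the range
            col["min"] = number if col["min"] is None else min(col["min"], number)
            col["max"] = number if col["max"] is None else max(col["max"], number)
    return {
        name: {"nulls": c["nulls"], "unique": len(c["values"]),
               "min": c["min"], "max": c["max"]}
        for name, c in stats.items()
    }


def csv_column_stats(path: str) -> Dict[str, dict]:
    """Run column_stats over a CSV file on disk."""
    with open(path, newline="", encoding="utf-8") as handle:
        return column_stats(csv.DictReader(handle))
```

The range covers only cells that parse as numbers, so a mostly textual column with a few numeric cells will still report a minimum and maximum.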

Thank you for your help!