dawsbot / eth-labels

📃 A public dataset of crypto addresses labeled
https://eth-labels-production.up.railway.app/swagger
MIT License

Implement pagination crawling (use coinbase as an example) #75

Closed dawsbot closed 3 months ago

dawsbot commented 3 months ago

Blocked by #83

https://etherscan.io/accounts/label/coinbase lists 133,654 accounts, yet our file has only 126:

https://github.com/dawsbot/evm-labels/blob/v1/data/etherscan/coinbase/accounts.json
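
To illustrate what "pagination crawling" could look like, here is a minimal sketch, not the project's actual crawler. It assumes a hypothetical `fetch_page(start, length)` callable that returns one page of account records; the page size of 100 and the fake in-memory source are also assumptions made only so the example runs on its own.

```python
from typing import Callable, Iterator, List


def crawl_all_pages(
    fetch_page: Callable[[int, int], List[dict]],  # hypothetical page fetcher
    page_size: int = 100,  # assumed page size
) -> Iterator[dict]:
    """Yield every record by requesting consecutive pages until one comes back short."""
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        # A page shorter than page_size (including an empty one) is the last page.
        if len(page) < page_size:
            break
        start += page_size


def make_fake_source(total: int) -> Callable[[int, int], List[dict]]:
    """Build an in-memory stand-in for a paginated label page (illustration only)."""
    records = [{"address": f"0x{i:040x}"} for i in range(total)]

    def fetch_page(start: int, length: int) -> List[dict]:
        return records[start:start + length]

    return fetch_page


accounts = list(crawl_all_pages(make_fake_source(250), page_size=100))
print(len(accounts))  # 250
```

With a loop like this, the number of records written to `accounts.json` would no longer be capped by whatever a single request returns.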

kylewandishin commented 3 months ago

@dawsbot I looked into this, and I believe it is a product of how we do not pull any pages with more than 10,000 records. Looking through previous commits, I found that we pulled 27 records from coinbase in 6dee43254ad36a9b2763dd6f2ab5eff2aa0f7169, then 126 records in 69b31a76b97beecdff10b8f4eb9d94bf51c2ada7, and now that the label exceeds 10,000 records we are no longer pulling it. I think it could be beneficial to clear the current etherscan data to remove unmaintained data.

At a minimum, that file has not been pulled for at least 3 weeks, and it was not updated by my most recent pull in commit e4de3eb08eade44472e66abb564adcc3ff4ea4d9.
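
The behavior described in this comment can be reproduced with a small simulation. This is a sketch under stated assumptions, not the repository's code: `MAX_RECORDS`, `pull_labels`, and the in-memory `files` dict (standing in for the JSON files on disk) are all hypothetical names. It shows how skipping a label above the threshold leaves the previously written 126 records in place.

```python
from typing import Callable, Dict, List

MAX_RECORDS = 10_000  # threshold mentioned in the comment; the constant name is hypothetical


def pull_labels(
    label_counts: Dict[str, int],
    files: Dict[str, List[str]],
    fetch: Callable[[str], List[str]],
) -> List[str]:
    """Refresh `files` in place and return the labels that were skipped."""
    skipped = []
    for label, count in label_counts.items():
        if count > MAX_RECORDS:
            # Skipped labels keep whatever an earlier run wrote, so the file goes stale.
            skipped.append(label)
            continue
        files[label] = fetch(label)
    return skipped


# State left behind by an earlier run, when coinbase was still below the threshold.
files = {"coinbase": [f"old-{i}" for i in range(126)]}
label_counts = {"coinbase": 133_654, "small-label": 3}

skipped = pull_labels(
    label_counts,
    files,
    fetch=lambda label: [f"{label}-{i}" for i in range(label_counts[label])],
)
print(skipped, len(files["coinbase"]))  # ['coinbase'] 126
```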

dawsbot commented 3 months ago

@kylewandishin I don't understand why it wasn't updated in the last 3 weeks. Whenever we pull down etherscan data we should overwrite this file no matter what, right?

kylewandishin commented 3 months ago

@dawsbot If there is nothing to pull, we don't overwrite the file. I think we should implement logic for removing files, but we may want to meet and decide on the best method.
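
One possible shape for the "removing files" logic is sketched below. It is only an illustration of one option, since the method had not been decided in this thread. The function name `remove_stale_label_files` is hypothetical, and the `<label>/accounts.json` layout is inferred from the file URL linked earlier in this issue.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def remove_stale_label_files(
    data_root: Path,
    refreshed: Iterable[str],
    filename: str = "accounts.json",
) -> List[str]:
    """Delete `filename` for every label directory that was not refreshed in this run."""
    refreshed_set = set(refreshed)
    removed = []
    for label_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
        target = label_dir / filename
        if label_dir.name not in refreshed_set and target.exists():
            target.unlink()
            removed.append(label_dir.name)
    return removed


# Demo in a throwaway directory: only "binance" was refreshed in this run.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for label in ("coinbase", "binance"):
        (root / label).mkdir()
        (root / label / "accounts.json").write_text("[]")
    removed = remove_stale_label_files(root, refreshed=["binance"])
    coinbase_exists = (root / "coinbase" / "accounts.json").exists()
    binance_exists = (root / "binance" / "accounts.json").exists()

print(removed)  # ['coinbase']
```

An alternative with the same effect would be to clear the whole output directory before each pull, at the cost of losing data whenever a pull fails partway through.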

dawsbot commented 3 months ago

Deal, let's discuss this tomorrow IRL @kylewandishin

The issue we're discussing this under is still a problem, though, and it is the highest priority, more so than the "extra + stale data" issue you're describing.

kylewandishin commented 3 months ago

This issue (coinbase missing records) is a side effect of the extra/stale data. If we decide how to handle that tomorrow and fix it, we will be able to close this issue as well.

@dawsbot #75 is just an example of the stale/extra data problem.