merge and update for etherscan

brianleect / etherscan-labels

Full label data dump of top EVM chains in JSON/CSV.

MIT License

263 stars 78 forks

merge and update for etherscan #38

Closed c0mm4nd closed 1 year ago

c0mm4nd commented 1 year ago

Some changes:

increased the interval because of my slow network
unified all next-page behavior by clicking the > button, rather than visiting a new page
removed the ignore list (most entries are fixed)
- fixed some by manual input (e.g. liqui.io)
- fixed some with //div[contains(@class, "active")] and td[{td_index}] in XPath (e.g. old contracts)
formatted with the black formatter, renamed most variables from varName style to snake_case per PEP 8

Some changes are not uploaded yet:

tried adding cookie saving and loading for Selenium, to avoid logging in again after a sudden restart,
- but it failed. All related code has been removed. Still looking for a solution.
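
For reference, the usual shape of cookie persistence with Selenium is sketched below. This is not the code that was removed from the PR, and the thread reports that no working solution was found for this site, so treat it only as an illustration of the general pattern. The helper names `save_cookies` and `load_cookies` and the JSON file layout are assumptions; `get_cookies`, `add_cookie`, `get` and `refresh` are standard WebDriver methods.

```python
import json
from pathlib import Path


def save_cookies(driver, path: str) -> None:
    """Write the browser's current cookies to a JSON file."""
    Path(path).write_text(json.dumps(driver.get_cookies()))


def load_cookies(driver, path: str, url: str) -> bool:
    """Restore cookies from path; return False if no file exists.

    Selenium only accepts cookies for the domain currently loaded,
    so the site is opened before the cookies are added.
    """
    file = Path(path)
    if not file.exists():
        return False
    driver.get(url)
    for cookie in json.loads(file.read_text()):
        driver.add_cookie(cookie)
    # Reload so the page is requested again with the restored cookies.
    driver.refresh()
    return True
```

A scraper would call `save_cookies` after a successful login and `load_cookies` at startup, falling back to a normal login when it returns False or when the restored session is rejected.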

brianleect commented 1 year ago

Thanks for the improvements @c0mm4nd

There's an ongoing fix https://github.com/brianleect/etherscan-labels/pull/37 that I'm working on that fixes the label truncation for etherscan token names as mentioned in https://github.com/brianleect/etherscan-labels/issues/34 .

So I think we will go with the merged data from the PR I did, I'm currently rescraping due to a small bug and I'll merge it soon.

As for your changes, I think the design/formatting fixes look good, but I'm not too sure about unifying next-page behavior using '>', as I think incrementing by index is more flexible if the site's styling changes?

And cookie saving would definitely be great if we can figure it out, sadly I've not really found a solution for it, maybe I'll try giving it another shot when I've more time as well.

c0mm4nd commented 1 year ago

Reasons for clicking the > button:

start=100 does not work on the Etherscan token page
- e.g. https://etherscan.io/tokens/label/defi?subcatid=0&size=50&start=100&col=3&order=desc
- the URL says start=100, but the body shows the start=0 content
- and the page then rewrites the parameter back to start=0
clicking > changes the URL via JavaScript from start=0 to start=100, so it does the same thing as incrementing the index
fewer full-page reloads, which helps significantly in avoiding Cloudflare detection
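
To make the `start` parameter discussion concrete, here is a minimal sketch of reading and setting that offset in a label URL with the standard library. The helper names `with_start` and `get_start` are hypothetical and are not part of the PR.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_start(url: str, start: int) -> str:
    """Return url with its start query parameter set to the given offset."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["start"] = str(start)
    return urlunsplit(parts._replace(query=urlencode(query)))


def get_start(url: str) -> int:
    """Read the start offset from url, defaulting to 0 when absent."""
    query = dict(parse_qsl(urlsplit(url).query))
    return int(query.get("start", 0))
```

With either paging approach, a scraper could compare `get_start(driver.current_url)` against the offset it expects, which would reveal the case described above where the page resets the parameter to start=0.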

KNOWN BUG: This works for most label pages, but for a few, such as beacon-depositor, the response is always delayed (due to the large size), so the content may be duplicated. SOLUTIONS:

simply increase the sleep time
or compare the address_list with the previous page's
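
The two suggestions can be combined: wait, re-read the table, and accept the page only once its addresses differ from the previous page. The sketch below assumes a hypothetical `read_addresses` callable that returns the addresses currently shown (for example, scraped through Selenium). The function name, retry count and delays are illustrative and do not come from the PR.

```python
import time
from typing import Callable, List


def next_distinct_page(
    read_addresses: Callable[[], List[str]],
    previous: List[str],
    retries: int = 5,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Re-read the table until it differs from the previous page."""
    for attempt in range(retries):
        current = read_addresses()
        if current and current != previous:
            return current
        # Stale or empty content: wait a little longer on each attempt.
        sleep(delay * (attempt + 1))
    raise TimeoutError("page content did not change after clicking next")
```

Raising after the retries run out, rather than silently keeping the stale list, means a slow page such as beacon-depositor shows up as an error instead of as duplicated rows in the output.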

c0mm4nd commented 1 year ago

https://metadata.etherscan.io/api-endpoint/address-metadata

Closing this PR since it looks like this API can provide better results.