brianleect / etherscan-labels

Full label data dump of top EVM chains in JSON/CSV.
MIT License
[Bug] Etherscan scraping broken #30

Closed. brianleect closed this issue 1 year ago

brianleect commented 1 year ago

Length of values (0) does not match length of index (3)

Code of interest

# Retrieve all addresses from table
elems = driver.find_elements("xpath", "//tbody//a[@href]")
addressList = []
addrIndex = len(baseUrl + '/address/')
for elem in elems:
    href = elem.get_attribute("href")
    if href.startswith('baseUrl/address/'):
        addressList.append(href[addrIndex:])

# Quickfix: Optimism uses etherscan subcat style but differing address format
if targetChain == 'eth':
    # Replace address column in newTable dataframe with addressList
    curTable['Address'] = addressList

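
For reference, a minimal sketch of the href filtering step, written as plain Python so it runs without Selenium. It assumes the prefix check in the snippet above is literally as quoted; in that case `'baseUrl/address/'` is a fixed string rather than the value of `baseUrl`, so no absolute URL can match it and `addressList` stays empty, which would be consistent with the "Length of values (0)" error. The helper name `extract_addresses` and the sample hrefs are hypothetical and not taken from the repository.

```python
def extract_addresses(hrefs, base_url):
    """Return the address part of every href that starts with base_url + '/address/'."""
    prefix = base_url + "/address/"
    # Skip missing hrefs (None) and links that point elsewhere (tokens, txs, ...).
    return [h[len(prefix):] for h in hrefs if h and h.startswith(prefix)]


# Hypothetical hrefs, shaped like what get_attribute("href") returns (absolute URLs).
sample_hrefs = [
    "https://etherscan.io/address/0xabc",
    "https://etherscan.io/token/0xdef",
    None,
    "https://etherscan.io/address/0x123",
]

# Filtering on the literal string from the quoted snippet matches nothing.
literal_matches = [h for h in sample_hrefs if h and h.startswith("baseUrl/address/")]

# Filtering on the actual base URL keeps the two address links.
addresses = extract_addresses(sample_hrefs, "https://etherscan.io")

print(literal_matches)
print(addresses)
```

Comparing `len(addresses)` with the number of table rows before assigning the column would surface this kind of mismatch with a clearer message than the pandas error.
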
brianleect commented 1 year ago

Suspected cause: the element extraction that collects the addresses no longer works, perhaps because of a UI change?

brianleect commented 1 year ago

A UI change should not have had an impact, since the code extracts every element that has an href. Weird. Need to dig deeper.

brianleect commented 1 year ago

Basic scraping fixed by https://github.com/brianleect/etherscan-labels/commit/fa2ca24da39cab7961169a89076da6047c5d201c

New problem identified: subcatid does not seem to be retrieved reliably. E.g. the 1inch retrieval only retrieves main; subcatid is empty for some reason.

However, in certain cases, such as augmented-finance, subcat_values is retrieved successfully: augmented-finance subcat_values: ['1', '0']

Suspected cause: maybe the page does not have enough time to load before the subcat id is scraped?
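
If load timing is the cause, one way to test the idea is to poll for the subcat values instead of reading them once. The sketch below is only an illustration under that assumption: `poll_until_nonempty` and `fake_fetch` are hypothetical names, and the fake fetch stands in for whatever call reads the subcat ids from the page. With Selenium, an explicit wait via `WebDriverWait(driver, timeout).until(...)` is the usual tool for the same purpose.

```python
import time


def poll_until_nonempty(fetch, attempts=5, delay=1.0):
    """Call fetch() until it returns a non-empty result or attempts run out."""
    result = []
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        # Wait before the next try, but not after the last one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return result


# Stand-in for a page that only exposes subcat ids on the third read.
calls = {"n": 0}


def fake_fetch():
    calls["n"] += 1
    return ["1", "0"] if calls["n"] >= 3 else []


values = poll_until_nonempty(fake_fetch, attempts=5, delay=0)
print(values, "after", calls["n"], "reads")
```

One caveat: a label that genuinely has no subcats would use up every attempt, so the attempt count and delay need to stay small.
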

brianleect commented 1 year ago

Retried, and it failed on the first try but worked on the follow-ups? Not sure what the cause is.

Steps

  1. Call single 1inch (No subcats)
  2. Call single augmented-finance (Subcats found)
  3. Call single 1inch (Subcats now found)

However, I was unable to replicate this afterwards. It suddenly started working.

brianleect commented 1 year ago

It may be possible to detect faulty labels by checking for a decrease in label count compared with the previous scrape.
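
A minimal sketch of that check, assuming each scrape can be summarised as a mapping from label name to the number of addresses collected. The function name `find_shrunken_labels` and the counts below are hypothetical, not real scrape results.

```python
def find_shrunken_labels(previous, current):
    """Return {label: (old_count, new_count)} for labels whose count went down.

    A label missing from the current scrape is treated as a count of 0.
    """
    flagged = {}
    for label, old_count in previous.items():
        new_count = current.get(label, 0)
        if new_count < old_count:
            flagged[label] = (old_count, new_count)
    return flagged


# Hypothetical counts from two consecutive scrapes.
previous_counts = {"1inch": 40, "augmented-finance": 12, "aave": 30}
current_counts = {"1inch": 25, "augmented-finance": 12}

suspects = find_shrunken_labels(previous_counts, current_counts)
print(suspects)
```

A drop does not prove the scrape failed, since labels can legitimately shrink, so the flagged labels are candidates for a re-scrape rather than confirmed failures.
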