Convex-Labs / honestnft-shenanigans

HonestNFT Shenanigan Scanning Tools
https://honestnft-shenanigans.readthedocs.io/
MIT License

WIP: scrape OpenSea for NFTs marked as suspicious #90

Closed TAnas0 closed 2 years ago

TAnas0 commented 2 years ago

Resolves #86

This is a scraper that goes through all NFTs in an OpenSea collection and checks whether each one is marked as suspicious.

You can find a list of TODOs/improvements at the top of the PR's main file. They will be developed after discussion, depending on needs.

To test the file, you can run the following command: python fair_drop/suspicious.py -c 0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d

TODOs:

@Barabazs Please let me know if you see possible improvements

nickbax commented 2 years ago

Played around with it a bit as I'm really excited to start getting useful data! This is likely the top priority to get it in a working state:

Another one I'd like to see is a sleep-time flag.

The error I get frequently is "connection refused". I always bypass it with:

    import time

    res = ''  # keep retrying until the request succeeds
    while res == '':
        try:
            res = scraper.get(nft_url)
            break
        except Exception:  # typically a refused or reset connection
            print('connection error')
            time.sleep(30)  # back off before retrying
            continue

That's how I've always done it. I have no idea if requests.packages.urllib3.util.retry.Retry offers any meaningful advantage over this.
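
For what it's worth, a minimal sketch of the Retry approach looks roughly like this (the retry count and backoff factor are illustrative, not necessarily what the PR uses, and nft_url is assumed to be defined as in the snippet above):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Illustrative configuration; the PR may use different values.
    retries = Retry(
        total=3,                                     # give up after 3 failed attempts
        backoff_factor=2,                            # exponential sleep between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on rate limiting / server errors
    )

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retries))

    res = session.get(nft_url)  # retries and backoff happen inside the adapter

The main practical difference from the manual loop is that the backoff and the 429 handling apply to every request made through the session, without wrapping each call in a try/except.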

TAnas0 commented 2 years ago

Hey @nickbax ,

On saving data and resuming scraping

Data is now saved into CSVs, which I currently put in the fair_drop folder. For each NFT we store the OpenSea link, the owner's name, and whether the NFT is marked as suspicious.

Data is also saved in batches of 25 NFTs, so the CSV is built up gradually and no data is lost if the script stops unexpectedly. The CSVs double as a cache: links that have already been scraped are skipped, so resuming a scrape is no problem at all.
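
As a rough illustration of that save-in-batches / skip-already-scraped pattern (the file name, column names, and helper functions here are made up for the example, not necessarily what suspicious.py uses):

    import csv
    import os

    CSV_PATH = "fair_drop/suspicious_nfts.csv"      # hypothetical output file
    FIELDS = ["nft_url", "owner", "is_suspicious"]  # link, owner name, suspicious flag

    def load_scraped_urls():
        """Return the set of NFT URLs already in the CSV, used as a cache."""
        if not os.path.exists(CSV_PATH):
            return set()
        with open(CSV_PATH, newline="") as f:
            return {row["nft_url"] for row in csv.DictReader(f)}

    def append_batch(rows):
        """Append a batch of scraped rows so partial progress survives a crash."""
        write_header = not os.path.exists(CSV_PATH)
        with open(CSV_PATH, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if write_header:
                writer.writeheader()
            writer.writerows(rows)

The scraper then only iterates over links not returned by load_scraped_urls and calls append_batch every 25 results, which is what makes resuming cheap.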

On the sleep time and retries

I've implemented two configurable parameters for this:

  1. The first was using requests' Retry. Thanks a lot for pointing this one out; it makes the scraper more robust and resilient. I've set it to retry each failed request a maximum of 3 times (including rate-limiting responses), with a backoff factor of 8, which in practice means waiting roughly 4/8/16 seconds across the 3 retries.

  2. If the scraper is still rate limited even after all these retries, there is a configurable sleep timer, which defaults to 30 seconds; it also retries 3 times before giving up on the scraping job. Both parameters are adjustable as follows (see the sketch after the commands below):

python suspicious.py -c <collection_address> -r 5 -s 50
python suspicious.py -c <collection_address> --retry 5 --sleep 50
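
Here is a sketch of how those two flags could drive the outer retry loop on top of the session-level Retry (the argument parsing and the helper name are hypothetical; only the -c/-r/-s flags and their defaults come from the description above):

    import argparse
    import time

    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--collection", required=True, help="collection contract address")
    parser.add_argument("-r", "--retry", type=int, default=3, help="outer retries when still rate limited")
    parser.add_argument("-s", "--sleep", type=int, default=30, help="seconds to sleep between outer retries")
    args = parser.parse_args()

    def get_with_sleep(session, nft_url):
        """Outer loop: if the session-level Retry is exhausted, sleep and try again."""
        for _ in range(args.retry):
            try:
                return session.get(nft_url)  # session already has urllib3 Retry mounted
            except Exception:
                time.sleep(args.sleep)
        return None  # give up on this URL after args.retry attempts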

I suggest you try it on the following collections, because there are still some limitations that we need to address:

python fair_drop/suspicious.py -c 0xe21ebcd28d37a67757b9bc7b290f4c4928a430b1  # The Saudis
python fair_drop/suspicious.py -c 0x78d61c684a992b0289bbfe58aaa2659f667907f8  # Superplastic: supergucci
python fair_drop/suspicious.py -c 0xb47e3cd837ddf8e4c57f05d70ab865de6e193bbb  # CryptoPunks
Barabazs commented 2 years ago

I (manually) tested the cache/retry and it seems to work as expected. :heavy_check_mark:

We have an example of multithreading with a Retry session in pulling.py that works fairly well for us.
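
For readers who haven't seen it, the general shape of that pattern is roughly the following (a sketch only, not the actual pulling.py code; names and values are made up):

    from concurrent.futures import ThreadPoolExecutor

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # One session with urllib3 Retry mounted, reused by all worker threads.
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=2)))

    def fetch(url):
        """Fetch one NFT page and return its URL and HTTP status code."""
        return url, session.get(url).status_code

    urls = []  # NFT page URLs to scrape
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch, urls))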