Convex-Labs / honestnft-shenanigans

HonestNFT Shenanigan Scanning Tools
https://honestnft-shenanigans.readthedocs.io/
MIT License

WIP: scrape OpenSea for NFTs marked as suspicious #90

Closed TAnas0 closed 2 years ago

TAnas0 commented 2 years ago

Resolves #86

This is a scraper that goes through all NFTs in an OpenSea collection and checks whether each one is marked as suspicious.

You can find a list of TODOs/improvements at the top of the PR's main file. They will be developed after discussion, depending on needs.

To test the file, you can run the following command: python fair_drop/suspicious.py -c 0xbc4ca0eda7647a8ab7c2061c2e118a18a936f13d

TODOs:

@Barabazs Please let me know if you see possible improvements

nickbax commented 2 years ago

Played around with it a bit as I'm really excited to start getting useful data! This is likely the top priority to get it in a working state:

Another one I'd like to see is a sleep-time flag.

The error I get frequently is "connection refused". I always bypass it with:

    import time

    res = ''  # keep retrying until the request succeeds
    while res == '':
        try:
            res = scraper.get(nft_url)
            break
        except Exception:  # typically a refused or reset connection
            print('connection error')
            time.sleep(30)  # back off before retrying
            continue

That's how I've always done it. I have no idea if requests.packages.urllib3.util.retry.Retry offers any meaningful advantage over this.
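
For what it's worth, a minimal sketch of the Retry approach looks roughly like this (the retry count and backoff factor are illustrative, not necessarily what the PR uses, and nft_url is assumed to be defined as in the snippet above):

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # Illustrative configuration; the PR may use different values.
    retries = Retry(
        total=3,                                     # give up after 3 failed attempts
        backoff_factor=2,                            # exponential sleep between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry on rate limiting / server errors
    )

    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retries))

    res = session.get(nft_url)  # retries and backoff happen inside the adapter

The main practical difference from the manual loop is that the backoff and the 429 handling apply to every request made through the session, without wrapping each call in a try/except.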

TAnas0 commented 2 years ago

Hey @nickbax ,

On saving data and resuming scraping

Data is now saved into CSVs, which I currently put in the fair_drop folder. For each NFT we store the OpenSea link, the owner's name, and whether the NFT is marked as suspicious.

Data is also saved in batches of 25 NFTs, so the CSV is built up gradually and no data is lost if the script stops unexpectedly. The CSVs double as a cache: links that have already been scraped are skipped, so resuming a scrape is no problem at all.
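
As a rough illustration of that save-in-batches / skip-already-scraped pattern (the file name, column names, and helper functions here are made up for the example, not necessarily what suspicious.py uses):

    import csv
    import os

    CSV_PATH = "fair_drop/suspicious_nfts.csv"      # hypothetical output file
    FIELDS = ["nft_url", "owner", "is_suspicious"]  # link, owner name, suspicious flag

    def load_scraped_urls():
        """Return the set of NFT URLs already in the CSV, used as a cache."""
        if not os.path.exists(CSV_PATH):
            return set()
        with open(CSV_PATH, newline="") as f:
            return {row["nft_url"] for row in csv.DictReader(f)}

    def append_batch(rows):
        """Append a batch of scraped rows so partial progress survives a crash."""
        write_header = not os.path.exists(CSV_PATH)
        with open(CSV_PATH, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            if write_header:
                writer.writeheader()
            writer.writerows(rows)

The scraper then only iterates over links not returned by load_scraped_urls and calls append_batch every 25 results, which is what makes resuming cheap.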

On the sleep time and retries

I've implemented two configurable parameters for this:

  1. The first was using requests' Retry. Thanks a lot for pointing this one out; it makes the scraper more robust and resilient. I've set it to retry each failed request a maximum of 3 times (including rate-limiting responses), with a backoff factor of 8, which in practice means waiting roughly 4/8/16 seconds across the 3 retries.

  2. If the scraper is still rate limited even after all these retries, there is a configurable sleep timer, which defaults to 30 seconds; it also retries 3 times before giving up on the scraping job. Both parameters are adjustable as follows (see the sketch after the commands below):

python suspicious.py -c <collection_address> -r 5 -s 50
python suspicious.py -c <collection_address> --retry 5 --sleep 50
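
Here is a sketch of how those two flags could drive the outer retry loop on top of the session-level Retry (the argument parsing and the helper name are hypothetical; only the -c/-r/-s flags and their defaults come from the description above):

    import argparse
    import time

    parser = argparse.ArgumentParser()
    parser.add_argument("-c", "--collection", required=True, help="collection contract address")
    parser.add_argument("-r", "--retry", type=int, default=3, help="outer retries when still rate limited")
    parser.add_argument("-s", "--sleep", type=int, default=30, help="seconds to sleep between outer retries")
    args = parser.parse_args()

    def get_with_sleep(session, nft_url):
        """Outer loop: if the session-level Retry is exhausted, sleep and try again."""
        for _ in range(args.retry):
            try:
                return session.get(nft_url)  # session already has urllib3 Retry mounted
            except Exception:
                time.sleep(args.sleep)
        return None  # give up on this URL after args.retry attempts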

I suggest you try it on the following collections, because there are still some limitations that we need to address:

python fair_drop/suspicious.py -c 0xe21ebcd28d37a67757b9bc7b290f4c4928a430b1  # The Saudis
python fair_drop/suspicious.py -c 0x78d61c684a992b0289bbfe58aaa2659f667907f8  # Superplastic: supergucci
python fair_drop/suspicious.py -c 0xb47e3cd837ddf8e4c57f05d70ab865de6e193bbb  # CryptoPunks
Barabazs commented 2 years ago

I (manually) tested the cache/retry and it seems to work as expected. :heavy_check_mark:

We have an example of multithreading with a Retry session in pulling.py that works fairly well for us.
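
For readers who haven't seen it, the general shape of that pattern is roughly the following (a sketch only, not the actual pulling.py code; names and values are made up):

    from concurrent.futures import ThreadPoolExecutor

    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.util.retry import Retry

    # One session with urllib3 Retry mounted, reused by all worker threads.
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=Retry(total=3, backoff_factor=2)))

    def fetch(url):
        """Fetch one NFT page and return its URL and HTTP status code."""
        return url, session.get(url).status_code

    urls = []  # NFT page URLs to scrape
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(fetch, urls))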