Maximax67 / LoRA-Dataset-Automaker

An advanced Jupyter Notebook for creating precise datasets tailored to Stable Diffusion LoRA training. Automates face detection, similarity analysis, and curation, with streamlined exporting, using cutting-edge models and functions.
MIT License

Failed to fetch the website #8

Closed ryuji99 closed 4 months ago

ryuji99 commented 6 months ago

(screenshot of the fetch error)

Maximax67 commented 5 months ago

It appears that Fancaps has implemented Cloudflare protection to prevent data scraping from their website. My recent fix does not resolve this; the site responds with a 'Forbidden' error. I'm currently exploring the possibility of embedding an HTML iframe in the Google Colab cell so that users can solve captchas if necessary, although I'm uncertain whether this approach will be effective. I'm still looking for a way to fix this.

ShinobiiSpartan commented 5 months ago

Yeah, I'm getting this as well.

padamix commented 5 months ago

@Maximax67 I managed to work around the issue on my local setup, but I altered the notebook quite a bit, so I'll just note some points that helped me bypass this problem.

Issue 1 -- Cloudflare

I bypassed this by opening the site in my browser and copying the User-Agent from the network tab along with the cf_clearance cookie. This method usually works with Cloudflare, but some sites also check the Referer and Host headers, so I added the cf_clearance cookie, the User-Agent header, and some other hard-codable headers to the request. Something like this:

# Headers for pages on fancaps.net; user_agent and cf_clearance are
# copied manually from a real browser session (network tab).
header = {
    'Host': 'fancaps.net',
    'User-Agent': user_agent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Connection': 'keep-alive',
    'Referer': 'https://fancaps.net/search.php'
}
# Same headers with the Host swapped for image downloads from cdni.fancaps.net.
dl_header = {
    'Host': 'cdni.fancaps.net',
    'User-Agent': user_agent,
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate, br, zstd',
    'Connection': 'keep-alive',
    'Referer': 'https://fancaps.net/search.php'
}
# Cloudflare clearance cookie copied from the same browser session.
cookies = {
    'cf_clearance': cf_clearance
}

I wired cf_clearance and user_agent into the notebook as manual string inputs. After that, I changed the calls to pass headers=header, cookies=cookies, for example requests.get(url, headers=header, cookies=cookies).

Issue 2 -- requests -> httpx

I could not get any of the requests working with Fancaps even after all of the above, so I swapped requests out for httpx. Some hints:

Not sure if both were needed or just one of them. Hope this helps.
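For context, a minimal sketch of what the requests-to-httpx swap might look like, reusing the header and cookies dicts from above; the http2=True flag is an assumption on my part (requests only speaks HTTP/1.1, and Cloudflare tends to be friendlier to HTTP/2 clients), not one of the original hints, and it requires installing httpx[http2]:

import httpx

# Reuse the header and cookies dicts defined earlier.
# http2=True requires the extra dependency: pip install httpx[http2]
client = httpx.Client(http2=True, headers=header, cookies=cookies,
                      follow_redirects=True)

response = client.get('https://fancaps.net/search.php')
response.raise_for_status()
html = response.text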

Maximax67 commented 5 months ago

Hi, I tried it locally and everything worked, thanks. This approach is better than what I was attempting. I will commit a version with the problem fixed in a few days. Thank you!

Maximax67 commented 5 months ago

I encountered a problem with this method. When I launched the notebook in Colab, it stopped working. The cf_clearance cookie is bound not only to the user agent but also to the IP address from which the request is made. Therefore everything works fine locally, but requests coming from Google Colab do not go through.

I wanted to try displaying a Selenium browser window under the code cell as an iframe in Google Colab, so that the user could pass the captcha themselves (or just wait for the redirect to Fancaps), and then use the resulting cf_clearance for httpx requests. But I did not find a way to do something like this.

Now I have discovered that the images themselves are not protected by Cloudflare: from the domain cdni.fancaps.net I can download images with requests.get(). I only need to bypass the Cloudflare protection to search for the anime we need and get the IDs of the images the user wants to download. I also found pages from which I can parse all the movies, TV series, and anime that have screencaps available on Fancaps:

https://fancaps.net/anime/showList.php
https://fancaps.net/tv/showList.php
https://fancaps.net/movies/showList.php

I also noticed that all the image IDs are sequential within one episode or movie. I can parse the Fancaps site and build a single JSON or CSV file in which every movie, and every episode of each TV series and anime, has two numbers: the ID of the first image and the ID of the last image. That way there is no need to bypass the Fancaps protection or send requests to fancaps.net at all; images can be downloaded by ID without problems, since cdni.fancaps.net is not protected.
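A sketch of the download step under that scheme; the cdni.fancaps.net URL pattern shown here is an assumption for illustration (the real path may differ), and first_id/last_id stand in for the two numbers stored per episode:

import requests

# ID range for one episode, as it would come from the parsed index.
first_id, last_id = 1000000, 1000050

for image_id in range(first_id, last_id + 1):
    # Assumed URL pattern; the actual cdni.fancaps.net path may differ.
    url = f'https://cdni.fancaps.net/file/fancaps-animeimages/{image_id}.jpg'
    response = requests.get(url)  # cdni.fancaps.net is not behind Cloudflare
    if response.ok:
        with open(f'{image_id}.jpg', 'wb') as f:
            f.write(response.content)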

padamix commented 5 months ago

@Maximax67 I could embed the Fancaps site and managed to list shows using iframes (though since I have mostly switched to a different source altogether, I only did a limited amount of testing).

I embedded fancaps with this:

# Render fancaps.net inside the notebook output; the browser itself
# makes the requests, so it can pass the Cloudflare challenge.
from IPython.display import IFrame
IFrame(src="https://fancaps.net", width=1500, height=960)

Running this cell will immediately make some requests from your browser, so you can see the user agent (it will still match your current browser), and the Fancaps cf_clearance cookie will be present under storage/cookies/"https://fancaps.net".

After this I patched back some of the original listing code to check it, and I got search results. You might not even need to set the cookies anymore, since the iframe and the Colab session are the same page, so they should be shared. You will still need the user agent, though.
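If the user agent is the only thing that still has to be carried over, one possible way to read it from inside Colab is google.colab.output.eval_js, which evaluates JavaScript in the connected browser tab (a suggestion, not something from this thread):

from google.colab import output

# Runs JavaScript in the browser tab connected to the notebook, so the
# returned string matches the browser that loaded the iframe above.
user_agent = output.eval_js('navigator.userAgent')
print(user_agent)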

Maximax67 commented 4 months ago

Sorry for taking so long. I parsed all of fancaps.net and uploaded the result to a separate repository as JSON files; Dataset Automaker now loads and parses them. The problem is solved.
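For illustration, a sketch of how such an ID-range index might be consumed; the file name and JSON layout here are hypothetical, not the actual format of the published files:

import json

# Hypothetical layout: {"Show Title": {"Episode 1": [first_id, last_id], ...}}
with open('anime_index.json') as f:
    index = json.load(f)

first_id, last_id = index['Show Title']['Episode 1']
image_ids = range(first_id, last_id + 1)  # IDs are sequential per episode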