dipu-bd / lightnovel-crawler

Generate and download e-books from online sources.
https://pypi.org/project/lightnovel-crawler/
GNU General Public License v3.0

Sources with problems #1605

Closed: idMysteries closed this issue 1 year ago

idMysteries commented 1 year ago

I looked at the log file and found a lot of broken sources.

Connection to wuxiaworld.co timed out, but m.wuxiaworld.co works. Should home_url be changed from wuxiaworld.co to m.wuxiaworld.co?

Connection to novelcrush.com timed out. The crawler needs to be moved to _down.

Connection to readnovelz.net timed out. The site does not redirect, and someone else has bought the domain. Move to _down.

Tests needed:

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Temporarily Unavailable for url: https://888novel.com/tim-kiem/?title=the&he_liet=yes&status=all

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Temporarily Unavailable for url: https://www.fanfiction.net/search/?keywords=the&type=story&match=title&ready=1&categoryid=202

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Temporarily Unavailable for url: https://www.fictionpress.com/search/?keywords=the&type=story&match=title&ready=1&categoryid=202

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Temporarily Unavailable for url: https://kissmanga.in/?s=the&post_type=wp-manga&author=&artist=&release=

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 503 Server Error: Service Temporarily Unavailable for url: https://novelgate.net/search/the

move to _down?
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://clicknovel.net/?s=the&post_type=wp-manga

SEARCH FIX NEEDED:
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://mtlreader.com/search

move to _down?
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 500 Server Error: Internal Server Error for url: https://wuxiaworld.io/search.ajax?type=&query=the

idMysteries commented 1 year ago

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://dragontea.ink/?s=the&post_type=wp-manga

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.lightnovelpub.com//search?title=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 451 Client Error: Unavailable For Legal Reasons for url: https://light-novel.online/search.ajax?query=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 451 Client Error: Unavailable For Legal Reasons for url: http://ww38.lightnovel.tv/?s=the&post_type=wp-manga&author=&artist=&release=

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://novelfullplus.com/ajax/search?q=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.novelpub.com//search?title=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://readnovelfull.com/search?keyword=th

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://wuxiaworld.live/search.ajax?type=&query=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://es.mtlnovel.com//wp-admin/admin-ajax.php?action=autosuggest&q=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://www.mywuxiaworld.com/search/result.html?searchkey=the

    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://docln.net//tim-kiem-nang-cao?title=the

idMysteries commented 1 year ago

2022-09-23 10:51:36,508 [DEBUG] (lncrawl.core.crawler)
HTTPSConnectionPool(host='truyentr.pro', port=443): Max retries exceeded with url: /?s=the (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000000016CFFEB0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed')) | Retrying...
2022-09-23 10:51:36,509 [DEBUG] (lncrawl.core.crawler)
[GET] https://truyentr.info/?s=the

truyentr.info redirects to truyentr.pro, which fails to resolve (getaddrinfo failed) -> error

idMysteries commented 1 year ago

HTTPSConnectionPool(host='asadatranslations.com', port=443): Max retries exceeded with url: /?s=the&post_type=wp-manga&author=&artist=&release= (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x0000000008F5E8B0>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed')) | Retrying...

jere344 commented 1 year ago

These sources use Cloudflare; perhaps that is the issue?

https://888novel.com/ -> Cloudflare issue
https://kissmanga.in/
https://clicknovel.net/
https://mtlreader.com/
https://wuxiaworld.io/
https://light-novel.online/
https://readnovelfull.com/
https://wuxiaworld.live/
https://es.mtlnovel.com/

Some also seem not to work because of the double slash (//) in the URL:

https://docln.net//tim-kiem-nang-cao?title=the -> error
https://docln.net/tim-kiem-nang-cao?title=the -> works
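
For illustration, here is a minimal sketch of one way to avoid the doubled slash when a base URL ending in `/` is combined with a path starting with `/`. The helper name `build_url` is hypothetical and is not taken from the lightnovel-crawler code base; how the crawler actually builds its search URLs may differ.

```python
def build_url(home_url: str, path: str) -> str:
    """Join a base URL and a path with exactly one slash between them.

    Hypothetical helper, for illustration only.
    """
    # Drop trailing slashes from the base and leading slashes from the path,
    # then put a single slash back between the two parts.
    return home_url.rstrip("/") + "/" + path.lstrip("/")


# Both parts carry a slash, which would otherwise produce "docln.net//tim-kiem...".
print(build_url("https://docln.net/", "/tim-kiem-nang-cao?title=the"))
```

With this normalisation the request goes to the single-slash form of the URL, which is the one reported as working above.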

dipu-bd commented 1 year ago

I do not know how many of these issues have been fixed. It is hard to track them when they are all posted under one issue. I am closing this for now. Please report source issues separately for each site.
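
As a rough aid for splitting a log like the one in this thread into per-site reports, the following sketch groups `requests` HTTPError lines by host. It assumes the log lines have the same shape as those quoted above; the function name `group_by_host` and the regular expression are illustrative assumptions, not part of lightnovel-crawler.

```python
import re
from urllib.parse import urlsplit

# Matches lines such as:
# requests.exceptions.HTTPError: 503 Server Error: ... for url: https://example/...
HTTP_ERROR = re.compile(
    r"HTTPError: (?P<status>\d{3}) (?:Client|Server) Error: .*? for url: (?P<url>\S+)"
)


def group_by_host(log_text: str) -> dict:
    """Map each failing host to the set of HTTP status codes seen for it."""
    failures = {}
    for match in HTTP_ERROR.finditer(log_text):
        host = urlsplit(match["url"]).hostname
        failures.setdefault(host, set()).add(int(match["status"]))
    return failures


sample = (
    "requests.exceptions.HTTPError: 503 Server Error: Service Temporarily "
    "Unavailable for url: https://novelgate.net/search/the\n"
    "requests.exceptions.HTTPError: 404 Client Error: Not Found "
    "for url: https://docln.net//tim-kiem-nang-cao?title=the\n"
)

for host, statuses in group_by_host(sample).items():
    print(host, sorted(statuses))
```

Each host in the result could then be reported as its own issue. Connection timeouts and DNS failures use a different message format and would need separate patterns.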