Hamuko / cum

comic updater, mangafied
Apache License 2.0

Add a scraper for mangadex.com #56

Closed mxnemu closed 6 years ago

mxnemu commented 6 years ago

Hello, this adds a scraper for mangadex.com.

I ran the test cases with:

python3 -m unittest discover -s ./tests/ -p 'test_scraper_mangadex.py'

and they are succeeding. I used follow and download manually for a few series, then looked at a bunch of archives with mcomix, and everything works as expected. When I try to run all tests (without the -p option), more than 100 tests fail, but I assume that's all attributable to batoto, since it is used in a lot of test cases. All of the testing was done on GNU/Linux.

There is a language field that I'm parsing, but I didn't want to hardcode a filter for English like it was in batoto, since I think this should be done through a config.
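A config-driven filter along those lines could look roughly like the sketch below. This is not cum's actual code: the function name `filter_chapters`, the dictionary layout of a chapter, and the `languages` setting are all hypothetical, chosen only to illustrate "keep everything unless a filter is configured".

```python
def filter_chapters(chapters, languages=None):
    """Return the chapters whose language is in `languages`.

    `chapters` is assumed (hypothetically) to be a list of dicts with a
    "language" key. If `languages` is None or empty, nothing is filtered.
    """
    if not languages:
        return list(chapters)
    # Compare case-insensitively so "english" and "English" both match.
    wanted = {lang.lower() for lang in languages}
    return [c for c in chapters if c["language"].lower() in wanted]


# Hypothetical sample data standing in for parsed chapter rows.
sample = [
    {"title": "Chapter 1", "language": "English"},
    {"title": "Chapter 1", "language": "German"},
]
english_only = filter_chapters(sample, languages=["english"])
everything = filter_chapters(sample)
```

With no `languages` configured, every chapter is kept; the filter only applies once the user sets one.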

I haven't written any Python code before and I'm looking forward to feedback.

mxnemu commented 6 years ago

I updated the test_series_aria test-case. The series_information_tester was failing, because someone uploaded a new extra chapter.

mxnemu commented 6 years ago

I updated it to use urljoin and urlparse, which seems like a good improvement over string concatenation. I also added the hard-coded English language filter. Personally I would expect that everything is downloaded unless I filter it, but it's alright if that's historically expected everywhere in the program, I guess.
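To illustrate why urljoin and urlparse are preferable to string concatenation, here is a minimal sketch using only the standard library. The helper names and the example paths (`/manga/123`, `/chapter/456`) are made up for illustration and are not taken from the scraper itself.

```python
from urllib.parse import urljoin, urlparse

# Assumed page URL, for illustration only.
PAGE_URL = "https://mangadex.com/manga/123"


def absolute_url(base, href):
    """Resolve an href found on a page against that page's URL."""
    return urljoin(base, href)


def last_path_segment(url):
    """Return the final path component of a URL, ignoring a trailing slash."""
    path = urlparse(url).path
    return path.rstrip("/").rsplit("/", 1)[-1]


# A root-relative href replaces the whole path of the base URL,
# which naive concatenation would get wrong.
joined = absolute_url(PAGE_URL, "/chapter/456")
concatenated = PAGE_URL + "/chapter/456"
chapter_id = last_path_segment(joined + "/")
```

Here `joined` is `https://mangadex.com/chapter/456`, while `concatenated` ends up as `https://mangadex.com/manga/123/chapter/456`, which points somewhere else entirely.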

Hamuko commented 6 years ago

Code-wise I'd say that this is good enough to merge now. I was planning on testing it out a bit more and then doing the merge, but…

Update (10-Feb) Notice: Using Free Manga Downloader to download images may result in your IP being banned because of the abnormal number of hits to the site, due to the way they have implemented their method of image downloading.

I spotted this notice at the top of MangaDex's website and it's a bit worrying, as I don't really want to have a scraper that leads to IP bans if you use it. It might be that threaded downloads are out of the question, since with default settings cum will download images in four threads. Their announcement doesn't seem to indicate that downloading images by automation is inherently bad, just that what FMD is doing is bad.
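For reference, capping concurrency per scraper could be sketched as below. This is only an illustration with the standard library's thread pool, not how cum actually schedules downloads; `download_all`, `fetch`, and `max_workers` are hypothetical names, and the fake fetcher stands in for a real HTTP request.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, max_workers=4):
    """Fetch every URL with at most `max_workers` concurrent threads.

    `fetch` is any callable taking a URL. Setting max_workers=1 makes
    the downloads effectively sequential. Results keep the input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real image request, so the sketch runs offline.
    return "fetched " + url


pages = ["page1.png", "page2.png", "page3.png"]
sequential = download_all(pages, fake_fetch, max_workers=1)
threaded = download_all(pages, fake_fetch, max_workers=4)
```

A scraper that needs to be gentler on a site could then pass a lower `max_workers` than the default.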

mxnemu commented 6 years ago

I've been downloading a bunch of stuff and I haven't managed to get banned yet. A few days ago I had some really slow response times, but I could reproduce them from a server with a different IP/location/user agent, and it seems better right now. Today I ran the test cases 6 times in succession. The first time, one download (146/150 during the hidamari test) got stuck and I aborted. The following runs all passed. I assume that mangadex.com is not completely stable yet and dropped some requests there. Each run downloads 200-300 pages and I think it uses the default thread pool, so I guess that bans aren't a problem right now.

Times were pretty consistent, so there also doesn't seem to be any kind of soft ban via slowdown:

Ran 9 tests in 148.367s
Ran 9 tests in 127.665s
Ran 9 tests in 142.812s
Ran 9 tests in 122.557s
Ran 9 tests in 141.010s

Hamuko commented 6 years ago

Apparently the banning is not done by MangaDex, but rather by Cloudflare. Seems like they found out about it after FMD users started reporting bans.

I'll look at merging this soon enough. I'll try to get everything done for v0.9 at once.