Add support for mangadex.com

mxnemu commented 6 years ago

Bato.to is shutting down and doki's new website mangadex.com seems to be positioning itself as the main replacement. RSS seems broken right now, so I would hold off on implementing the follow function, but scraping archives can probably be implemented already.

Hamuko commented 6 years ago

Seems like a worthy addition. I need to see if I can find some time to implement it. I also need to make an update that rips out all of the Batoto code. Most of it is broken anyways.

CounterPillow commented 6 years ago

Looked at the DOM a little.

RSS is indeed broken and just redirects to the main page, so no point in trying to extract data from that.

Series pages live at mangadex.com/manga/<mangaid>. They contain a (sadly id-less) table in which each chapter has an <a> tag pointing to the chapter, with the chapter name as the tag's contents. The name can include the volume number and chapter number as well as the chapter title, e.g. Vol. 3 Ch. 45 Foo Bar, but it can also consist of just the chapter number and the useless string "Read Online", e.g. Ch. 10 Read Online. The table also contains scanlation group info.
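As a rough sketch of how the chapter links could be pulled out of that table without relying on an id, the snippet below collects every <a> whose href contains /chapter/<id>. The sample HTML, the ChapterLinkParser name and the assumption that chapter ids are numeric are all made up for illustration and are not taken from the real page.

```python
import re
from html.parser import HTMLParser

# Assumption: chapter ids are numeric, like the manga ids seen so far.
CHAPTER_HREF = re.compile(r"/chapter/(\d+)")


class ChapterLinkParser(HTMLParser):
    """Collect (chapter_id, link_text) pairs from <a href=".../chapter/<id>"> tags."""

    def __init__(self):
        super().__init__()
        self.chapters = []
        self._current_id = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = CHAPTER_HREF.search(href)
        if match:
            self._current_id = match.group(1)
            self._text = []

    def handle_data(self, data):
        # Only buffer text while inside a chapter link.
        if self._current_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_id is not None:
            self.chapters.append((self._current_id, "".join(self._text).strip()))
            self._current_id = None


# Hypothetical markup, only loosely modelled on the description above.
sample = """
<table>
  <tr><td><a href="/chapter/101">Vol. 3 Ch. 45 Foo Bar</a></td><td>Some Group</td></tr>
  <tr><td><a href="/chapter/102">Ch. 10 Read Online</a></td><td>Other Group</td></tr>
</table>
"""

parser = ChapterLinkParser()
parser.feed(sample)
print(parser.chapters)
```

Matching on the href pattern rather than on the table itself sidesteps the missing id, at the cost of also picking up any chapter links elsewhere on the page.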

Pages for a chapter are of the URL mangadex.com/chapter/<chapterid>, where individual pages are of the URL mangadex.com/chapter/<chapterid>/<pagenum>. Clicking on the chapter appears to append the page number /1 to the URL in JS, but curling just the chapter URL without a page number appears to return the first page fine.

The <img> tag with the id "current_page" appears to be the current page image. The list of available pages can be read from the <select> with the attribute id="jump_page". The image src appears to point towards mangadex.com/data/<somehash>/<pagenum>.<ext>.
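A minimal sketch of reading those two elements with the standard library is below. The sample markup is invented, and it assumes the page number sits in each <option>'s value attribute, which I have not verified against the live site.

```python
from html.parser import HTMLParser


class ReaderPageParser(HTMLParser):
    """Grab the current image src and the page list from a chapter page."""

    def __init__(self):
        super().__init__()
        self.image_src = None
        self.page_numbers = []
        self._in_jump_select = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("id") == "current_page":
            self.image_src = attrs.get("src")
        elif tag == "select" and attrs.get("id") == "jump_page":
            self._in_jump_select = True
        elif tag == "option" and self._in_jump_select:
            # Assumption: the option's value attribute is the page number.
            value = attrs.get("value")
            if value:
                self.page_numbers.append(value)

    def handle_endtag(self, tag):
        if tag == "select":
            self._in_jump_select = False


# Hypothetical chapter page fragment.
sample = """
<img id="current_page" src="/data/abc123/1.png">
<select id="jump_page">
  <option value="1" selected>Page 1</option>
  <option value="2">Page 2</option>
  <option value="3">Page 3</option>
</select>
"""

reader = ReaderPageParser()
reader.feed(sample)
print(reader.image_src, reader.page_numbers)
```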

We can probably re-use some of the batoto code for this, e.g. the page URL prediction.
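To illustrate what page URL prediction could look like here (this is not the existing batoto code, just a sketch), the helper below takes the first image URL and a list of page numbers and swaps the number into the file name. predict_page_urls is a hypothetical name, and it assumes every page shares the first page's extension, which may well not hold given the <ext> in the URL scheme.

```python
import posixpath


def predict_page_urls(first_image_url, page_numbers):
    """Guess image URLs for all pages from the first page's image URL.

    Assumes the layout .../data/<somehash>/<pagenum>.<ext> and that the
    extension is the same for every page.
    """
    directory, filename = posixpath.split(first_image_url)
    _, ext = posixpath.splitext(filename)
    return ["{}/{}{}".format(directory, number, ext) for number in page_numbers]


urls = predict_page_urls("https://mangadex.com/data/abc123/1.png", ["1", "2", "3"])
for url in urls:
    print(url)
```

If extensions turn out to differ between pages, a fallback that fetches each chapter/<chapterid>/<pagenum> page and reads the real src would be needed.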

We may also want to use cloudflare-scrape, as the site is proxied by cloudflare. Sadly, that comes with node.js as a dependency, so maybe it should be optional.

mxnemu commented 6 years ago

I tried to edit the bato.to scraper myself and had some success. I was able to download all chapters for a manga with: python3 -m cum.cum get https://mangadex.com/manga/13805

Missing features & bugs are:

Two 22-byte garbage archives of the first downloaded chapter were generated.
I haven't implemented from_url yet.
It's not handling mirror servers yet.
It's not parsing groups yet.
It's not parsing volume numbers yet.
It still has some debug output and unused variables.
Obviously this will need more testing.
I removed the batoto scraper. This might cause db errors with code in cum/db.py, so some error handling code may be required there.

Except for being banned when using Tor, I didn't seem to get blocked by cloudflare, maybe they have some anti-bot settings disabled at the moment.

When I have some more time I'll make a clean patch and send a pull request.

Work in progress version: https://github.com/mxnemu/cum

CounterPillow commented 6 years ago

> Except for being banned when using Tor, I didn't seem to get blocked by cloudflare, maybe they have some anti-bot settings disabled at the moment.

Cloudflare will only give you trouble in one of two situations:

1. The site is in "I'm under attack" mode (IUAM), which can be automatically bypassed by executing some JS. Normally, websites should not ever be in this mode, unless they're having issues with a layer 7 DDoS.
2. Your IP has a bad reputation, which is common for Tor exit nodes, since they're mainly used for spam and vulnerability scans. This cannot be bypassed automatically.

We only really care about situation number 1, and since mangadex doesn't seem to have a need for IUAM, we don't have a need for bypassing it either. My mention was mainly for future reference.

mxnemu commented 6 years ago

@CounterPillow Thanks for the info, that's good to know. I hope node won't be required, but cloudflare really sucks.

I downloaded a bunch of stuff yesterday and it seems to work well. I'm writing some test cases now.

mxnemu commented 6 years ago

The pull request is here: https://github.com/Hamuko/cum/pull/56

I decided not to remove the batoto stuff for now, since I'm not sure what's happening at vatoto.com.

Hamuko commented 6 years ago

Currently in master, soon in v0.9.

Hamuko commented 6 years ago

@mxnemu Just a heads up that I found a bug in the scraper while I was migrating over my old Batoto follows. The regex did not support chapter versions. I actually got a decent way through before encountering one, but when one is encountered, it really breaks everything.

(venv) cum > cum new
konobi 2  3  4  5  6  7  8.5  9.1  9.2  10  11  12  13  14  15  16  17  18  19  19.5  20  21
       22  23  24  25  26  27  28  29  29.1  30  31  32  33  33.5  34  35  36  36.5  37  38  38.5
       39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  Vol. 1 Ch. 1
       v2 - Read Online  Vol. 2 Ch. 8 v2 - Read Online  Vol. 2 Ch. 9 v2 - Read Online

I added an optional non-capturing group for the version numbering in the commit 892f9281357863bbf91bdc68c081c6f09ec7ac5a.
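For anyone reading along, here is a rough illustration of the idea. It is not the regex from that commit, just a sketch of a title pattern with an optional non-capturing version group; TITLE_RE and parse_title are made-up names, and treating "Read Online" as an empty title is my own assumption.

```python
import re

TITLE_RE = re.compile(
    r"^(?:Vol\. (?P<volume>\d+) )?"       # optional "Vol. N "
    r"Ch\. (?P<chapter>\d+(?:\.\d+)?)"    # "Ch. N" or "Ch. N.M"
    r"(?: v\d+)?"                         # optional version, matched but not captured
    r"(?: - | )?"                         # separator before the title
    r"(?P<title>.*)$"
)


def parse_title(text):
    """Split a chapter link text into volume, chapter and title."""
    match = TITLE_RE.match(text.strip())
    if match is None:
        return None
    title = match.group("title").strip()
    if title in ("", "Read Online"):
        # Placeholder text carries no information, so drop it.
        title = None
    return {
        "volume": match.group("volume"),
        "chapter": match.group("chapter"),
        "title": title,
    }


for text in ("Vol. 3 Ch. 45 Foo Bar", "Ch. 10 Read Online", "Vol. 1 Ch. 1 v2 - Read Online"):
    print(text, "->", parse_title(text))
```

Because the version group is non-capturing, the v2 is swallowed without ending up in either the chapter number or the title.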

Here's an example of it being used: https://mangadex.com/manga/8819. I'm not sure if this actually follows the rules, because I don't see anything about version numbering on the upload page. However, I'm inclined to leave this in, at least for now, so that we get clean chapter numbers instead of the garbage that you can see above, because clearly at least someone is doing that.

This is probably due for a review later in case there is some standard for how version numbering is handled. I think I saw someone else add the version at the end of the chapter name, so we're probably going to have data all over the place, at least for the time being.

Hamuko / cum

Add support for mangadex.com #55