ccloli / E-Hentai-Downloader

Download E-Hentai archive as zip file
GNU General Public License v3.0

Download history + skipping of already downloaded images #177

Open TestPolygon opened 3 years ago

TestPolygon commented 3 years ago

I would like to have an option to keep a download history, so that each image is downloaded only once. An image should not be downloaded again when I download another gallery that contains it (even if the image has a different name there).

Each image has an ID (the first 5 bytes of its SHA-1, i.e. 10 hex characters), which you can see in URLs: .../s/{image-id}/{gallery-id}-{position}. Identical images have the same {image-id}; only {gallery-id} and {position} differ. (Note: the file name may differ too.)
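
To illustrate the URL layout, here is a minimal sketch that splits an image page URL into its parts. The function name `parseImagePageUrl` is hypothetical and not part of the userscript; the regular expression only assumes the `/s/{image-id}/{gallery-id}-{position}` shape described above, with a 10-character hex ID.

```javascript
// Hypothetical helper, not taken from the userscript.
// Assumes the .../s/{image-id}/{gallery-id}-{position} layout.
function parseImagePageUrl(url) {
  const match = /\/s\/([0-9a-f]{10})\/(\d+)-(\d+)/.exec(url);
  if (!match) return null; // not an image page URL
  return {
    imageId: match[1],            // first 10 hex characters of the SHA-1
    galleryId: Number(match[2]),
    position: Number(match[3]),
  };
}

const parsed = parseImagePageUrl('https://e-hentai.org/s/e91885fc7e/1809948-57');
console.log(parsed.imageId, parsed.galleryId, parsed.position);
```

Two URLs that point to the same image in different galleries would then yield the same `imageId` but different `galleryId` and `position` values.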

You could store the IDs of already downloaded images and add an option that, when enabled, makes the userscript download only images that were not downloaded before.
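
A rough sketch of what such a history could look like, assuming each image is described by an object with an `imageId` field. The class `DownloadHistory` and its methods are hypothetical names for illustration; persistence (for example in `localStorage`) is left out and would need to be added around `serialize()`.

```javascript
// Hypothetical in-memory download history keyed by image ID.
class DownloadHistory {
  constructor(ids = []) {
    this.ids = new Set(ids);
  }

  has(imageId) {
    return this.ids.has(imageId);
  }

  add(imageId) {
    this.ids.add(imageId);
  }

  // Split a gallery's image list into images to fetch and images to skip.
  // The history itself is not modified here; call add() after a successful download.
  partition(images) {
    const toDownload = [];
    const skipped = [];
    for (const image of images) {
      (this.has(image.imageId) ? skipped : toDownload).push(image);
    }
    return { toDownload, skipped };
  }

  // JSON form of the stored IDs, suitable for saving between sessions.
  serialize() {
    return JSON.stringify([...this.ids]);
  }
}

const history = new DownloadHistory(['e91885fc7e']);
const { toDownload, skipped } = history.partition([
  { imageId: 'e91885fc7e', position: 1 },
  { imageId: '0123456789', position: 2 },
]);
console.log(toDownload.length, skipped.length);
```

The `skipped` list keeps the full image objects, so their names and positions are still available for reporting.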

Also, the skipped images should still be noted in info.txt (as they are now, just without downloading them). There is no need to fetch the image page for them: the file name, position and image ID are all visible on the gallery page.

TestPolygon commented 3 years ago

It would be very useful for people who want to download all images under some author tag: no unnecessary requests, less downloaded data (because there are no duplicates), and easier downloading of gallery updates.

TestPolygon commented 3 years ago

You could also mark already downloaded images in galleries with a colored border. I think that would be useful too.

ccloli commented 3 years ago

Yes, the ID is the first 10 letters of the file's SHA-1, but it may not be accurate enough. I just noticed that the thumbnail URL contains the full SHA-1.

For example, the thumbnail url of https://e-hentai.org/s/e91885fc7e/1809948-57 is https://ehgt.org/t/e9/18/e91885fc7ec0212a975ef371d9c9816c3f3807b0-768879-884-1250-jpg_l.jpg.

The first 10 letters of the file name part of the thumbnail URL exactly match the SHA-1 part of the page URL (e91885fc7e), so we can determine that the format of the thumbnail URL is https://${domain}/t/${sha1.substr(0,2)}/${sha1.substr(2,2)}/${sha1}-${size}-${width}-${height}-${format}_${thumbsize}.jpg.


Forget it, it only works if you use large thumbnails. If you're using small thumbnails, the thumbnail url is a sprite image.
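
For the large-thumbnail case, a minimal sketch of extracting the full SHA-1 from a thumbnail URL could look like this. The function name `parseThumbnailUrl` is hypothetical, and the pattern is only inferred from the single example URL above, so other formats may not match. It returns `null` for anything that does not fit, which would include sprite URLs.

```javascript
// Hypothetical helper; the pattern is inferred from one example URL only.
function parseThumbnailUrl(url) {
  const re = /\/t\/([0-9a-f]{2})\/([0-9a-f]{2})\/([0-9a-f]{40})-(\d+)-(\d+)-(\d+)-([a-z0-9]+)_([a-z0-9]+)\.jpg/;
  const m = re.exec(url);
  if (!m) return null; // e.g. a sprite image used for small thumbnails
  const sha1 = m[3];
  // The two directory parts should repeat the first four hex characters.
  if (m[1] !== sha1.slice(0, 2) || m[2] !== sha1.slice(2, 4)) return null;
  return {
    sha1,
    size: Number(m[4]),
    width: Number(m[5]),
    height: Number(m[6]),
    format: m[7],
    thumbSize: m[8],
  };
}

const thumb = parseThumbnailUrl(
  'https://ehgt.org/t/e9/18/e91885fc7ec0212a975ef371d9c9816c3f3807b0-768879-884-1250-jpg_l.jpg'
);
// The page URL ID should be a prefix of the full hash.
console.log(thumb.sha1.startsWith('e91885fc7e'));
```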

TestPolygon commented 3 years ago

I don't know how the site handles the situation when different images have hashes with the same first five bytes. Maybe it assigns a unique ID, maybe not.

It does not seem critical to work with only the first 5 bytes (10 hex characters, or 40 bits). 5 bytes are enough to give IDs to 1099511627776 items (2**40, or 16**10, or 256**5). I'm not sure whether the birthday attack applies to this case, but if it does, collisions become likely at around 2**20 = 1048576 images (the square root of 2**40): not great, not terrible.
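
The birthday estimate can be checked with a few lines of arithmetic. This sketch uses the standard approximation p ≈ 1 - exp(-n(n-1)/(2N)) and assumes the IDs behave like uniformly random 40-bit values, which is an assumption about the site, not something confirmed here. The function name `collisionProbability` is made up for the example.

```javascript
// Approximate probability that at least two of n items share an ID,
// assuming IDs are uniformly random over `space` possible values.
function collisionProbability(n, space) {
  // Math.expm1 keeps precision when the exponent is close to zero.
  return -Math.expm1(-(n * (n - 1)) / (2 * space));
}

const space = 2 ** 40; // 5 bytes = 40 bits
console.log(collisionProbability(2 ** 20, space)); // history of ~1 million images
console.log(collisionProbability(100000, space));  // history of 100 thousand images
```

With 2**20 stored IDs the chance of at least one collision somewhere in the history is roughly 39%, while for much smaller histories it drops off quickly.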

But it looks OK if this feature works only with "Large" (not "Normal") thumbnails enabled, in case you want to store the full hashes.