elvis972602 / Kemono-scraper

A simple downloader to download media from kemono.party
MIT License

Download posts from different periods #31

Open 1223334444abc opened 9 months ago

1223334444abc commented 9 months ago

https://kemono.party/fanbox/user/2557134/post/5773097

Kemono seems to offer posts from different periods. As shown in the link above, attachments have been removed in newer versions. I would like to download all versions to ensure file integrity, and to fill in the historical versions without affecting my existing image database. Is there a unique numerical identifier or tag that distinguishes versions?

Downloading only the latest or only the oldest posts is not appropriate, as some authors prefer to add new content while others prefer to delete content. It is necessary to save all versions, but different versions of posts may have the same name.

I have been thinking for a long time but haven’t come up with a good naming solution (without affecting the current database content, especially the folder hierarchy).

The current folder and file structure:

[fanbox]noeyebrow
        [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
                0.jpg
                content.html

1223334444abc commented 9 months ago

When encountering versions with different titles, download all of them. When encountering versions with the same title, download the version with the most files and images. This seems to be a rather crude but feasible solution.

This solution ensures downloading all files and maintaining the original directory structure, but ignores the possibility of the creator modifying images (such as adding overlay modifications).

[fanbox]noeyebrow
        [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
                0.jpg
                content.html
        [20230421] [5773097] 🚀「明日もな」原寸JPG配布開始
                0.jpg
                1.jpg
                202304.zip
                content.html

(Ignore second version↑)
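The rule proposed above (keep every distinct title, and for versions sharing a title keep the one with the most files) could be sketched roughly as follows. The dictionary keys `title` and `files` are hypothetical names chosen for illustration, not fields confirmed by Kemono or by this tool.

```python
from collections import defaultdict


def pick_versions(versions):
    """Keep one version per distinct title: the one with the most files."""
    by_title = defaultdict(list)
    for version in versions:
        by_title[version["title"]].append(version)
    # max() returns the first maximal item, so ties keep the earliest-listed version.
    return [max(group, key=lambda v: len(v["files"])) for group in by_title.values()]


# Hypothetical sample data mirroring the folder listing above.
versions = [
    {"title": "原寸JPG配布終了", "files": ["0.jpg"]},
    {"title": "原寸JPG配布開始", "files": ["0.jpg", "1.jpg", "202304.zip"]},
    {"title": "原寸JPG配布開始", "files": ["0.jpg"]},
]
kept = pick_versions(versions)
```

As noted above, a file-count comparison like this cannot detect a creator replacing an image with a modified one of the same name.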

elvis972602 commented 9 months ago

This seems a bit tricky. It would be nice to keep the current file structure, but I'm not sure how revisions work: whether the current version is likely to be replaced by a newer one and, if so, how it would be named.

elvis972602 commented 9 months ago

There is another problem: this repository relies on API endpoints to fetch articles and images, but the API doesn't seem to expose revisions; I can only get the current version. This is a fundamental limitation, similar to my inability to offer a comment download feature. If it remains unresolved, it may be necessary to refactor and switch to web scraping.

1223334444abc commented 9 months ago

It seems that the different versions were added only recently, so maybe the API will be updated too? I raised the issue of ‘many creators like to delete content’ with them last month. It seems that the current version is replaced on each update, which makes it really hard to decide on the naming.

If each version (including the current version) has a unique identifier, maybe the version number can be added to the post ID? (But this would require rebuilding the entire local database.)

[image]

elvis972602 commented 9 months ago

<3>(current) and <2> might be the same? maybe these revision id can be used as identifier. Anyway, I think a change in the file structure is inevitable.

1223334444abc commented 9 months ago

In fact, I didn’t see any difference between many versions. If each version has a unique revision ID, I suggest handling it this way:

[fanbox]noeyebrow
        [20230421] [5773097] [7968605] 🚀「明日もな」原寸JPG配布終了
                0.jpg
                content.html
        [20230421] [5773097] [5194635] 🚀「明日もな」原寸JPG配布開始
                0.jpg
                1.jpg
                202304.zip
                content.html

If the revision ID is added to the folder name, I hope that the local files do not need to be downloaded again.

If a folder level is added, some of my image organization and extraction tools that depend on the hierarchy will stop working. (Also, would the extra level apply to posts that have only a single version?) I think this approach may not be good:

[fanbox]noeyebrow
        [20230421] [5773097] [7968605] 🚀「明日もな」原寸JPG配布終了
                [7968605] 🚀「明日もな」原寸JPG配布開始
                        0.jpg
                        content.html
                [5194635] 🚀「明日もな」原寸JPG配布開始
                        0.jpg
                        1.jpg
                        202304.zip
                        content.html

The question is whether the current version also has a unique identifier.

1223334444abc commented 9 months ago

If feasible, perhaps the version without the revision ID can be used as the current version (updated if different from the website), and each old version can be downloaded separately in a folder with a revision ID.

[fanbox]noeyebrow
        [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
                0.jpg
                content.html
        [20230421] [5773097] [5194635] 🚀「明日もな」原寸JPG配布開始
                0.jpg
                1.jpg
                202304.zip
                content.html

By doing this, I can preserve the local data as much as possible. In addition, you can add a switch to download the historical versions.
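A minimal sketch of that naming scheme, assuming a hypothetical helper `folder_name` (not part of Kemono-scraper): the current version keeps the existing `[date] [post_id] title` form, and a historical version gets its revision ID inserted before the title.

```python
def folder_name(published, post_id, title, revision_id=None):
    """Build a post folder name; only historical versions carry a revision ID."""
    parts = [f"[{published}]", f"[{post_id}]"]
    if revision_id is not None:
        # Historical version: add the revision ID so names cannot collide.
        parts.append(f"[{revision_id}]")
    parts.append(title)
    return " ".join(parts)


current = folder_name("20230421", 5773097, "🚀「明日もな」原寸JPG配布終了")
historical = folder_name("20230421", 5773097, "🚀「明日もな」原寸JPG配布開始", revision_id=5194635)
```

Because the current version's name is unchanged, folders that were already downloaded would still match and would not need to be fetched again.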

elvis972602 commented 9 months ago

The revision ID should be available as an option in the template for the user to decide. There may still be some unavoidable changes, but this would be the most flexible.

1223334444abc commented 9 months ago

Now there are some questions: is the current version necessarily consistent with the previous version displayed (I haven’t found an actual example yet)? Does the current version have a revision ID? And do posts with only one version have a revision ID? If the revision ID is not used as the standard, there may be a problem with duplicate names when downloading historical versions.

https://kemono.party/fanbox/user/12707319/post/3378974 (I found an example with more versions)

1223334444abc commented 9 months ago

If the revision ID is used as a template, it seems to be a feasible solution to use a different naming scheme for the current and historical versions!

1223334444abc commented 8 months ago

https://kemono.party/api/swagger_schema

/{service}/user/{creator_id}/post/{post_id}/revisions List a Post's Revisions

This seems to be the solution to the problem. (Once again, it is suggested to retain the existing local database as much as possible without re-downloading all content, just as mentioned earlier, by using different naming schemes for the current version and historical versions.)
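For illustration, a rough sketch of calling that endpoint from Python. The `https://kemono.su/api/v1` base URL is an assumption taken from links shared in this thread, the function names are hypothetical, and the response is assumed to be a JSON array of revision objects.

```python
import json
import urllib.request

# Assumption: base URL as it appears in API links posted in this thread.
API_BASE = "https://kemono.su/api/v1"


def revisions_url(service, creator_id, post_id):
    """Build the URL for the 'List a Post's Revisions' endpoint."""
    return f"{API_BASE}/{service}/user/{creator_id}/post/{post_id}/revisions"


def fetch_revisions(service, creator_id, post_id, timeout=30):
    """Fetch and decode the revisions list (assumed to be a JSON array)."""
    url = revisions_url(service, creator_id, post_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

For the post linked at the top of this issue, `revisions_url("fanbox", 2557134, 5773097)` yields the revisions URL for that post; error handling and rate limiting are left out of this sketch.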

mikomikorin commented 8 months ago

I want this feature too, and once it is implemented I would like a download option to choose between the current revision only and all revisions.

1223334444abc commented 7 months ago

I’ve been waiting for this new feature with bated breath!

mikomikorin commented 7 months ago

Here is how I think different periods could be saved while skipping duplicate revisions (in my opinion):

  1. Compare the edited timestamps.
  2. Otherwise, if the edited timestamp is unknown, compare the content, the number of files and attachments, and the paths of the file and attachments. Revisions are duplicates if all of these are the same.

Then save the oldest (lowest) revision_id of each set of duplicate revisions into the output path.

This is the revisions API link for the first post of this issue: https://kemono.su/api/v1/fanbox/user/2557134/post/5773097/revisions (if this feature is implemented, it would probably save revision_id 5194635 and 5548771).

Also save the current period when it differs from the last revision (or when the revisions API returns nothing). It has no revision_id, so use "current" where the revision_id would go in the path. https://kemono.su/api/v1/fanbox/user/2557134/post/5773097
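The content-based comparison in step 2 could look roughly like this. It is a sketch under assumptions: each revision is taken to be a dict with `revision_id`, `content`, `file` (an object with a `path`), and `attachments` (a list of objects with a `path`); these field names are not verified against the live API here.

```python
def signature(revision):
    """Hashable fingerprint of the fields compared when looking for duplicates."""
    file_path = (revision.get("file") or {}).get("path")
    attachment_paths = tuple(a.get("path") for a in revision.get("attachments") or [])
    # Comparing the path tuples also covers the "number of attachments" check.
    return (revision.get("content"), file_path, attachment_paths)


def dedupe_revisions(revisions):
    """Keep the lowest revision_id among revisions with identical signatures."""
    kept = {}
    for revision in sorted(revisions, key=lambda r: r["revision_id"]):
        # setdefault keeps the first (lowest-ID) revision seen for each signature.
        kept.setdefault(signature(revision), revision)
    return list(kept.values())
```

The result is ordered by ascending revision_id, so the last entry can be compared against the current post to decide whether the current period needs to be saved separately.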

1223334444abc commented 7 months ago

[fanbox]noeyebrow
        [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
                0.jpg
                content.html
        [20230421] [5773097] [5194635] 🚀「明日もな」原寸JPG配布開始
                0.jpg
                1.jpg
                202304.zip
                content.html
        [20230421] [5773097] [5999999] 🚀「明日もな」原寸JPG配布開始
                0.jpg
                1.jpg
                2.jpg
                202304.zip
                content.html

In my opinion, when downloading data, the local files can be compared with the current version on the site. If they differ, the local copy should be overwritten with the current version. Additionally, when ‘download historical versions’ is selected, download the versions other than the current one and include a ‘revision ID’ in the folder name (to provide distinct naming schemes for the current and historical versions).

This download scheme aims to minimize re-downloading existing local data and incorporate historical versions into the database.

https://kemono.party/api/swagger_schema

/{service}/user/{creator_id}/post/{post_id}/revisions List a Post's Revisions

revision api

UnprofessionalProfessional commented 7 months ago

@1223334444abc To reduce wear on SSDs, files could be stored in memory before being written to disk. While in memory, the cached file could be hashed, and that could be compared against what's already on disk (by file name). But this approach would need a configuration option for the maximum size to be cached in memory. When the cache is full, its contents would be written to disk, and the downloader would write to the file instead. There could also be an index file in every directory containing hashes of files (and their names) to save time.
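A simplified sketch of the hash-before-write idea, using SHA-256 from the Python standard library. The helper names are hypothetical, and the suggested cache size limit and per-directory index file are deliberately left out: the downloaded bytes are assumed to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_bytes(data: bytes) -> str:
    """Hash a payload that is still held in memory."""
    return hashlib.sha256(data).hexdigest()


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file on disk in chunks so large files are not loaded at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_if_changed(path: Path, data: bytes) -> bool:
    """Write data only if the file is missing or differs. Returns True if written."""
    if path.exists() and sha256_file(path) == sha256_bytes(data):
        return False  # identical content already on disk, skip the write
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return True
```

This avoids rewriting unchanged files, but the file still has to be downloaded before it can be hashed, so it saves disk writes rather than bandwidth.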

1223334444abc commented 7 months ago

@055642 I am just a user; please @ the author of this repository. That said, I think SSD lifespans are long enough nowadays that there is no need to worry about wear. Only the reliability and performance of the file comparison method need to be considered.

Linden10 commented 6 months ago

I see this issue hasn't been resolved... Kemono shows the revision in the URL like so: https://kemono.su/fanbox/user/13222022/post/4427620/revision/2502520

I was trying to download a past revision of a post (using the URL above as an example), but when I try to download it with Kemono Scraper, I get this error:

D:\> .\kemono-scraper.exe --link https://kemono.su/fanbox/user/13222022/post/4427620/revision/2502520

2023/12/14 20:13:08 Error splitting host component:[ fanbox user 13222022 post 4427620 revision 2502520] 8

I'm not sure when this issue will be resolved or how to do it now temporarily but I just wanted to mention this problem atm. Hopefully revision support is added soon!
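For reference, a post URL with an optional `/revision/{id}` suffix can be split along these lines. This is an independent Python sketch, not how Kemono-scraper parses links; the function name and the returned keys are hypothetical.

```python
import re
from urllib.parse import urlparse

# Matches /{service}/user/{creator_id}/post/{post_id} with an optional
# /revision/{revision_id} suffix and an optional trailing slash.
POST_PATH = re.compile(
    r"^/(?P<service>[^/]+)/user/(?P<creator_id>[^/]+)/post/(?P<post_id>[^/]+)"
    r"(?:/revision/(?P<revision_id>[^/]+))?/?$"
)


def parse_post_url(url):
    """Return the URL's components; revision_id is None for a plain post URL."""
    match = POST_PATH.match(urlparse(url).path)
    if match is None:
        raise ValueError(f"not a post URL: {url}")
    return match.groupdict()


parts = parse_post_url("https://kemono.su/fanbox/user/13222022/post/4427620/revision/2502520")
```

Treating the revision segment as optional lets the same parser accept both plain post links and revision links.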

1223334444abc commented 6 months ago

There is a new finding on this issue. When fetching ‘revisions’ from the Kemono API, many historical versions are returned, but versions with the same editing time are actually identical. We need to group them by editing time, treat the latest group as the ‘current version’, and derive the ‘historical versions’ from the remaining groups (taking the earliest entry in each group to obtain its ‘revision_id’), so that the results match the Kemono webpage.

https://kemono.su/fanbox/user/310609/post/3697579/ https://kemono.su/api/v1/fanbox/user/310609/post/3697579/revisions
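The grouping described above might be sketched as follows. Assumptions: every revision is a dict with a non-null `edited` timestamp in a uniformly formatted string that sorts chronologically (such as ISO 8601), and "earliest in each group" is taken to mean the lowest `revision_id`.

```python
from itertools import groupby


def collapse_by_edited(revisions):
    """Return one revision per distinct 'edited' value, oldest edit first."""
    # Sort by edit time, then by revision_id so the lowest ID leads each group.
    ordered = sorted(revisions, key=lambda r: (r["edited"], r["revision_id"]))
    return [next(group) for _, group in groupby(ordered, key=lambda r: r["edited"])]


# Hypothetical sample data for illustration.
sample = [
    {"revision_id": 30, "edited": "2023-05-02T00:00:00"},
    {"revision_id": 10, "edited": "2023-04-21T00:00:00"},
    {"revision_id": 20, "edited": "2023-04-21T00:00:00"},
    {"revision_id": 40, "edited": "2023-05-02T00:00:00"},
]
collapsed = collapse_by_edited(sample)
```

The last element of the result corresponds to the ‘current version’ and the earlier ones to the ‘historical versions’. The sketch does not handle revisions where only some `edited` values are missing.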

I hope this feature can be added as soon as possible. Thank you very much.

——————————————

Oh no, it seems that some services do not have an ‘edited’ time.

https://kemono.su/patreon/user/3295915/post/88413981 https://kemono.su/api/v1/patreon/user/3295915/post/88413981/revisions

Now it seems that downloading can only be done by comparing the content.