Open 1223334444abc opened 9 months ago
When encountering versions with different titles, download all of them. When encountering versions with the same title, download the version with the most files and images. This seems to be a rather crude but feasible solution.
This solution ensures downloading all files and maintaining the original directory structure, but ignores the possibility of the creator modifying images (such as adding overlay modifications).
[fanbox]noeyebrow
    [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
        0.jpg
        content.html
    [20230421] [5773097] 🚀「明日もな」原寸JPG配布開始
        0.jpg
        1.jpg
        202304.zip
        content.html
(Ignore second version↑)
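The rule described above (keep every distinct title; among versions sharing a title, keep the one with the most files) could be sketched roughly as follows. The dictionary keys `title` and `files` are hypothetical and are not taken from the Kemono API or from this repository.

```python
def pick_versions(versions):
    """Keep one version per title: the one with the most files.

    `versions` is a list of dicts with the hypothetical keys "title" and
    "files". Versions with different titles are all kept.
    """
    best = {}
    for version in versions:
        kept = best.get(version["title"])
        # On a tie the version seen first wins.
        if kept is None or len(version["files"]) > len(kept["files"]):
            best[version["title"]] = version
    return list(best.values())


versions = [
    {"title": "配布終了", "files": ["0.jpg"]},
    {"title": "配布開始", "files": ["0.jpg", "1.jpg", "202304.zip"]},
    {"title": "配布開始", "files": ["0.jpg"]},
]
for version in pick_versions(versions):
    print(version["title"], len(version["files"]))
```

As noted above, counting files cannot detect a creator replacing an image with a modified one, so this is only a heuristic.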
This seems a bit tricky. It would be nice to keep the current file structure, but I'm not sure how versioning works: whether the current version is likely to be replaced by a newer one and, if so, how it would be named.
There is another problem: this repository relies on API endpoints to fetch articles and images, but the versions feature doesn't seem to be exposed through them. I can only get the current version. This is a fundamental problem, similar to my inability to offer a comment download feature. If the issue remains unresolved, it may be necessary to refactor and switch to web scraping as a solution.
It seems that versions were added recently, so maybe the API will be updated too? I raised the issue of ‘many creators like to delete content’ with them last month. It seems that the current version will be replaced on each update, so it is really hard to decide on the naming.
If each version (including the current version) has a unique identifier, maybe the version number can be added to the post ID? (But this would require rebuilding the entire local database.)
<3> (current) and <2> might be the same? Maybe these revision IDs can be used as identifiers. Anyway, I think a change in the file structure is inevitable.
In fact, I didn’t see any difference between many versions. If each version has a unique revision ID, I suggest handling it this way:
[fanbox]noeyebrow
    [20230421] [5773097] [7968605] 🚀「明日もな」原寸JPG配布終了
        0.jpg
        content.html
    [20230421] [5773097] [5194635] 🚀「明日もな」原寸JPG配布開始
        0.jpg
        1.jpg
        202304.zip
        content.html
If the revision ID is added to the folder name, I hope that the local files do not need to be downloaded again.
If an extra folder level is added, some of my image organization and extraction tools that rely on the hierarchy will stop working. (Also, would the extra level apply to posts that have only a single version?) I think this approach may not be good:
[fanbox]noeyebrow
    [20230421] [5773097] [7968605] 🚀「明日もな」原寸JPG配布終了
        [7968605] 🚀「明日もな」原寸JPG配布開始
            0.jpg
            content.html
        [5194635] 🚀「明日もな」原寸JPG配布開始
            0.jpg
            1.jpg
            202304.zip
            content.html
The problem is whether the current version also has a unique identifier?
If feasible, perhaps the version without the revision ID can be used as the current version (updated if different from the website), and each old version can be downloaded separately in a folder with a revision ID.
[fanbox]noeyebrow
    [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
        0.jpg
        content.html
    [20230421] [5773097] [5194635] 🚀「明日もな」原寸JPG配布開始
        0.jpg
        1.jpg
        202304.zip
        content.html
By doing this, I can preserve the local data as much as possible. In addition, you can add a switch to download the historical versions.
The revision ID should be available as an option in the template for the user to decide. There may still be some unavoidable changes, but it would be the most flexible.
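The naming scheme discussed above, where the current version keeps the existing folder name and each historical version gets an extra `[revision_id]` segment, might look like this minimal sketch. The function name `folder_name` and its parameters are hypothetical and do not reflect this repository's actual template engine.

```python
def folder_name(date, post_id, title, revision_id=None):
    """Build a post folder name in the layout used in this thread.

    The current version (revision_id=None) keeps the existing
    "[date] [post_id] title" form, so already downloaded folders
    keep their names. Historical versions add "[revision_id]".
    """
    parts = [f"[{date}]", f"[{post_id}]"]
    if revision_id is not None:
        parts.append(f"[{revision_id}]")
    parts.append(title)
    return " ".join(parts)


print(folder_name("20230421", 5773097, "🚀「明日もな」原寸JPG配布終了"))
print(folder_name("20230421", 5773097, "🚀「明日もな」原寸JPG配布開始", revision_id=5194635))
```

Because the current version's name is unchanged, existing local folders would not need to be renamed or downloaded again under this scheme.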
Now there are some open questions: is the current version necessarily identical to the latest previous version displayed (I haven’t found an actual example yet), and does the current version have a revision ID? Do posts with only one version have a revision ID? And if the revision ID is not used as the standard, there may be a problem with duplicate names when downloading historical versions.
https://kemono.party/fanbox/user/12707319/post/3378974 (I found an example with more versions)
If the revision ID is used as a template, it seems to be a feasible solution to use a different naming scheme for the current and historical versions!
https://kemono.party/api/swagger_schema
/{service}/user/{creator_id}/post/{post_id}/revisions List a Post's Revisions
This seems to be the solution to the problem. (Once again, it is suggested to retain the existing local database as much as possible without re-downloading all content, just as mentioned earlier, by using different naming schemes for the current version and historical versions.)
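For illustration, the endpoint path quoted above can be turned into concrete URLs like this. The base URL `https://kemono.su/api/v1` is taken from links posted elsewhere in this thread; the function names are hypothetical, and nothing here is this repository's actual code.

```python
# Base URL as it appears in API links posted in this thread.
API_BASE = "https://kemono.su/api/v1"


def post_url(service, creator_id, post_id, base=API_BASE):
    """URL of the current version of a post."""
    return f"{base}/{service}/user/{creator_id}/post/{post_id}"


def revisions_url(service, creator_id, post_id, base=API_BASE):
    """URL of the "List a Post's Revisions" endpoint for the same post."""
    return post_url(service, creator_id, post_id, base) + "/revisions"


print(post_url("fanbox", 2557134, 5773097))
print(revisions_url("fanbox", 2557134, 5773097))
```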
I want this feature too, and once it is implemented I would like a download option to choose between the current revision only and all revisions.
I’ve been waiting for this new feature with bated breath!
My opinion on how to save the different periods: skip duplicate revisions,
and for each set of duplicates save the one with the older (lower) revision_id into the output path.
This is the revisions API link for the post from the first comment of this issue: https://kemono.su/api/v1/fanbox/user/2557134/post/5773097/revisions. If this feature were implemented, it would probably save revision_id 5194635 and 5548771.
The current period would also be saved when it differs from the last revision, or when the revisions API returns nothing. It has no revision_id, so `current` can be used in its place when the revision_id is part of the path. https://kemono.su/api/v1/fanbox/user/2557134/post/5773097
[fanbox]noeyebrow
    [20230421] [5773097] 🚀「明日もな」原寸JPG配布終了
        0.jpg
        content.html
    [20230421] [5773097] [5194635] 🚀「明日もな」原寸JPG配布開始
        0.jpg
        1.jpg
        202304.zip
        content.html
    [20230421] [5773097] [5999999] 🚀「明日もな」原寸JPG配布開始
        0.jpg
        1.jpg
        2.jpg
        202304.zip
        content.html
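The selection rule proposed in this comment could be sketched as below. It assumes each revision has already been reduced to a `(revision_id, fingerprint)` pair, where the fingerprint is any value that is equal for identical content; that pairing, the function name `plan_saves`, and reading "last revision" as the one with the highest revision_id are all assumptions made for illustration.

```python
def plan_saves(revisions, current_fingerprint):
    """Return the labels of the versions that would be saved.

    revisions: list of (revision_id, fingerprint) pairs (hypothetical shape).
    Duplicates are collapsed to their lowest revision_id. The current version
    is labelled "current" and is saved only when it differs from the revision
    with the highest revision_id, or when there are no revisions at all.
    """
    lowest = {}
    for revision_id, fp in revisions:
        if fp not in lowest or revision_id < lowest[fp]:
            lowest[fp] = revision_id
    labels = [str(revision_id) for revision_id in sorted(lowest.values())]

    last_fp = max(revisions, key=lambda pair: pair[0])[1] if revisions else None
    if current_fingerprint != last_fp:
        labels.append("current")
    return labels


# Made-up fingerprints "a", "b", "c" stand in for real content comparisons.
print(plan_saves([(5194635, "a"), (5548771, "b"), (5999999, "b")], "c"))
```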
In my opinion, when downloading data, it is possible to compare the local file with the current version. If they differ, the current version should be overwritten. Additionally, when ‘download historical versions’ is selected, download versions other than the current one and include a ‘revision ID’ in the filename (to provide distinct naming schemes for the current and historical versions).
This download scheme aims to minimize re-downloading existing local data and incorporate historical versions into the database.
https://kemono.party/api/swagger_schema
/{service}/user/{creator_id}/post/{post_id}/revisions List a Post's Revisions
revision api
@1223334444abc To reduce wear on SSDs, files could be stored in memory before being written to disk. While in memory, the cached file could be hashed, and that could be compared against what's already on disk (by file name). But this approach would need a configuration option for the maximum size to be cached in memory. When the cache is full, its contents would be written to disk, and the downloader would write to the file instead. There could also be an index file in every directory containing hashes of files (and their names) to save time.
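A minimal sketch of the hash comparison idea, using only Python's standard library: the downloaded bytes are hashed in memory and compared against the file already on disk, and the file is written only when it is missing or different. The maximum cache size and the per-directory index file mentioned above are left out, and `write_if_changed` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def sha256_bytes(data):
    """Hash a payload that is still held in memory."""
    return hashlib.sha256(data).hexdigest()


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file on disk in chunks so large files are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_if_changed(path, data):
    """Write `data` to `path` unless an identical file is already there.

    Returns True when the file was written, False when it was skipped.
    """
    path = Path(path)
    if path.exists() and sha256_file(path) == sha256_bytes(data):
        return False
    path.write_bytes(data)
    return True
```

Note that this avoids the disk write but not the download itself, and it still reads the existing file once to hash it; an index of stored hashes, as suggested above, would avoid that read.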
@055642 I'm just a user; please @ the author of this repository. That said, I think SSD endurance is high enough nowadays that there is no need to worry about wear; only the reliability and performance of the file comparison method need to be considered.
I see this issue hasn't been resolved...Kemono shows the revision in the url like so: https://kemono.su/fanbox/user/13222022/post/4427620/revision/2502520
I was trying to download a post's past revision (using the URL above as an example), but when I try to download it with Kemono Scraper, I get this error:
D:\> .\kemono-scraper.exe --link https://kemono.su/fanbox/user/13222022/post/4427620/revision/2502520
2023/12/14 20:13:08 Error splitting host component:[ fanbox user 13222022 post 4427620 revision 2502520] 8
I'm not sure when this issue will be resolved or how to do it now temporarily but I just wanted to mention this problem atm. Hopefully revision support is added soon!
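The error above suggests that the link parser does not expect the two extra `revision/<id>` path components. As an illustration only, written in Python rather than as the tool's actual code, a parser that accepts both URL shapes shown in this thread could look like this; `parse_post_url` is a hypothetical name.

```python
from urllib.parse import urlparse


def parse_post_url(url):
    """Split a Kemono post URL into its parts.

    Accepts /{service}/user/{creator_id}/post/{post_id} with an optional
    trailing /revision/{revision_id}. Raises ValueError for other shapes.
    """
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) not in (5, 7) or parts[1] != "user" or parts[3] != "post":
        raise ValueError(f"unrecognized post URL: {url}")
    result = {
        "service": parts[0],
        "creator_id": parts[2],
        "post_id": parts[4],
        "revision_id": None,  # None means the current version
    }
    if len(parts) == 7:
        if parts[5] != "revision":
            raise ValueError(f"unrecognized post URL: {url}")
        result["revision_id"] = parts[6]
    return result


print(parse_post_url("https://kemono.su/fanbox/user/13222022/post/4427620/revision/2502520"))
```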
There is a new discovery on this issue. The ‘revisions’ endpoint of the Kemono API returns many historical versions, but versions with the same editing time are actually identical. We need to group them by editing time, treat the latest group as the ‘current version’, and take one ‘historical version’ from each remaining group (the earliest entry in the group, to obtain its ‘revision_id’). That gives results consistent with the Kemono webpage.
https://kemono.su/fanbox/user/310609/post/3697579/ https://kemono.su/api/v1/fanbox/user/310609/post/3697579/revisions
I hope this feature can be added as soon as possible. Thank you very much.
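The grouping described above could be sketched as follows. It assumes each revision is a dict with the fields `edited` and `revision_id`, that `edited` values sort chronologically as strings (for example ISO 8601 timestamps), and that "the earliest one in each group" means the lowest `revision_id`; these are assumptions about the API response, not confirmed details.

```python
def group_by_edited(revisions):
    """Collapse revisions sharing an `edited` time to one representative.

    The representative of each group is the entry with the lowest
    revision_id. The result is sorted from oldest to newest edit time.
    """
    groups = {}
    for revision in revisions:
        groups.setdefault(revision["edited"], []).append(revision)
    representatives = [
        min(group, key=lambda revision: revision["revision_id"])
        for group in groups.values()
    ]
    representatives.sort(key=lambda revision: revision["edited"])
    return representatives


def split_current(revisions):
    """Return (current, history): the newest group and all older ones."""
    representatives = group_by_edited(revisions)
    if not representatives:
        return None, []
    return representatives[-1], representatives[:-1]


# Made-up sample data for illustration.
sample = [
    {"revision_id": 3, "edited": "2023-04-21T10:00:00"},
    {"revision_id": 1, "edited": "2023-04-01T10:00:00"},
    {"revision_id": 2, "edited": "2023-04-01T10:00:00"},
]
current, history = split_current(sample)
print(current["revision_id"], [revision["revision_id"] for revision in history])
```

This sketch raises `KeyError` when a revision has no `edited` field, so it only applies to services that provide one.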
——————————————
Oh no, it seems that some services do not have an ‘edited’ time.
https://kemono.su/patreon/user/3295915/post/88413981 https://kemono.su/api/v1/patreon/user/3295915/post/88413981/revisions
Now it seems that downloading can only be done by comparing the content.
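One way to compare by content is to reduce each version to a fingerprint of the fields that matter and drop versions whose fingerprint was already seen. The sketch below uses Python's standard library; the field names `title`, `content`, and `files` are hypothetical placeholders for whatever the API actually returns.

```python
import hashlib
import json


def fingerprint(version, fields=("title", "content", "files")):
    """Hash the chosen fields of a version into a stable hex digest.

    Fields not listed (for example a revision ID) are ignored, so two
    revisions with identical content get the same fingerprint.
    """
    subset = {name: version.get(name) for name in fields}
    canonical = json.dumps(subset, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_duplicates(versions):
    """Keep the first version seen for each distinct fingerprint."""
    seen = set()
    unique = []
    for version in versions:
        digest = fingerprint(version)
        if digest not in seen:
            seen.add(digest)
            unique.append(version)
    return unique
```

This compares metadata only. Detecting an image that was replaced under the same file name would still require comparing the file contents themselves.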
https://kemono.party/fanbox/user/2557134/post/5773097
Kemono seems to offer posts from different periods. As shown in the link above, attachments have been removed in newer versions. I hope to download all versions to ensure file integrity, and to fill in the historical versions without affecting the existing image database. Is there a unique numerical identifier or tag to distinguish versions?
Downloading only the latest or only the oldest posts is not appropriate, as some authors prefer to add new content while others prefer to delete content. It is necessary to save all versions, but different versions of posts may have the same name.
I have been thinking for a long time but haven’t come up with a good naming solution (without affecting the current database content, especially the folder hierarchy).
The current folder and file structure: