Nandaka / PixivUtil2

Download images from Pixiv and more!
http://nandaka.devnull.zone/
BSD 2-Clause "Simplified" License

Fanbox post not downloading html #689

Closed DisasterInbound closed 4 years ago

DisasterInbound commented 4 years ago

Description

Fanbox post not downloading html

When bulk downloading from a supported creator, I noticed that at least one post that includes some text was not downloaded as "article" type, which I believe is what creates the html file, but as "image" type instead.

I would like to know whether there is a minimum character count for a post to be deemed an article.

The post in question has 3 images and 121 characters without spaces (142 with spaces). The text has both Japanese and English characters, if that makes a difference.

If the "article" type has nothing to do with downloading the html of the post, or the character count is fine, what could be the reason for not downloading the html?

Steps to Reproduce

1. Open PixivUtil2
2. Select option F2
3. Enter post ID and download

Expected behavior: Download pictures and html of post.

Actual behavior: Downloaded only pictures of the post.

Versions

You can get this information from executing PixivUtil2.py --help.

I used beta version v20200430-beta2. I was not able to find PixivUtil2.py, but I'm attaching my config file and the log of the post that I was downloading: config.zip

bluerthanever commented 4 years ago

In my understanding, the visual difference between article posts and image posts is that in an image post all the text is at the end of the post, while in article posts the text is mostly scattered around, like between images, and there is some styling, like bold text or hyperlinks.

And the post type is actually obtained from the FANBOX API, not decided by the number of characters in the text.

So mostly, if it's an image post, there's not much need to make an html file, I guess?

bluerthanever commented 4 years ago

Is it mainly because if there's an html file, there's no need to go into each folder and browse the files? Haha

DisasterInbound commented 4 years ago

Thank you for your response and background knowledge of posts.

I totally agree that with image posts it shouldn't be necessary to download the html file, since most of the time creators post either only pictures or so little text that there is no point in saving it. For some pictures, however, I would like to download it, as it would give more background/details about the specific picture.

Now that you have explained the post type, could this issue be changed to a request to make it download html files optionally?

My usage would be as follows:

Read the log in the tool, and when it says "Downloading post 123" the tool could display "Do you want to download html file? Y/N?"

I would open the post manually in the browser and decide whether I want to save it or not.

This feature could be disabled by default, so as to avoid asking this question every time for all users, but those like me who would like to check manually could enable it.
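
To make the request concrete, here is a minimal sketch of such an opt-in prompt. It is not PixivUtil2 code: the function name `should_save_html` and the `ask_each_time` flag are hypothetical, standing in for whatever config option the tool might expose.

```python
from typing import Callable


def should_save_html(post_id: str,
                     ask_each_time: bool,
                     default: bool = False,
                     ask: Callable[[str], str] = input) -> bool:
    """Decide whether to save the html file for one post.

    ask_each_time is a hypothetical config flag; when it is off (the
    proposed default) nobody is prompted and `default` is returned.
    `ask` is injectable so the logic can be exercised without a console.
    """
    if not ask_each_time:
        return default
    question = (f"Downloading post {post_id}. "
                "Do you want to download html file? Y/N? ")
    while True:
        answer = ask(question).strip().lower()
        if answer in ("y", "yes"):
            return True
        if answer in ("n", "no"):
            return False
        # Any other input: ask again until a clear Y or N is given.
```

With the flag off the function never blocks, so bulk downloads behave as they do today; only users who enable it would see the question for each post.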

DisasterInbound commented 4 years ago

> Is it mainly because if there's an html file, there's no need to go into each folder and browse the files? Haha

Are the files that have an html file supposed to be downloaded within their own folders?

In my case, the media files and html files are saved as shown here:

    /folder/post_abc image1
    /folder/post_abc image2
    /folder/post_abc image3
    /folder/abc.html
    /folder/post_efg image1
    ... etc.

It doesn't create a /folder/post_abc folder where it saves the media files.

I'm okay with either choice but I want to confirm if the tool should create a folder for each html file (and within the folder also have the media files downloaded) or if it just leaves it on the general folder for that specific artist_id

bluerthanever commented 4 years ago

Well, folders are controlled by the filename formats... I download the images of each post into separate folders because I don't want a post's images scattered everywhere. But sometimes I think it's kinda inconvenient.

Do you use writeimageinfo, though? If you do, all the extra text should be saved in a text file, but it applies to all images and posts.

Here's some related information: https://github.com/Nandaka/PixivUtil2/issues/669#issuecomment-615208989

DisasterInbound commented 4 years ago

Thank you. I used the file formats and I was able to save the posts per folder.

I tried writeimageinfo, as you mentioned, to save the text as .txt files; however, I would prefer to download the html file.

I prefer it because speaking of this specific artist I'm trying to download, he uploads pictures and he either places text in between the pictures or at the end of them where it depicts a conversation between the characters shown. Think of it as a CG comic or manga without having the actual text typed in the picture.

I tried using the .txt file; however, in order to read the text and see the picture at the same time I would have to keep both the image viewer and the txt file open. After trying the html file with other posts that did download it, I found that much easier.

So I would like to ask again whether it's possible to add a "Save html for this post" option, as I mentioned above. You can make it a Fanbox-only feature if you decide to. And I understand if you refuse.

And even though I may not use it, I have a question regarding the .txt file downloaded with writeimageinfo, just to make sure it works properly on my end or not:

I set up file format in the following way:

filenameformat = (%member_id%)\%image_id%_%page_index%_%date%
filenamemangaformat = (%member_id%)\%image_id%\%image_id%_%page_index%_%date%
filenameinfoformat = (%member_id%)\%image_id%_%date%
filenamemangainfoformat = (%member_id%)\%image_id%\%image_id%

So in this way, if it's a single-picture post, it should be downloaded into the general member_id folder and, if it's a manga (multiple-picture) post, it will be downloaded into its own folder. I added filenameinfoformat and filenamemangainfoformat using the same syntax rules so that the .txt file of a manga would be saved within its own image_id folder. However, it doesn't work that way: it gets saved outside, in the general folder, instead. Did I make a mistake in the formatting above, or is it a bug?
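
As an illustration of why a `\` in a format string produces a subfolder, here is a minimal sketch of placeholder expansion. It is not PixivUtil2's actual implementation: `expand_format` is a hypothetical helper, and the member id, image id and date values are made-up samples.

```python
import re
from pathlib import PureWindowsPath


def expand_format(template: str, values: dict) -> PureWindowsPath:
    """Replace %name% placeholders in a filename format string.

    Unknown placeholders are left untouched. The result is parsed as a
    Windows path, so every backslash in the template starts a new folder.
    """
    def substitute(match):
        # group(1) is the name between the percent signs.
        return str(values.get(match.group(1), match.group(0)))

    return PureWindowsPath(re.sub(r"%(\w+)%", substitute, template))


# Made-up sample values, for illustration only.
values = {"member_id": "12345", "image_id": "678",
          "page_index": "0", "date": "20200430"}

single = expand_format(r"(%member_id%)\%image_id%_%page_index%_%date%", values)
manga = expand_format(
    r"(%member_id%)\%image_id%\%image_id%_%page_index%_%date%", values)

print(single.parts)  # two components: member folder, then the file name
print(manga.parts)   # three components: the extra one is the image_id folder
```

The only difference between the two templates is the extra `%image_id%\` segment, which is what places multi-image downloads in their own folder. Which template the tool applies to a given file is a separate question.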

One bug I found: when managing the database (to delete the post_id and redownload it), I noticed that the prompt text is incorrect after selecting the option. Example:

When choosing F2, which displays "Delete Fanbox download history by post_id", the next line shows "member_id?" as if I had pressed the F3 option instead. Likewise, if I choose F3, it displays "post_id?" when it is supposed to be "member_id?". They work properly in the background, but their labels are swapped.

bluerthanever commented 4 years ago

> I added filenameinfoformat and filenamemangainfoformat using the same syntax rules so that the .txt file of a manga would be saved within its own image_id folder. However, it doesn't work that way: it gets saved outside, in the general folder, instead.

Is it a FANBOX post or regular pixiv manga images whose image information you are trying to download? I will have a look at it later...

The display.... it should be a typo... really sorry. Will fix that.

bluerthanever commented 4 years ago

I will see what I can do about the html as well...

bluerthanever commented 4 years ago

> filenameformat = (%member_id%)\%image_id%_%page_index%_%date%
> filenamemangaformat = (%member_id%)\%image_id%\%image_id%_%page_index%_%date%
> filenameinfoformat = (%member_id%)\%image_id%_%date%
> filenamemangainfoformat = (%member_id%)\%image_id%\%image_id%

Hey, I just tested with the formats you provided, and it works like this for me:

| Type | Where the txt file is saved |
| --- | --- |
| Regular manga images | in the same folder with the images |
| Fanbox posts | outside the folder where the images are |

But I have also noticed that the format for filenameinfoformat is (%member_id%)\%image_id%_%date%, which is what is used when creating FANBOX post image info... So it's not wrong that it was saved outside the folder, I guess.

bluerthanever commented 4 years ago

> I prefer it because speaking of this specific artist I'm trying to download, he uploads pictures and he either places text in between the pictures or at the end of them where it depicts a conversation between the characters shown. Think of it as a CG comic or manga without having the actual text typed in the picture.

Are the texts in images in Japanese? And those at the end of the post are English translations? And is that why you need to read them against each other?

But posts with text between images should be article-type posts, which is not what this issue is about. The one that is bothering you is the other kind. But even if it's written into an html file, the text would still be at the end, and you might need to scroll up and down frequently to read it, and the more images there are, the longer the distance. I think it would probably be better to open the image on the left side of the screen and the txt file on the other side. I am just assuming, but is that a possibility? Also, I have not come up with a really good idea yet... except for adding more options in config....

DisasterInbound commented 4 years ago

> Is it a FANBOX post or regular pixiv manga images whose image information you are trying to download? I will have a look at it later...

It is Fanbox only. I have not tried normal pixiv manga yet.

> Hey, I just tested with the formats you provided, and it works like this for me:
>
> But I have also noticed that the format for filenameinfoformat is (%member_id%)\%image_id%_%date%, which is what is used when creating FANBOX post image info... So it's not wrong that it was saved outside the folder, I guess.

I thought that on the Fanbox side the API would automatically detect it as 1 picture = single post and 2 or more pictures = manga. Just to make sure I understand your comment correctly, does this mean all posts on Fanbox are handled as one picture only?

bluerthanever commented 4 years ago

> I thought that on the Fanbox side the API would automatically detect it as 1 picture = single post and 2 or more pictures = manga. Just to make sure I understand your comment correctly, does this mean all posts on Fanbox are handled as one picture only?

However many images there are, they are currently processed with this logic:

* for post covers it uses `filenameFormat`
* for images inside posts it uses `filenameMangaFormat`
* for info and html it uses `filenameInfoFormat`

DisasterInbound commented 4 years ago

> Are the texts in images in Japanese? And those at the end of the post are English translations?

Both. He inserts the text in Japanese, followed by the translation in English. I'm not able to give you a proper example at the moment, but it would be like:

Character 1: "[Insert japanese text here]" "[Insert english translation]"
Character 2: "[Insert japanese text]" "[Insert english translation]"

I'll be able to provide an html file from a post as an example in a couple of hours.

> And is that why you need to read them against each other?

Since the text is posted Japanese + English side by side, I'm okay with how the author is posting it. (I don't use the Japanese text, but it doesn't matter to me whether it is left there or removed.)

> But posts with text between images should be article-type posts, which is not what this issue is about.

Yeah, sorry for the confusion here. Generally speaking, he posts pictures with text on them, as he usually posts more than 5 pictures at a time, and the characters have a conversation between them.

In this particular post that I made this issue about, he posted 3 pictures and some text at the end, with no text in the middle. This in turn caused the post to be labeled as image instead of article, which is why the html file was not downloaded.

> The one that is bothering you is the other kind. But even if it's written into an html file, the text would still be at the end, and you might need to scroll up and down frequently to read it, and the more images there are, the longer the distance. I think it would probably be better to open the image on the left side of the screen and the txt file on the other side.

The pictures are big in resolution, but the way they are inserted in the html (and shown live on the Fanbox website) is spaced properly, so you can view a picture and still read the text. It will be better to show you an example so you know what I'm talking about; I'll do it later.

> I am just assuming, but is that a possibility?

Although it can be done, it is just easier to have them in one place rather than separate.

> Also, I have not come up with a really good idea yet... except for adding more options in config....

That's okay, don't worry. I'm thankful that you are taking it into consideration, and I wouldn't mind waiting if it's necessary or if it's low priority compared to other bug fixes or new features you want to add.

DisasterInbound commented 4 years ago

> I thought that on the Fanbox side the API would automatically detect it as 1 picture = single post and 2 or more pictures = manga. Just to make sure I understand your comment correctly, does this mean all posts on Fanbox are handled as one picture only?
>
> However many images there are, they are currently processed with this logic:
>
> * for post covers it uses `filenameFormat`
> * for images inside posts it uses `filenameMangaFormat`
> * for info and html it uses `filenameInfoFormat`

Thank you. It was my mistake. I did read this post before, but I automatically (and wrongly) assumed that since posts with pictures were handled with filenameMangaFormat, they would also use filenameMangaInfoFormat because, well, "Manga" is in the name of both settings.

I'll change this on my end and let you know about it.

I understand now that filenameMangaInfoFormat would be for pixiv manga posts, right?

bluerthanever commented 4 years ago

That seems to be a very unique and customized request... Haha. Well, mayyyyyyyyyyyyyyyyyyyyyyyyyyybe I can write a patch for you? Haha.

> Thank you. It was my mistake. I did read this post before, but I automatically (and wrongly) assumed that since posts with pictures were handled with filenameMangaFormat, they would also use filenameMangaInfoFormat because, well, "Manga" is in the name of both settings.

Yeah, I was a little confused at first, until I read the code.

> I understand now that filenameMangaInfoFormat would be for pixiv manga posts, right?

Correct.
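
The mapping worked out in this exchange can be summarized in a small lookup table. This is only a sketch of the behavior described in the thread, not code from PixivUtil2: the `Artifact` enum and `format_option_for` are hypothetical names, and the camelCase spelling `filenameMangaInfoFormat` is assumed from the pattern of the other option names.

```python
from enum import Enum


class Artifact(Enum):
    """Kinds of files discussed in this thread (hypothetical grouping)."""
    FANBOX_COVER = "cover of a FANBOX post"
    FANBOX_IMAGE = "image inside a FANBOX post"
    FANBOX_INFO = "info text or html of a FANBOX post"
    PIXIV_MANGA_INFO = "info text of a regular pixiv manga post"


# Which config option names the file, as described in the comments above.
FORMAT_OPTION = {
    Artifact.FANBOX_COVER: "filenameFormat",
    Artifact.FANBOX_IMAGE: "filenameMangaFormat",
    Artifact.FANBOX_INFO: "filenameInfoFormat",
    Artifact.PIXIV_MANGA_INFO: "filenameMangaInfoFormat",
}


def format_option_for(artifact: Artifact) -> str:
    """Return the name of the config option used for this kind of file."""
    return FORMAT_OPTION[artifact]


for artifact in Artifact:
    print(f"{artifact.value}: {format_option_for(artifact)}")
```

Read this way, a FANBOX info or html file lands wherever `filenameInfoFormat` points, regardless of how many images the post has, which matches the behavior reported earlier.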

bluerthanever commented 4 years ago

I think I partially implemented your request in #699. There are new options, minTextLengthForNonArticle and minImageCountForNonArticle, that control whether non-article posts are written to HTML. If a non-article post contains text no shorter than the value set in minTextLengthForNonArticle AND no fewer images/files than the value set in minImageCountForNonArticle, it will be written into HTML, in a somewhat different layout meant for non-article posts. It's not exactly what you wanted, though: it does not pause and prompt before downloading, and it does not split the text and place it between images. If you still think the prompt thing is necessary, mayyyyyyyyyyyyybe in the next enhancement?

There are some other new options in this PR. Please refer to #699, and wait for the next release.
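
For readers who want the rule spelled out, the two thresholds combine roughly as in the sketch below. It follows only the description in this comment, not the code in #699: `should_write_html` is a hypothetical name, and measuring text length with `len()` (spaces included) is an assumption.

```python
def should_write_html(post_type: str,
                      text: str,
                      image_count: int,
                      min_text_length: int,
                      min_image_count: int) -> bool:
    """Return True when an html file would be written for a post.

    Article posts always get html. Non-article posts get html only when
    BOTH thresholds are met (minTextLengthForNonArticle and
    minImageCountForNonArticle in the description above).
    """
    if post_type == "article":
        return True
    # Assumption: length is counted in characters, spaces included.
    return len(text) >= min_text_length and image_count >= min_image_count


# The post from the original report: 3 images, 142 characters with spaces.
sample_text = "x" * 142
print(should_write_html("image", sample_text, 3,
                        min_text_length=100, min_image_count=3))
print(should_write_html("image", sample_text, 3,
                        min_text_length=200, min_image_count=3))
```

Under these assumptions the post from the report gets an html file with a text threshold of 100 and is skipped with a threshold of 200, so the values chosen in the config decide which image posts are kept as html.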

DisasterInbound commented 4 years ago

Thank you very much for this request.

This solution is excellent. I also thought about adding a few options like these to the config file, but I was unsure whether it was better/easier for you to add this through prompts for each post or through settings in the file.

Looking forward to the release and I'll let you know about it!