Simon-Tesla / RaccoonyWebEx

A WebExtension that adds shiny features to art sites
MIT License

Other Danbooru-based sites? Specifically danbooru.2chan.jp #179

Open Joebugg opened 1 month ago

Joebugg commented 1 month ago

Is your feature request related to a problem? Please describe. Just an unimplemented site.

Describe the solution you'd like The ability to automatically save metadata, not just the right-click-to-save functionality.

Describe alternatives you've considered Adding the support myself, then submitting changes. Saving the page with images (built-in browser function).

Additional context I was doing an image search on Google and found a source of ancient images that exist nowhere else. I noticed that this Japanese Danbooru-based site wasn't supported. I'm not even sure how popular it is, or whether it's worth supporting. I just figured you might want to know about it, and other users might be able to benefit.

http://jun.2chan.net/script/ seems to be source code of the site's PHP, if that is useful. No, wait, haha, this is from 2005. There's no way this site would still be functioning with code that old and not be getting hacked every day.

BTW: GitHub still has that annoying issue (https://github.com/testing-library/user-event/issues/1075 and related) where you have to press Tab to get to the "submit" button and then press Enter instead of just left-clicking. Otherwise it times out.

Simon-Tesla commented 1 month ago

Due to the way site plugins are implemented currently, each plugin needs to register as a handler for a specific URL or set of URLs (e621 is registered as the handler for e926 as well, for instance), so even if multiple sites use the same underlying booru software, each one would need to be registered individually.

It might make sense to see if there's a good way to let Raccoony plugins register as a more generic handler: some basic code would look for a quick 'tell' in the DOM to recognize a given type of site, so that richer metadata could be scraped off any site that uses that booru software. It'd take a bit of doing, though.
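A rough sketch of how hostname registration plus a generic DOM 'tell' could fit together, in TypeScript. Everything named here (`SitePlugin`, `PageLike`, `findPlugin`, the `detect` hook, and the selectors used as the tell) is hypothetical and for illustration only; it is not Raccoony's actual plugin API.

```typescript
// Hypothetical shapes for illustration; NOT Raccoony's real plugin API.
interface PageLike {
  hostname: string;
  // Returns true when the selector matches at least one element on the page.
  has(selector: string): boolean;
}

interface SitePlugin {
  name: string;
  // Exact hostnames this plugin is registered for (the per-site approach).
  hostnames: string[];
  // Optional generic 'tell': a cheap DOM check that identifies the site software.
  detect?: (page: PageLike) => boolean;
}

function findPlugin(plugins: SitePlugin[], page: PageLike): SitePlugin | undefined {
  // Pass 1: explicit hostname registrations always win.
  for (const plugin of plugins) {
    if (plugin.hostnames.indexOf(page.hostname) !== -1) return plugin;
  }
  // Pass 2: fall back to plugins that recognize the page by its DOM.
  for (const plugin of plugins) {
    if (plugin.detect && plugin.detect(page)) return plugin;
  }
  return undefined;
}

const plugins: SitePlugin[] = [
  // One plugin registered for two hostnames, as e621 is for e926.
  { name: "e621", hostnames: ["e621.net", "e926.net"] },
  {
    name: "generic-booru",
    hostnames: [],
    // Assumed tell: placeholder element ids that a booru page might expose.
    detect: (page) => page.has("div#post-view") && page.has("div#tag_list"),
  },
];
```

In a content script, `has` could be as simple as `(sel) => document.querySelector(sel) !== null`. Checking hostnames first keeps existing site plugins from being shadowed by a loose generic detector.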

Joebugg commented 1 month ago

TBF, looking at that site closer, I think they just kept the code as simple as possible. So it might not be far from the actual PHP source they use. I could see them doing just security fixes, under the theory that complexity builds bugs. Also, the Rule of Lazy. ;)

So, I checked e621 and there are all of 2 posts that mention that site as a source, and 1 is ancient. This is definitely not a popular booru outside of Japan! I doubt code written for recent boorus will even work correctly on that site. Ironically, it's too basic? It just has these links as endpoints (right term?):

https://danbooru.2chan.jp/index.php?page=post&s=view&id=<post ID>

html body div#content div#post-view div#right-col.content div div#note-container img#image
(For https://danbooru.2chan.jp/images/<load distribution hash>/<file data hash>.EXT)

https://danbooru.2chan.jp/index.php?page=post&s=view&id=<post ID>#

https://danbooru.2chan.jp/index.php?page=history&type=tag_history&id=<post ID>

html body div#content div#post-view div#right-col.content div div#note-container div#c<number>
(Annoyingly, the comments are inside the image container?)

html body div#content div#post-view div.sidebar div#tag_list  (Tags)
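The URL shapes above could be handled with a few small helpers, sketched here in TypeScript. The function names are made up for illustration, and the code assumes post IDs are plain decimal numbers, which the URLs suggest but which hasn't been confirmed.

```typescript
const BASE = "https://danbooru.2chan.jp/index.php";

// Returns the post ID if `href` is a post view page on this site, else null.
// Assumption: post IDs consist only of decimal digits.
function parsePostId(href: string): number | null {
  let url: URL;
  try {
    url = new URL(href);
  } catch {
    return null; // not a parseable absolute URL
  }
  if (url.hostname !== "danbooru.2chan.jp") return null;
  const query = url.searchParams;
  if (query.get("page") !== "post" || query.get("s") !== "view") return null;
  const id = query.get("id");
  return id !== null && /^\d+$/.test(id) ? Number(id) : null;
}

// Builds the post view URL for a given post ID.
function postViewUrl(id: number): string {
  return `${BASE}?page=post&s=view&id=${id}`;
}

// Builds the tag history URL for a given post ID.
function tagHistoryUrl(id: number): string {
  return `${BASE}?page=history&type=tag_history&id=${id}`;
}
```

A trailing `#` on the view URL is parsed as an empty fragment, so both forms of the post URL resolve to the same ID.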

Not sure if the notes/tag history is worth worrying about. "Posted on" is followed by upload date. This is not in the header, but the tags are. The artist name is just another tag and doesn't have a category like e621 uses. I could just list all the artist tags in a file and search if they're in the tags list but that seems like a bad idea. There is literally no list of artist tags. It's all just tossed in like beach sand.

XPaths:

//*[@id="image"]  (Normal link, sometimes 'original' is the same as viewed)
/html/body/div[3]/div/div[2]/div/div/a[2]  (If 'original' link exists it might not be the same as viewed)
//*[@id="c<number>"] (Comments)
//*[@id="tag_list"]  (Tags)

From looking at the source of the page, it looks like the relevant metadata to save would be the tags, the posted date, the image URL(s), and the comments? I guess we wouldn't get an artist tag. Unavoidable, it seems, unless they have a list of them somewhere on the site. :( The comment fields are "c" followed by 1-2 digits: c3, c4 ... c40 and so on. There aren't actually that many comments on the entire site, so we're not losing much!
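For what it's worth, pulling those fields out could look roughly like this in TypeScript. It is only a sketch: `scrapePost`, `PostMetadata`, and the `*Like` interfaces are invented names, and it assumes each tag is rendered as a link inside `div#tag_list`, which would need checking against the real markup. The posted date is left out because its position in the DOM isn't described above.

```typescript
// Minimal structural types so the sketch also runs outside a browser;
// a real Document and its Elements should satisfy them.
interface ElementLike {
  id: string;
  textContent: string | null;
  getAttribute(name: string): string | null;
}
interface RootLike {
  querySelector(selector: string): ElementLike | null;
  querySelectorAll(selector: string): ArrayLike<ElementLike>;
}

interface PostMetadata {
  imageUrl: string | null;
  tags: string[];
  comments: string[];
}

// Comment containers have ids of the form "c" + digits (c3, c4 ... c40).
const COMMENT_ID = /^c\d+$/;

// Copies an array-like (e.g. a NodeList) into a plain array.
function collect(list: ArrayLike<ElementLike>): ElementLike[] {
  const out: ElementLike[] = [];
  for (let i = 0; i < list.length; i++) out.push(list[i]);
  return out;
}

function trimmedText(el: ElementLike): string {
  return (el.textContent || "").trim();
}

function scrapePost(root: RootLike): PostMetadata {
  const image = root.querySelector("img#image");
  // Assumption: every tag is an <a> inside div#tag_list.
  const tags = collect(root.querySelectorAll("div#tag_list a"))
    .map(trimmedText)
    .filter((t) => t.length > 0);
  // The attribute selector also matches ids like "content",
  // so the regex narrows the result to real comment containers.
  const comments = collect(root.querySelectorAll('div[id^="c"]'))
    .filter((el) => COMMENT_ID.test(el.id))
    .map(trimmedText);
  return {
    imageUrl: image ? image.getAttribute("src") : null,
    tags,
    comments,
  };
}
```

In a content script this would be called as `scrapePost(document)`.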

Joebugg commented 1 month ago

LOL, at this point I'd say it's simpler to just save the HTML to a text file. Yeesh, this one doesn't actually have much to parse.

Simon-Tesla commented 1 month ago

Yeah, that's often the problem with a lot of these sorts of sites: there's not a ton of metadata to scrape in the first place. In terms of the structured data Raccoony currently supports, it looks like the most you'd get out of this is the list of tags.