New parsers for baraag and pawoo

tokumeiii commented 3 years ago

I've overhauled the parsers for baraag and pawoo (and also streamlined the URL classes a bit). The current parser in the repo just grabs the bare image/video URLs from the post feed it's given and associates the username with them. These new parsers grab the post URLs, associate each post's URL and post time with the downloaded media, and also grab hashtags from the posts (which on these websites tend to be content descriptors and not total garbage like Twitter hashtags).
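To illustrate the kind of association described above, here is a minimal Python sketch. It is not the parser from this PR. It assumes the post is available as a dict shaped like a Mastodon API status (the `url`, `created_at`, `tags` and `media_attachments` fields); the helper name `extract_media_records` and the sample URLs are hypothetical.

```python
from datetime import datetime


def extract_media_records(status):
    """Pair every attached file with its post URL, post time and hashtags.

    `status` is assumed to look like a Mastodon API status object.
    """
    post_url = status.get("url")
    created_at = status.get("created_at")
    posted = None
    if created_at:
        # Timestamps are assumed to end in "Z"; swap it for an explicit UTC
        # offset so datetime.fromisoformat accepts it on older Pythons too.
        posted = datetime.fromisoformat(created_at.replace("Z", "+00:00"))
    hashtags = [tag["name"] for tag in status.get("tags", [])]
    # One record per attachment, each carrying the same post-level metadata.
    return [
        {
            "file_url": media["url"],
            "post_url": post_url,
            "posted": posted,
            "hashtags": hashtags,
        }
        for media in status.get("media_attachments", [])
    ]


# Hypothetical sample post with two attached files.
sample_status = {
    "url": "https://mastodon.example/@someone/12345",
    "created_at": "2021-03-01T12:34:56.000Z",
    "tags": [{"name": "sketch"}],
    "media_attachments": [
        {"url": "https://media.mastodon.example/a.png"},
        {"url": "https://media.mastodon.example/b.mp4"},
    ],
}
records = extract_media_records(sample_status)
```

Each record carries the same post URL, post time and hashtag list, so a post with several files yields several records that all point back to the same post.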

The baraag package includes an extra parser for the webapp view (the URLs with /web/statuses in them) that grabs the normal post URL. This isn't included with the pawoo package because pawoo just serves Hydrus an error page about JavaScript not being available, and that page doesn't include the post URL or anything else useful.
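To make the two URL shapes concrete, here is a rough Python sketch that tells a webapp-view URL apart from a normal post URL. The `/web/statuses` path comes from the description above; the `/@username/<id>` shape for normal post URLs, the helper name `classify_post_url`, and the example host are assumptions for illustration. This is not how Hydrus URL classes are defined.

```python
import re
from urllib.parse import urlsplit

# Webapp view: /web/statuses/<numeric id>
WEBAPP_PATH = re.compile(r"^/web/statuses/(?P<post_id>\d+)/?$")
# Normal post URL (assumed shape): /@<username>/<numeric id>
POST_PATH = re.compile(r"^/@(?P<user>[^/]+)/(?P<post_id>\d+)/?$")


def classify_post_url(url):
    """Return (kind, host, post_id), or None if the URL matches neither shape."""
    parts = urlsplit(url)
    match = WEBAPP_PATH.match(parts.path)
    if match:
        return ("webapp", parts.netloc, match.group("post_id"))
    match = POST_PATH.match(parts.path)
    if match:
        return ("post", parts.netloc, match.group("post_id"))
    return None
```

Under these assumed shapes the webapp URL carries only the numeric ID and no username, so the normal post URL can't be built from it by string rewriting alone. It has to be read out of the fetched page, which is why a page that contains nothing but an error is a dead end.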

The post parser includes a number of test cases (multiple files attached, hashtags on the post, replies to the post that themselves include media [which isn't grabbed]) and I didn't find anything where the parser behaves undesirably.

floogulinc commented 3 years ago

Huh. I'm not sure why these had their own parsers in the first place, since they are included in the Mastodon parser.

tokumeiii commented 3 years ago

I've basically changed this into a generic Mastodon parser (to the extent that it wasn't already) and have merged in the display name parsing you had in your own copy of the parser. It should be expandable to basically any Mastodon instance that hasn't had a major UI overhaul, just by writing more URL classes and GUGs.

There is that big pile of Mastodon URL classes in the repo, but I'm a bit leery of them. I know for a fact that a significant number of them are dead instances: I didn't check anywhere near all of them, but most of the several I did check failed to connect or were being domain-squatted. Of the ones that are still alive, approximately none except baraag and pawoo have likely ever had a Hydrus user download anything from them. I don't have a very strong opinion about it, but I kind of feel like it makes more sense to just provide URL classes and GUGs for the instances people actually want. I'm not necessarily saying that's only these two (I've not really heard of any of the others, so I have no idea), but at the very least I think that top 100 list would benefit from being pruned.

CuddleBear92 / Hydrus-Presets-and-Scripts

New parsers for baraag and pawoo #104