eoan-ermine / e621-py

A feature-rich high-level e621 and e926 API wrapper.
MIT License
19 stars 4 forks source link

Add blacklist matching by name alias #4

Open Hmiku8338 opened 2 years ago

Hmiku8338 commented 2 years ago

Currently, blacklist is applied to posts in a naive way -- assuming the user inputted all the tags correctly. However, it does not take the tag aliases into account. The reason we haven't implemented it with name aliases to begin with is speed: e621 does not give us a fast way to query a multitude of name aliases (for example, with a single request). So if a blacklist contains 15 tags, it will require us to do 15 web requests. So even if we cache the correct blacklist, it will still take quite a bit of time to filter any number of initial posts.

So the request is simple: Implement some fancy way of doing it quickly :)

Deer-Spangle commented 2 years ago

What would even be the way to speed that up? Feels like there's two options:

Hmiku8338 commented 2 years ago

Your first suggestion would work on a long-running web server but not in a library due to the sheer number of aliases. Each installation of the library would require us to make a ton of requests. We could, however, host a the mapping elsewhere, update it as often as we want, and load it with a single request. However, it still depends on the size of the mapping. If it's 10-20 MB, it would be a bit too large.

About your second point: e621 might already provide such functionality, we just need to look in the right place or ask the right people because the official documentation doesn't list anything like that.

Deer-Spangle commented 2 years ago

Yeaah, doing it in the library would probably be a bit much. There's 75 per page on their listing endpoint (listing endpoint doesn't seem to be in this lib btw?) And there's 697 pages. So that would be 700 requests to load them all up, and 52k tag aliases to store, about 15MB of raw JSON, ouch. And then you need to be checking the search endpoint by update time to keep the list up to date...

Maybe doable by a higher level application, but yeah. Too much for the library.

Yeah, the e621 docs are not great. I'm surprised you got so much API coverage in this lib though, wow! It totally tops the rankings in how much of their API is covered, loads of undocumented bits, thank you!

I could go file an issue on their repo asking about a batch get endpoint, if you want? Or a wildcard or tag list for the search endpoint I guess

Hmiku8338 commented 2 years ago

I think all of our endpoints support listing. Here's the interface of TagAliases: image

Yeah, that would be cool if you created the issue. But be sure to ask whether there exists such a method already.

Deer-Spangle commented 2 years ago

Yeah, I saw the model has get and search. I guess list is just search with no parameters defined, I didn't consider that!

Looking in e621ng, they have a method to convert a blacklist to an aliased query but there's no endpoint for doing that. The endpoints are in the TagAliasesController and I can see there that only index and show are publicly available. (For admins, there's an update, delete and approve endpoint also. Also an edit endpoint that doesn't seem functional?)

I'm not too sure what the endpoint would be for such a thing, I mean, doing one that just calls TagAlias.to_aliased_query(param['query']) would be stellar, but I'm a little unsure what the endpoint would look like, and how to wire that in.. I'll go make a ticket there! (I've seen endpoints using a colon-verb kinda syntax in other APIs in the past, (e.g. /tag_aliases.json:batchGet?ids=1,2,3) not sure whether it's good practice though. That would mean an endpoint something like /tag_aliases.json:convertQuery?query=-deer or whatever

Hmiku8338 commented 2 years ago

Problem solved :)

One of us will need to open and merge the pull request to add this into e621-py

Deer-Spangle commented 2 years ago

Yeah! I guess every endpoint already has batch get, that's handy!

Does the library need updating in a deeper manner, to support passing a list of IDs to all endpoints? Or does it just need some type hint updates?

Hmiku8338 commented 2 years ago

I'm not sure whether we'll need a lot of changes (because I'm not very well informed about the interface of batch get). But I'm pretty sure that the changes will be limited to concrete endpoint classes and the abstract class will not need to change so no deep changes.

If you want to do it yourself, I can review your code. If not, I can implement it later when I have free time (which I don't have a lot of lately)

If you have any questions -- feel free to ask.