Hiyori-API / checker_mal

maintains an ID cache for MAL anime/manga, with some complementary web endpoints
https://purarue.xyz/mal_unapproved/
MIT License

General Plan/Feature List #1

Closed: purarue closed this issue 4 years ago

purarue commented 4 years ago

Based on mal-id-cache, but nicer since I don't have to deal with Python async things.

Store state in a db/file locally; see the config file for ranges.

General Strategy/improvements from mal_id_cache:

Use search pages sorted by descending ID to check ranges of MAL pages, stopping at the page specified by the local database state if we don't find anything.

Instead of deleting/rebuilding the cache every 'x' days, use the index (this project) of all unapproved/approved entries, and check against the last ID on the current page to see if we've passed it.
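Something like this, to sketch the two points above (Python just for illustration; the project itself may be in another language, and `fetch_page`, the page size, and the variable names here are made up, not the actual API):

```python
def scan(fetch_page, known_ids, stop_page):
    """Walk the reverse-ID-sorted search pages collecting unseen IDs.

    `fetch_page(n)` is a hypothetical helper returning the IDs on page n,
    newest first (assumed ~50 per page). `stop_page` comes from local
    database state: if nothing new turns up we only check up to it, but
    finding something extends the range so we don't miss changes further
    back.
    """
    new_ids = []
    page = 1
    while page <= stop_page:
        ids = fetch_page(page)
        unseen = [i for i in ids if i not in known_ids]
        if unseen:
            new_ids.extend(unseen)
            stop_page = max(stop_page, page + 1)  # keep checking past a change
        # if the last ID on this page is older than everything in the
        # index, we've passed it -- no need to rebuild the whole cache
        if ids and known_ids and ids[-1] < min(known_ids):
            break
        page += 1
    return new_ids
```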

We have to cache the last unapproved anime/manga ID (while it's unapproved), since (in mal_id_cache terms) it's used as the 'base case' for the -2 page search. If:

The names from the MAL index page should be parsed and compared against the known names. It's possible that entries are merged/deleted on MAL (https://myanimelist.net/forum/?topicid=1795522), so in that case the Hiyori combiner would be sent some sort of message which describes "this may have changed", but it's not certain. Hiyori can use the cached attributes to check against the ID that has changed, to determine if it was just a change to the title, or a merging of entries. Define some enumeration that describes how to deal with deleted/merged entries, as opposed to unapproved -> approved entries, as far as updating the global index goes. This will never work 100% (an entry could still change without the MAL name changing, and a name can change without the MAL entry having been merged/replaced), but it's better than not implementing it at all.
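For illustration, that enumeration might look something like this (Python sketch; the variant names and the classifier are placeholders, nothing here is decided):

```python
from enum import Enum, auto

class EntryChange(Enum):
    """Hypothetical classification of how an ID's state changed,
    used to decide how the global index should be updated."""
    APPROVED = auto()  # unapproved -> approved: add to the approved index
    DENIED = auto()    # unapproved entry removed: drop it, it 404s now
    RENAMED = auto()   # same ID, title changed: metadata update only
    MERGED = auto()    # entry merged into another: drop + notify Hiyori

def classify(cached_name, current_name):
    """Best-effort guess; as noted above this can't be 100% accurate."""
    if current_name is None:
        return EntryChange.DENIED
    if cached_name != current_name:
        # could just as well be MERGED -- the combiner gets a
        # "this may have changed" message and decides downstream
        return EntryChange.RENAMED
    return None  # no observable change
```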

It may be possible to skip pages while iterating, if you know the next ID is more than 50 IDs back (which could be determined by sorting a combined list of unapproved+approved IDs); see the oldest unapproved manga entries. This isn't necessary, though: in the best case it reduces the requests made by about 5 every 10 days.
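Under the assumption that the local combined ID list mirrors what MAL's pages actually show, the target page is just an index calculation (hypothetical helper, Python sketch):

```python
def page_of(target_id, ids_desc, page_size=50):
    """Page number a target ID should appear on, given the combined
    unapproved+approved ID list sorted descending. A sketch: assumes
    the local list matches MAL's listing exactly."""
    return ids_desc.index(target_id) // page_size + 1

# e.g. jump straight to page_of(next_interesting_id, ids_desc)
# instead of fetching every page in between
```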

purarue commented 4 years ago

Just going to highlight this part, because I'm often confusing myself/second-guessing whether or not caching the last unapproved entry is necessary.

It is not necessary.

We can figure out what the last unapproved item was by taking the set difference of all known IDs with the approved set.
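Concretely, with made-up IDs (Python sketch):

```python
all_ids = {40001, 40002, 40003, 40004}  # every ID seen on the index pages
approved_ids = {40002, 40004}           # the approved subset

unapproved_ids = all_ids - approved_ids  # {40001, 40003}
oldest_unapproved = min(unapproved_ids)  # 40001: how far back to scan
```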

If what was the last unapproved entry was denied, it no longer exists, so we don't have to check that far back. The next time we calculate the oldest unapproved ID, it'll be the next oldest unapproved entry, which is fine -- since the old one 404s now.

If what was the last unapproved entry was approved, the whole point of checking that as a 'page range' is that we'd find it, so we'd find it when we check the unapproved 'page range' ((-2) in mal-id-cache).

purarue commented 4 years ago

Again, to dispute the commit message from back then:

if the last entry was approved, theres a possibility
that the last entry is not found, since this uses
the last unapproved ID as the loop condition
to go back till when '-2' is the key
for unapproved entries.

At the point at which you're deciding what the last unapproved ID is, if that entry was approved, we're not aware of that yet -- it's not in the approved set of IDs.

So, we assume it's unapproved and go back that many pages, find it, and mark it approved.

Next time, we don't go back that far, which is correct.
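To sketch that self-correcting behavior (Python, with assumed names; the real loop condition may differ):

```python
def pages_back(all_ids, approved_ids, page_size=50):
    """How many reverse-sorted pages to walk to reach the oldest ID we
    still *believe* is unapproved. If that entry was actually approved
    since the last run, we over-scan once, find it and mark it approved,
    and this number shrinks on the next run."""
    unapproved = all_ids - approved_ids
    if not unapproved:
        return 1
    oldest = min(unapproved)
    newer_or_equal = sum(1 for i in all_ids if i >= oldest)
    return -(-newer_or_equal // page_size)  # ceiling division
```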