ZeroQI / Hama.bundle

Plex HTTP Anidb Metadata Agent (HAMA)

Cache age change is not being applied #369

Closed: EndOfLine369 closed this issue 4 years ago

EndOfLine369 commented 4 years ago

Hmmm... looks like the cache age change is never actually being applied (a variable case mismatch: 'Ended' <> 'ended'), and that is a good thing. Looks like for AniDB (haven't looked at TVDB) there is an assumption that if an enddate is set, the series is actually done, which is not the case (example below). It just means they know when it will end.

If this feature is truly to be implemented, it should take the end date, add time to it (to allow for post-series meta wrap-up), and extend the cache only if that date is < now. I would also only extend it to a 1-year cache if the series has been done for a while. Meta will evolve for a series over time, especially if it is part of a long-running series, and series can easily have 1-2 years between seasons.

For example, a currently airing series can already have its enddate set:

<?xml version="1.0" encoding="UTF-8"?><anime id="13844" restricted="false">
<type>TV Series</type>
<episodecount>12</episodecount>
<startdate>2019-10-07</startdate>
<enddate>2019-12-23</enddate>
...

Originally posted by @EndOfLine369 in https://github.com/ZeroQI/Hama.bundle/issues/367#issuecomment-554741010
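
A minimal sketch of that proposal, assuming a 30-day grace window (illustrative) and the 6/90/365-day tiers that show up in the logs later in this thread:

    from datetime import datetime, timedelta

    CACHE_1DAY = 86400  # seconds

    def proposed_cache_limit(enddate, default=CACHE_1DAY*6, now=None):
        # Extend the cache only once enddate plus a grace period is in the
        # past, and only jump to the full 1-year cache for series that have
        # been finished for a while.
        now = now or datetime.now()
        grace = timedelta(days=30)                   # illustrative wrap-up window
        if enddate is None or enddate + grace > now:
            return default                           # ongoing or just ended: 6-day cache
        if now - enddate > timedelta(days=730):      # finished for 2+ years
            return CACHE_1DAY*365                    # 1-year cache
        return CACHE_1DAY*90                         # middle tier, cf. the logs below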

ZeroQI commented 4 years ago

@EndOfLine369 Excellent breakdown. Most of a library is old series, at least for some users, so I would like to take the strain off. I think this dynamic cache is the way to go.

sven-7 commented 4 years ago

How would this affect the Plex option to “refresh all metadata”? Does that override this? I'm fully okay with that; it might help prevent the AniDB bans when doing a full refresh.

ZeroQI commented 4 years ago

That's the plan: pull new data according to how long ago the series/movie aired, keeping the metadata fresh where it needs to be and delaying the rest...

sven-7 commented 4 years ago

That'd be great. I can get through most of my library on a "refresh all metadata" but eventually end up with a ban unless I switch my IP to a VPN halfway through.

sven-7 commented 4 years ago

Nothing for this has been implemented yet, right? I did a refresh and was banned after about 150 series.

ZeroQI commented 4 years ago

https://github.com/ZeroQI/Hama.bundle/blob/master/Contents/Code/common.py lines 364-375, specifically line 374:

    if not file or file_age > (cache if ended==None else CACHE_1DAY*364 if ended else CACHE_1DAY*0.9):  # last ep aired

A "last ep aired, no title" boolean improvement, i.e. appending 'or lastEpAiredNoTitle'? TheTVDB doesn't give an end date, but it does give a status; if so we could calculate the last aired date, or, since some statuses are never accurate, let's use file age instead.

EndOfLine369 commented 4 years ago

Got a few enhancements/cleanups/reductions done on the EOL branch. This will be next on my list. https://github.com/ZeroQI/Hama.bundle/commits/EOL/Contents

ZeroQI commented 4 years ago

@EndOfLine369 I counted 13 to date, so more than a few :D Love the code reduction. Will look in depth later

EndOfLine369 commented 4 years ago

Pushed into the branch.

EX:
common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=10901', Filename: 'AniDB\xml\10901.xml', Age: '29.64 days', Limit: '365 days'
[ ] title: Food Wars! Shokugeki no Soma
common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=11828', Filename: 'AniDB\xml\11828.xml', Age: '0.00 days', Limit: '365 days'
common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=13244', Filename: 'AniDB\xml\13244.xml', Age: '0.00 days', Limit: '90 days'
common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=13658', Filename: 'AniDB\xml\13658.xml', Age: '0.00 days', Limit: '90 days'
common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=14951', Filename: 'AniDB\xml\14951.xml', Age: '0.00 days', Limit: '6 days'

common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=8312', Filename: 'AniDB\xml\8312.xml', Age: '29.63 days', Limit: '365 days'
[ ] title: Naruto Shippuuden Movie 5

common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=12661', Filename: 'AniDB\xml\12661.xml', Age: '0.00 days', Limit: '6 days'
[ ] title: Boruto: Naruto Next Generations

common.LoadFile() - File cached locally - url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=8357', Filename: 'AniDB\xml\8357.xml', Age: '29.65 days', Limit: '365 days'
[ ] title: Kizumonogatari

After review, it is not possible to make the TVDB files smart. Only the series JSON has any indication of whether the series is done or not; all of the other JSONs pulled would still have the standard cache time, so just leaving it as is.

EndOfLine369 commented 4 years ago

The only thing I can think of is possibly putting the logic directly in the TVDB module: a standard cache pull of the series JSON, and based on whether that pull says Ended or not, it adjusts all later JSON pulls. That is the only way for the later JSON pulls to know what was in the series JSON. Thoughts, @ZeroQI?
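
For illustration, the idea could look something like this; the helper names are hypothetical, and the 'status' field with values like 'Ended'/'Continuing' is what TheTVDB v3 series JSON exposes:

    CACHE_1DAY = 86400

    def pick_tvdb_cache(series_json):
        # One standard-cache pull of the series JSON decides the cache
        # limit used for all of the later episode/actor JSON pulls.
        ended = series_json.get('status') == 'Ended'
        return CACHE_1DAY*365 if ended else CACHE_1DAY*6

    # series_json   = load_series_json(sid)          # hypothetical standard-cache pull
    # episode_cache = pick_tvdb_cache(series_json)   # reuse for subsequent pulls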

ZeroQI commented 4 years ago

Seems like a good idea. The ended date should be in a format compatible with the cache's elapsed-time check. Most important is AniDB though, as you can be banned...

sven-7 commented 4 years ago

AniDB has an <enddate> line in their XML files. Sadly, you can't base anything off of that unless you are able to get the XML in the first place, due to the ban... I suppose a file could be built up over time with that info, or it could be sourced separately from AniDB somehow?

Examples:

Ongoing series:

<?xml version="1.0" encoding="UTF-8"?><anime id="14819" restricted="false">
<type>TV Series</type>
<episodecount>24</episodecount>
<startdate>2019-10-09</startdate>
<titles>
<title xml:lang="x-jat" type="main">Nanatsu no Taizai: Kamigami no Gekirin</title>

Finished series:

<?xml version="1.0" encoding="UTF-8"?><anime id="9807" restricted="false">
<type>TV Series</type>
<episodecount>25</episodecount>
<startdate>2013-09-22</startdate>
<enddate>2014-03-30</enddate>
<titles>
<title xml:lang="x-jat" type="main">Magi: The Kingdom of Magic</title>

EndOfLine369 commented 4 years ago

@sven-7, AniDB is already handled, as mentioned above: https://github.com/ZeroQI/Hama.bundle/issues/369#issuecomment-558455893. The question is about the other sources.

sven-7 commented 4 years ago

Ah, my bad. I missed that one!

EndOfLine369 commented 4 years ago

Btw, feel free to switch your install to the branch if you want to use it as is at this time. I'm still making more updates aside from this ticket's cache management. It is only somewhat tested, so do not be surprised if you run into a crash/error; use at your own risk. 😄 Still more coming down the pipeline.

EndOfLine369 commented 4 years ago

@sven-7, can you please help test? Either switch your clone to the EOL branch, or you can download the files from https://github.com/ZeroQI/Hama.bundle/archive/EOL.zip. I'm still mulling over how to code some smart TVDB caching, but AniDB is the big one with the ban triggers.

sven-7 commented 4 years ago

Yep - sorry for the delay, been down sick for a few days. Will start to take a look today/tomorrow.

sven-7 commented 4 years ago

Hi @EndOfLine369 - I tested out a full refresh. I was banned from AniDB after ~138 series and about 30 minutes into the refresh.

EndOfLine369 commented 4 years ago

Based on your previous posts, did you delete your Data files before the refresh? You shouldn't have; this does nothing for pulls from scratch, which are unavoidable. I see you mentioning doing that a lot on other tickets.

sven-7 commented 4 years ago

Hmm. Not this time, but there probably wasn't much data in there anyway as I've probably deleted it within the last week. So... might as well be from scratch. Bummer.

I'll incrementally refresh the library 30-40 series at a time over the next day to build it back up to full data, then start giving it another go.

So my understanding is this fix removes the need to purge data before a refresh, as it will intelligently decide what needs updating and what does not, based on series age?

EndOfLine369 commented 4 years ago

You never needed to purge data. And yes, the AniDB XML pull is based on series age, but it has to have the XML in the first place to determine when it should next pull it.

ZeroQI commented 4 years ago

Shall we count uncached AniDB pulls and delay further so as not to be banned? Seemingly one packet every 4s should pass long-term, according to https://github.com/ShokoAnime/ShokoServer/issues/379

EndOfLine369 commented 4 years ago

Not true. We already have a 6 sec sleep in the AniDB calls: https://github.com/ZeroQI/Hama.bundle/blob/EOL/Contents/Code/AniDB.py#L167 (sleep=6). They ban on both volume & frequency.

ZeroQI commented 4 years ago

There is indeed a 6s sleep time in common.py for AniDB. Tried editing from my phone but GitHub didn't agree; huge delay...

What could still lead to bans: if we have the lock working despite multiple threads, it shouldn't ban when there are local files, since we put in a 6s delay, BUT it could be pulling AniDB posters and tripping the protection? Poster downloads for AniDB need to use the locking mechanism and pause too.
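
For illustration, the locking idea amounts to something like this (the wrapper and its names are hypothetical; the point is that XML and poster pulls share one lock and one delay):

    import threading
    import time

    ANIDB_LOCK  = threading.Lock()  # shared across all agent threads
    ANIDB_SLEEP = 6                 # the 6s delay discussed above

    def anidb_fetch(url, loader):
        # Serialize every AniDB hit (XML or poster) behind the same lock,
        # holding it through the delay so parallel threads cannot burst.
        with ANIDB_LOCK:
            data = loader(url)
            time.sleep(ANIDB_SLEEP)
        return data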

sven-7 commented 4 years ago

Slowly refreshing the library one letter of the alphabet at a time to build it back up for testing.

The scanner also pulls from AniDB too, right? Does the scanner and HAMA pulling back-to-back have any effect on this? I typically am not pulling the same thing more than once per 24 hours, so it has to be volume related on my end. Though maybe the scanner pull has an effect (I see files in /AppData/Local/Temp/).

I feel like the ban always results after a few big series get pulled too. Inevitably, it ends up pulling Dragon Ball, Dragon Ball Z, Detective Conan, etc. pretty close together.

EDIT: My other thought could be tvdb2/3/4 or anidb2/3/4 shows. Are those requests getting pulled with the delay, or all at the same time?

sven-7 commented 4 years ago

I think I may have found something worth looking at. In two instances, I've been banned for a refresh on a single show.

One is Lupin III, which pulls at least 15 AniDB XMLs (before banning) that are related to Lupin. Maybe the timeout isn't applying here? I have this series on plain tvdb, with episode formatting as SXEE for all five seasons, since the absolute numbering on TVDB is missing for chunks.

The other is The Disastrous Life of Saiki K. This is on TVDB4 with a custom mapping file.

Here are the agent logs. It doesn't look like the scanner leaves logs anymore?

Lupin III.agent-update.log
The Disastrous Life of Saiki K.agent-update.log

EndOfLine369 commented 4 years ago

Check _root_.agent.log for url calls/times.

sven-7 commented 4 years ago

It looks like there is a 6s delay. The log I posted should be from when I started to refresh the letter 'L' in the alphabet; the Lupin requests are at the end. That's the only HAMA log showing up with 'banned' in it.

log.txt

EndOfLine369 commented 4 years ago

If we have the lock working despite multiple threads, it shouldn't ban when there are local files, since we put in a 6s delay, BUT it could be pulling AniDB posters and tripping the protection? Poster downloads for AniDB need to use the locking mechanism and pause too.

The lock is working fine, and posters are not the trigger. The perfect example is @sven-7's "Lupin III" case, where the ban triggered in the middle of the meta pull, well before the poster pull. If anything, the posters only add to the leeching threshold. But even then, once a poster is pulled, it is never pulled again.

The scanner also pulls from AniDB too, right? Does the scanner and HAMA pulling back-to-back have any effect on this? I typically am not pulling the same thing more than once per 24 hours, so it has to be volume related on my end. Though maybe the scanner pull has an effect (I see files in /AppData/Local/Temp/).

Files are not getting added into Plex by scanning here; this is solely a metadata refresh by HAMA, so the scanner's pulls do not trigger the ban. Not saying it doesn't have the potential to cause it, but I have not seen any issues with an XML being pulled twice, once by ASS and then again by HAMA.

I feel like the ban always results after a few big series get pulled too. Inevitably, it ends up pulling Dragon Ball, Dragon Ball Z, Detective Conan, etc. pretty close together.

Exactly, it's not that we are pulling too fast or pulling the same entry repetitively (flooding). It is a volume issue. Based on my experience, if we pull too many XMLs at once, it will hit a ban as we are seen as leeching.

EDIT: My other thought could be tvdb2/3/4 or anidb2/3/4 shows. Are those requests getting pulled with the delay, or all at the same time?

Any included AniDB xml pull, no matter the source mode, has the 6 sec sleep.

One is Lupin III, which pulls at least 15 AniDB XMLs (before banning) that are related to Lupin. Maybe the timeout isn't applying here? I have this series on plain tvdb, with episode formatting as SXEE for all five seasons, since the absolute numbering on TVDB is missing for chunks.

The other is The Disastrous Life of Saiki K. This is on TVDB4 with a custom mapping file.

It just adds to the leeching threshold they monitor. Based on the log you provided, Lupin III alone involves pulling 40 XMLs in ~4 min.

It looks like there is a 6s delay. The log I posted should be from when I started to refresh the letter 'L' in the alphabet; the Lupin requests are at the end. That's the only HAMA log showing up with 'banned' in it.

The 40 XMLs just put it that bit over their leeching threshold.

sven-7 commented 4 years ago

Thanks for such a thorough response.

It just adds to the leeching threshold they monitor. Based on the log you provided, Lupin III alone involves pulling 40 XMLs in ~4 min.

The 40 XMLs just put it that bit over their leeching threshold.

In sum, it sounds like the bans I'm experiencing are volume based, likely from marginally passing the 4m/40xml threshold.

Do you think the cases for this are fringe enough to not warrant any changes? The 6s delay gets you to exactly 4m/40xml, but maybe that's in a perfect world? It's likely only on library add or first-time HAMA usage that a ban will trigger. I imagine sometimes a ban will happen and sometimes it won't; it's probably down to the circumstances of how the call goes.

EndOfLine369 commented 4 years ago

The only thing I can think we can do to prevent a leeching ban is keep a running total of XML pulls over the last hour and be a bouncer: lock out any new requests until we see fewer than X (maybe 100?) within the last hour, then allow more, keeping the hourly volume in check. I don't know how Plex will handle those refresh calls taking so long, as they could be locked out for an extended time.
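
A minimal sketch of that bouncer, assuming a shared in-memory sliding window (the class and names are illustrative, not the eventual LoadFile() change):

    import threading
    import time

    class PullBouncer(object):
        # Sliding-window throttle: block until fewer than `limit` pulls
        # have happened within the last `window` seconds.
        def __init__(self, limit=100, window=3600):
            self.limit, self.window = limit, window
            self.pulls = []                  # timestamps of recent pulls
            self.lock  = threading.Lock()

        def wait_for_slot(self):
            while True:
                with self.lock:
                    now = time.time()
                    self.pulls = [t for t in self.pulls if now - t < self.window]
                    if len(self.pulls) < self.limit:
                        self.pulls.append(now)
                        return
                    wait = self.window - (now - self.pulls[0])
                time.sleep(max(wait, 1))     # wait for the oldest pull to age out

    # bouncer = PullBouncer(limit=100, window=3600)
    # bouncer.wait_for_slot()                # call before each uncached AniDB request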

sven-7 commented 4 years ago

I think that's a creative solution to the problem. I suppose it's a matter of identifying what the time-to-XML ratio/threshold is. I think 100 within an hour seems reasonable.

The initial library load will take a while, but better than a ban. Especially because, with a ban, each refresh only gets through the same early part of the library and the later parts never get AniDB data.

EndOfLine369 commented 4 years ago

The other thing is to just start rejecting refreshes once that limit is reached, instead of locking out pending requests. Plex would probably like that better. The problem would be how to let the user know (without having them look at the logs) that that is what happened when they see no metadata loaded or updated.

sven-7 commented 4 years ago

The other thing is to just start rejecting refreshes once that limit is reached, instead of locking out pending requests. Plex would probably like that better.

In that model, would the refresh pick back up once the limit is over?

The problem would be how to let the user know (without having them look at the logs) that that is what happened when they see no metadata loaded or updated.

Agree that's a tough part to document. We could list it in the readme, but I feel like people often skip over that. Still - I think it's a good throttle to have implemented.

EndOfLine369 commented 4 years ago

The other thing is to just start rejecting refreshes once that limit is reached, instead of locking out pending requests. Plex would probably like that better.

In that model, would the refresh pick back up once the limit is over?

You would just have to do another refresh call after that hour. The earlier series refreshed will not re-pull the XML, and all meta should match as before, so it would effectively start again where it left off when it started rejecting work.

The problem would be how to let the user know (without having them look at the logs) that that is what happened when they see no metadata loaded or updated.

Agree that's a tough part to document. We could list it in the readme, but I feel like people often skip over that. Still - I think it's a good throttle to have implemented.

Yeah, they are more likely to not see it in the readme and to post in the forum or create a ticket here for what is business-as-usual behavior. We would possibly have to do something with the title to make it absolutely clear, and I could see issues/complaints in trying to do that.

EndOfLine369 commented 4 years ago

@ZeroQI, thoughts on either of these two potential approaches to preventing leeching bans? Leaning towards just putting the refresh calls on hold till the headroom has cleared for more to go through.

ZeroQI commented 4 years ago

I would lean towards that too, but I would keep searching for a way to avoid the ban entirely, and throttle only if we cannot avoid it. Some questions:

1- Is there a 6s delay in the scanner for AniDB files if pulled? Could it use the cache only, if available?
2- Is there a magical value that allows the agent to run for long series, aka 12s or more? By measuring the number of series and the time to get banned, we could discover the sweet-spot value to avoid being banned.
3- Can we generate the message that Plex displays about update status?

That is why I want to cache the most, to avoid bans at all cost.

sven-7 commented 4 years ago

@EndOfLine369 - in your current theory, what would happen if the limit hits in the middle of a series? Would it finish that one out and refuse the next, or drop midway through?

Having some sort of sliding scale might be interesting, based on series length and content (predicted or known) in the XML. For series up to 26 eps it's 6s, for series between 27-59 it's 8s, 60-99 it's 10s, and 100+ it's 12s?

I was being really careful today, but got banned again. I think it was just sheer volume within an hour (even though I had breaks of several minutes between batches).

EndOfLine369 commented 4 years ago

I would lean towards that too, but I would keep searching for a way to avoid the ban entirely, and throttle only if we cannot avoid it. 1- Is there a 6s delay in the scanner for AniDB files if pulled? Could it use the cache only, if available?

Yep, 6s. https://github.com/ZeroQI/Absolute-Series-Scanner/blob/master/Scanners/Series/Absolute%20Series%20Scanner.py#L184

2- Is there a magical value that allows the agent to run for long series, aka 12s or more? By measuring the number of series and the time to get banned, we could discover the sweet-spot value to avoid being banned.

Asked a year ago; it will never be answered. It is not in their interest to ever detail their thresholds/methods. https://anidb.net/forum/thread/85427

3- Can we generate the message that Plex displays about update status? That is why I want to cache the most, to avoid bans at all cost.

They haven't done the alert message popups for a while now; you have to look directly at the alerts page. But even then, I don't see anything in their Python, so only their compiled code has that function.

@EndOfLine369 - in your current theory, what would happen if the limit hits in the middle of a series? Would it finish that one out and refuse the next, or drop midway through?

It would just pause midway through and continue on once headroom is freed up.

Having some sort of sliding scale might be interesting, based on series length and content (predicted or known) in the XML. For series up to 26 eps it's 6s, for series between 27-59 it's 8s, 60-99 it's 10s, and 100+ it's 12s?

A sliding scale would still potentially hit the leeching threshold. Best to let it run as fast as it can while staying under that limit as business as usual.

I was being really careful today, but got banned again. I think it was just sheer volume within an hour (even though I had breaks of several minutes between batches).

You can look at your "DataItems/AniDB/xml" folder and see how many XMLs you pulled, and at what times, before you got banned.

sven-7 commented 4 years ago

Asked a year ago; it will never be answered. It is not in their interest to ever detail their thresholds/methods. https://anidb.net/forum/thread/85427

Yeah. Here, they say 300 (within a day) should be fine. That confirms for me that we're fine on the 6s part; it's the overall number of XMLs over an unknown time period that we're up against. https://anidb.net/forum/thread/92150

ZeroQI commented 4 years ago

There is either a daily limit for requests or a continuous per-hour limit, and we need to raise the timing until we figure out the numbers...


sven-7 commented 4 years ago

@EndOfLine369

You can look at your "DataItems/AniDB/xml" folder and see how many XMLs you pulled, and at what times, before you got banned.

Here is some data from yesterday over the course of 120 minutes:

So right now, it could be an average of 100/hour and I broke that?

EndOfLine369 commented 4 years ago

Throttle option added to LoadFile(). AniDB is set for a max of 100 pulls over 1 hr.

EndOfLine369 commented 4 years ago

Enhanced HAMA: reduce AniDB XML pulls via mappingList['possible_anidb3']. This reduces the AniDB XML pulls for anidb3 setups. Since anidb3 is translated and added in as 'tvdb', it could also just be a plain 'tvdb' entry by the time HAMA sees it, so we only want to pull all the AniDB XMLs on a possible 'anidb3' entry. 'possible_anidb3' is set when an episode number >100 is found, which is how ASS adds such entries in.
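
Roughly, the detection described above comes down to something like this (a sketch; the surrounding variable names in the branch may differ):

    def flag_possible_anidb3(mapping_list, episode_numbers):
        # ASS adds anidb3 entries in as 'tvdb' using episode numbers above
        # 100, so seeing an episode >100 marks the entry as possibly anidb3;
        # only then are all the related AniDB XMLs worth pulling.
        mapping_list['possible_anidb3'] = any(int(ep) > 100 for ep in episode_numbers)
        return mapping_list

    # flag_possible_anidb3({}, ['1', '101'])  # -> {'possible_anidb3': True}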

EndOfLine369 commented 4 years ago

Tweaked the cache age some more. The cache limit age cap is increased and calculated instead of using fixed values: 2 years didn't seem old enough to require a 1-year cache, so the series-ended age for the 1-year cache cap has been increased to 5 years, and for series ended <5 years ago the cache limit is now dynamically calculated based on the percentage of where it sits within those 5 years.

EX (see 'Limit:'):

common.LoadFile() - File cached locally - Filename: 'AniDB\xml\10901.xml', Age: '2.92 days', Limit: '307 days', url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=10901'
[ ] title: Food Wars! Shokugeki no Soma
common.LoadFile() - File cached locally - Filename: 'AniDB\xml\11828.xml', Age: '2.92 days', Limit: '234 days', url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=11828'
common.LoadFile() - File cached locally - Filename: 'AniDB\xml\13244.xml', Age: '0.02 days', Limit: '144 days', url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=13244'
common.LoadFile() - File cached locally - Filename: 'AniDB\xml\13658.xml', Age: '0.02 days', Limit: '106 days', url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=13658'
common.LoadFile() - File cached locally - Filename: 'AniDB\xml\14951.xml', Age: '0.02 days', Limit: '6 days', url: 'http://api.anidb.net:9001/httpapi?request=anime&client=hama&clientver=1&protover=1&aid=14951'

EndOfLine369 commented 4 years ago

Let me know if you can think of a better way to space out the limit programmatically, and if you see issues with AniDB bans from a full library refresh now that the throttle limiter is in place.

sven-7 commented 4 years ago

I’ll try to test this today or tomorrow.

So a series ended 5+ years has a 1-year cache. How are the others breaking down percentage-wise? If it's 2.5 years old, is it essentially a six-month cache? Does that make ongoing series close to, or just a bit more than, a daily cache?

EndOfLine369 commented 4 years ago

Anything yet to end, or <=30 days since its end, is still at the 6-day cache. Starting at 31 days is when the limit starts to be >6 days, so we don't want it to go lower than that 😄.

      if   days_old > 1825:  cache = CACHE_1DAY*365                  # enddate > 5 years ago = 1 year cache
      elif days_old >   30:  cache = (days_old*CACHE_1DAY*365)/1825  # enddate > 30 days ago = days_old/5 (days_old/5yrs ended = x/1yr cache)

>>> days_old=30
>>> CACHE_1DAY=86400
>>> (days_old*CACHE_1DAY*365)/1825/CACHE_1DAY
6.0
>>> days_old=31
>>> (days_old*CACHE_1DAY*365)/1825/CACHE_1DAY
6.2
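
Putting the pieces together, the selection presumably looks something like this; the final branch (the 6-day floor) is inferred from the description above rather than copied from the code:

    CACHE_1DAY = 86400

    def anidb_cache_limit(days_old):
        # days_old: days since the series' enddate (<= 0 if still airing)
        if days_old > 1825:                        # ended more than 5 years ago
            return CACHE_1DAY*365                  # capped at a 1-year cache
        elif days_old > 30:                        # ended 31..1825 days ago
            return days_old*CACHE_1DAY*365/1825    # linear ramp: days_old/5 days
        return CACHE_1DAY*6                        # ongoing or just ended (inferred)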

sven-7 commented 4 years ago

Okay, cool. I’ll try it out and see what happens.

EndOfLine369 commented 4 years ago

Sure, let me know how it goes fellow insomniac 😈