danmarsden / moodle-plagiarism_urkund

Ouriginal plagiarism plugin for Moodle (previously called Urkund)
https://ouriginal.com/
12 stars 19 forks source link

Don't understand why debug fills up with 202 'errors' #61

Closed thepurpleblob closed 8 years ago

thepurpleblob commented 8 years ago

I came in this morning to c. 20 pages of submissions in 202 state and complaints of scores/reports not being available.

On this occasion, clicking the 'Get scores' button seems to have got scores for most of them. However, I'm not at all happy about this. Isn't the plugin meant to keep trying for outstanding scores until they become available? Or have I misunderstood how it works? Also, what's special about these submissions that their reports/scores don't come back without manual intervention.

If this had been over a weekend we would have been very unpopular.

Can something be done to make this more robust?

danmarsden commented 8 years ago

I wonder if you are just looking at the debugging page too frequently? - Possible that you had a massive number of submissions overnight and URKUND hadn't processed them all quite yet?

It checks for the scores on an increasing time delay - but the debugging page "overrides" the standard time delay and lets you "check now" - Here's a breakdown on when each attempt should be getting checked:

attempt 1: 5 minutes attempt 2: 11 minutes attempt 3: 18 minutes attempt 4: 27 minutes attempt 5: 37 minutes attempt 6: 50 minutes attempt 7: 65 minutes attempt 8: 82 minutes attempt 9: 104 minutes attempt 10: 130 minutes attempt 11: 161 minutes attempt 12: 198 minutes attempt 13: 980 minutes attempt 14: 1377 minutes attempt 15: 1932 minutes attempt 16: 2710 minutes attempt 17: 3799 minutes

danmarsden commented 8 years ago

if you continue to get them backing up it would be good to know what the "attempt" value for those records is.

thepurpleblob commented 8 years ago

They varied but a significant number were over 24 hours old. They all said "attempt 19".

We only get to know when somebody starts to panic about approaching submission deadlines and no Urkund results.

Seems Urkund had some technical issues yesterday but that (in my head anyway) shouldn't break it. My view is that the lengthier retries are too long. Under what circumstances would it be appropriate to back off for 3 days (ish)? By the time we are looking at attempt 10 (c. a couple of hours) I really would prefer the system to alert me to there being a problem (in our experience Urkund don't always know either). Unless you have a very good reason for doing otherwise, I would suggest that a maximum retry period in the region of 2-3 hours should be the max.

danmarsden commented 8 years ago

if you got to attempt 19 in 24 hours they may be running faster than I planned.. once it gets that high it will stop checking assuming that something else has gone wrong. - URKUND asked for the last check to happen at 36hrs after initial submission and then to "stop" trying unless someone manually runs the get scores check on the debugging page. The old version of the code made this check less frequently and made a smaller number of checks.

danmarsden commented 8 years ago

attempt: 0, wait: 30 attempt: 1, wait: 60 attempt: 2, wait: 120 attempt: 3, wait: 240 attempt: 4, wait: 480 attempt: 5, wait: 960 attempt: 6, wait: 1920 attempt: 7, wait: 3360 attempt: 8, wait: 4800 attempt: 9, wait: 6240

danmarsden commented 8 years ago

That's the values (in minutes) that the old version of the code checked for an available score/report.

danmarsden commented 8 years ago

I used a quick patch/hack to change the values to the new checking times (which is why they are not even numbers)

danmarsden commented 8 years ago

arg - sorry, URKUND asked for the last check to be the one that occurs at least 48hrs after the intial submission (which should be attempt 17 at 63hrs) - earlier version of the code in testing made the last attempt number 20

thepurpleblob commented 8 years ago

I can see what they are thinking but, unfortunately, I think something is going wrong. My guess is that while they had server problems it must have run up to 19 way faster than it should have done.

danmarsden commented 8 years ago

it doesn't rely on URKUND status to determine how often it runs the checks - if it's happening more frequently it's more likely to be my bad code.

danmarsden commented 8 years ago

... looking into it.

thepurpleblob commented 8 years ago

I'm just looking at it... call me old fashioned but I would probably have just coded a lookup table for the times :-D

danmarsden commented 8 years ago

yeah - the original code with original timings made more sense - I didn't want to rewrite it all to match the timings URKUND were looking for but that's made the code more complicated than it should be.

danmarsden commented 8 years ago

looks like the normal code does the last check at 16 attempts instead of 17 and it "should" change the statuscode in the table from "202" to "timeout" when it hits that value. I'm not quite sure how you got to "19" unless you're using a much older version of the code when URKUND_MAX_STATUS_ATTEMPTS was set to 20 in urkund/lib.php

danmarsden commented 8 years ago
thepurpleblob commented 8 years ago

I might be getting confused. I had a fair few 'timeout' status with attempt=19 so I'm probably looking at those. The 202s had attempt numbers of some sort.

My assumption is that all that is happening is that the server goes offline (which it did) for a few hours and the submissions get stuffed. My issue (in that case) is that if you miss 3.3 hours then the next check is 16 hours! I appreciate that Urkund don't want their servers hammered but 16 hours is crazy.

A student submitting 12 hours before a submission deadline, but in the middle of a 4 hour server outage) will miss their deadline. This equals trouble. If it checked every 2 hours then it would all be fine more-or-less.

danmarsden commented 8 years ago

yeah - time to rewrite the attempt code and make it simpler :+1: I'm not sure if I'll get to that this side of Christmas though.

danmarsden commented 8 years ago

(you could probably change this on your own version now by decreasing URKUND_MAX_STATUS_DELAY to 2hrs and increasing URKUND_MAX_STATUS_ATTEMPTS appropriately.)

nomadwarmachine commented 8 years ago

I'm not sure you realise how high stakes this is for us. At a Uni with 26K FTE telling us to just change a line of code is not a solution. The sign off process to implement a new version of a plugin is hectic. We rushed one through to update to your latest version. Finding out it has these problems is worrying.

danmarsden commented 8 years ago

I would also point out that this particular discussion relates to an issue that would likely be worse using the older code because the number of checks were much lower and less frequent. All code has issues - particularly pre-release code that has been tested by a smaller group of users.

thepurpleblob commented 8 years ago

Yep - I think we are getting there though. I have also emailed Urkund to open a discussion about this. There is clearly some tension surround the retry regime. They won't want huge numbers of retries but we need to get our scores back :)

I am certainly not above hacking the code but, clearly, I would prefer not to.

They should probably have implemented this as a callback and introduced a whole different bunch of problems!!

danmarsden commented 8 years ago

cool - we have an internal Work Request related to the timing changes in the current version - URKUND asked for these to be improved (more frequent/more checks) in the latest version (from 9 attempts in the old version to 17 in the current one) - but I'm sure they will be open to modifying it further.

thepurpleblob commented 8 years ago

Just an update... we are still seeing loads of 202 states in debugging and complaints about delayed reports, so it seems not to be working properly at all.

Again, clicking the 'get scores' button clears it up. So, effectively, we are working the queue manually at the moment.

I couldn't be bothered counting them, but there must have been over 200 '202s' this morning.

danmarsden commented 8 years ago

cool - can you please confirm what the attempt value is for those records - do they vary? I presume "some" reports are coming back automatically? - I wonder what the processing time is like on Urkunds side at the moment.

thepurpleblob commented 8 years ago

Actually... these seem to have fixed themselves. All gone. Urkund have been claiming that they are very lightly loaded this week.

Oh well, I'm possibly just getting paranoid.

danmarsden commented 8 years ago

hehe - all good, I'll close this one off for now - I think there is still some room for improvement in the re-try process but I'll flag that as a separate improvement (have created issue #65 for this.