Lots of calls to the tool take a long time to respond. For instance, https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Jin-Woo%20Jung&filter=wdt%3AP50+wd%3AQ56487396 just took about a minute.
I assume that this can be tweaked by making the default search parameters less fuzzy.
Something's going on right now that's making pages not load at all. I can't figure out the underlying problem; the server is freezing up somehow. I can log into the Kubernetes pod and look around, but I can't tell what's going on - somehow it's just not responding to requests?!
When I filed the ticket, the tool still worked, albeit very slowly. It stopped working entirely about a day ago. I have looked through Phabricator but did not see anything that would be a likely candidate to explain this.
@Daniel-Mietchen it's working now; the best guess I have right now is that it was swamped with requests for some reason yesterday. Today it's only been getting a few and hasn't frozen at all yet. I guess I'll need to look into how to make it a bit more robust...
It seems down again.
I restarted. I do have access logs now, hopefully I can track down what's happening this time.
Still down for me, or perhaps again.
It's back up again.
It looks like the problem is a spider or bot (or maybe 2 of them) that are following the non-OAuth links from somewhere. I'm going to look at rate-limiting those pages.
Hmm - looks like the spider/bot stopped about 24 hours ago. That might be because of the "robots.txt" file I added at the top level about 11 hours before that, to block any robot indexing of the pages. I'll leave things alone for now, but if it pops up again I'll definitely look at rate-limiting or other options...
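For reference, blocking all compliant crawlers from the whole tool takes just a top-level robots.txt along these lines (a minimal sketch, not necessarily the exact file I deployed):

```
User-agent: *
Disallow: /
```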
Still down.
Not quite down - if you wait long enough you'll get a response now. I've added an initial rate limit, but I may have to make it stricter. The problem seems to be that these bots are generating complex query pages that take minutes to run, occupying all the available FastCGI processes; other requests then wait until a process frees up, which can take a few minutes. I'm trying to think of a way to better throttle these bot queries without bothering regular users much... I might need to turn off some features.
(obviously the robots.txt file didn't work though!)
I've added a severe throttle on the non-OAuth pages for any query that matches more than 100 articles for a name. Basically you won't get a useful response there unless the service is very quiet, and even then only about once every 5 minutes. This seems to have at least calmed down the bot activity, and pages are responsive again. I can adjust these settings if they're causing trouble for people...
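In outline the throttle works something like this - a Python sketch of the logic (the tool itself is PHP, and the real check also factors in how busy the service is, which this sketch leaves out):

```python
import time

WINDOW_SECONDS = 5 * 60      # one expensive query per client per 5 minutes
EXPENSIVE_THRESHOLD = 100    # "expensive" = more than 100 matching articles

last_expensive_query = {}    # client IP -> time of last allowed expensive query

def allow_query(client_ip: str, article_count: int) -> bool:
    """Return True to run the query now, False to answer 'Too many requests'."""
    if article_count <= EXPENSIVE_THRESHOLD:
        return True  # small queries on the non-OAuth pages are not throttled
    now = time.time()
    if now - last_expensive_query.get(client_ip, 0.0) >= WINDOW_SECONDS:
        last_expensive_query[client_ip] = now
        return True
    return False
```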
As things stand, the tool is almost unusable for my use cases, which usually involve coming from a Scholia /missing page to a page like https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Shandell%20Pahlen , which now usually states:
> Too many requests
> Please wait before making another request of this service; note that use of the OAuth option is not rate-limited.
It would be useful to add a link to the corresponding OAuth option, i.e. https://tools.wmflabs.org/author-disambiguator/names_oauth.php?name=Shandell%20Pahlen for the above example. Of note, even these OAuth pages take longer these days than they did about a month ago.
Once the OAuth option is stable, we can switch all the links from Scholia to it directly.
Another issue is that the above error occurs even on item pages like https://tools.wmflabs.org/author-disambiguator/author_item.php?id=Q47503982 , which makes it hard to verify whether that item is the one to which papers with a certain string are being matched.
Yeah, I can imagine this is making things hard. I'll look into adding those links, that should be straightforward. What does Scholia do about web crawler/bot downloads? It must have similar issues surely?
We are currently setting nofollow links https://github.com/fnielsen/scholia/blob/b67d35d19a51a728e9b21a40b342c9eb61ab9981/scholia/app/templates/base.html#L6 but looking into caching, as per https://github.com/fnielsen/scholia/labels/caching .
All the author disambiguator pages have a similar "meta" tag, but in uppercase. I've added the lowercase version in case that makes a difference. I guess adding rel="nofollow" on links might help too. Do you know if there's anything that can be done at the top tools.wmflabs.org level to block bots in some way? I don't think the lower-level robots.txt is getting accessed...
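For reference, the tags in question look roughly like this (a sketch - the exact markup in either tool isn't quoted here, and compliant crawlers should treat the meta values case-insensitively anyway):

```html
<!-- page-level directive in <head>: don't index this page or follow its links -->
<meta name="robots" content="noindex, nofollow">

<!-- per-link hint on an outgoing link (hypothetical target) -->
<a href="https://example.org/expensive-query" rel="nofollow">run query</a>
```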
I don't know about that top level stuff, but now even pages like https://tools.wmflabs.org/author-disambiguator/ or https://tools.wmflabs.org/author-disambiguator/index.php give me the "Too many requests" error.
I've added links to the OAuth pages from the overload message; I hope that helps.
The "bot" downloads seem to be coming from China - the user-agents in the requests look like:
or
or
I could try blacklisting a few of these user agents and see if that helps?
I've now added a blacklist for several patterns in these user agents, they should just be getting 403 responses now. Hopefully things will clear up shortly.
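In outline it's just a pattern match against the User-Agent header - a Python sketch of the logic (the tool itself is PHP, and these patterns are placeholders rather than the actual blacklist):

```python
import re

# Placeholder patterns - the real blacklist matches the user agents seen in the logs
BLOCKED_UA_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"spider", r"crawler")]

def check_user_agent(user_agent: str) -> int:
    """Return an HTTP status for the request: 403 for blacklisted agents, else 200."""
    for pattern in BLOCKED_UA_PATTERNS:
        if pattern.search(user_agent or ""):
            return 403  # refuse the request outright
    return 200
```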
Looks like the blacklisting has helped
By contrast, response times are still worse than they were about a month ago, I think. For instance, the example https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Jin-Woo%20Jung&filter=wdt%3AP50+wd%3AQ56487396 from the first comment in this thread still took about a minute, when I would expect less than half that.
Of course, the corpus is growing rapidly, which may slow things down, but I am wondering whether some of the functionality could be provided only on demand rather than by default, in order to save time and also reduce the energy footprint.
We're working on a caching mechanism for Scholia and hope to have something up and running in a week or so from now.
That could potentially involve some caching of Author Disambiguator pages, e.g. at the point that a Scholia /missing page is rendered.
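Roughly, the idea is a short-lived cache keyed by the query - a minimal Python sketch of the shape (the TTL, backend, and function names here are placeholders, not the actual Scholia design):

```python
import time
from functools import wraps

def ttl_cache(seconds=3600):
    """Cache a function's results for a fixed time-to-live."""
    def decorator(func):
        store = {}  # args -> (expiry timestamp, cached value)
        @wraps(func)
        def wrapper(*args):
            now = time.time()
            hit = store.get(args)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh, skip the expensive work
            value = func(*args)
            store[args] = (now + seconds, value)
            return value
        return wrapper
    return decorator

# Placeholder for the expensive SPARQL-backed rendering of a /missing page
@ttl_cache(seconds=3600)
def render_missing_page(author_name):
    ...
```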
I've made a number of changes to the behavior of the interactive pages under "maxlag" conditions - in particular, they no longer hang indefinitely waiting for maxlag to drop, but return quickly with a note about the problem and a link to retry the edits. I think this may have been responsible for some of the slowness: with only 4 PHP server processes available, if all 4 were sleeping waiting on maxlag then no pages could respond at all until the lag cleared. I'm not aware of anything else that can be done on this right now, though.
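In outline the new behavior looks like this - a Python sketch of the logic (the tool itself is PHP), assuming the standard MediaWiki maxlag convention:

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def try_edit(session: requests.Session, params: dict) -> dict:
    """Attempt one write; on maxlag, report back instead of sleeping.

    MediaWiki returns an error object with code "maxlag" (plus a
    Retry-After header) when replication lag exceeds the maxlag= value.
    """
    params = dict(params, maxlag=5, format="json")
    reply = session.post(API, data=params).json()
    error = reply.get("error", {})
    if error.get("code") == "maxlag":
        # Old behavior: sleep and retry, tying up a scarce FastCGI worker.
        # New behavior: return at once so the page can show a retry link.
        return {"ok": False, "retry": True, "info": error.get("info")}
    return {"ok": "error" not in reply, "retry": False, "reply": reply}
```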
One more issue here that may be key - I'm loading way too much data on authors via the Wikidata API in cases where the matching papers have a lot of authors. I think I can greatly simplify that; looking into it now.
Hmm, I made this change but the speed improvement seems to be marginal in many cases. Maybe it helps some of the worst offenders though (authors that match to thousand-author-list papers). I'll experiment a bit more, there may be similar places I don't need to load everything via the API.
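The shape of the change, sketched in Python (the tool itself is PHP, and wbgetentities with restricted props is my assumption about the relevant call): fetch only the labels actually displayed, rather than full entities with all their claims.

```python
import requests

API = "https://www.wikidata.org/w/api.php"

def fetch_author_labels(qids):
    """Fetch just labels (not full entities) for up to 50 items per call.

    wbgetentities accepts at most 50 ids per request; props=labels skips
    the claims, sitelinks, etc. that dominate thousand-author papers.
    """
    labels = {}
    for i in range(0, len(qids), 50):
        batch = qids[i:i + 50]
        reply = requests.get(API, params={
            "action": "wbgetentities",
            "ids": "|".join(batch),
            "props": "labels",
            "languages": "en",
            "format": "json",
        }).json()
        for qid, entity in reply.get("entities", {}).items():
            labels[qid] = entity.get("labels", {}).get("en", {}).get("value")
    return labels
```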
The new author-lists feature speeds up matching considerably if you have a limited set of authors you are looking to match against. If there are further things to do on this, let's open a new issue with specifics.