arthurpsmith / author-disambiguator

Wikidata service to help create or link author items to published articles
GNU General Public License v3.0

Pages load very slowly #106

Closed - Daniel-Mietchen closed this issue 4 years ago

Daniel-Mietchen commented 4 years ago

Lots of calls to the tool take a long time to respond. For instance, https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Jin-Woo%20Jung&filter=wdt%3AP50+wd%3AQ56487396 just took about a minute.

I assume that this can be tweaked by making the default search parameters less fuzzy.

arthurpsmith commented 4 years ago

Something's going on right now that's making pages not load at all - I can't figure out the underlying problem; the server is freezing up somehow. I can log into the Kubernetes "pod" and look around, but I can't tell what's going on - somehow it's just not responding to requests.

Daniel-Mietchen commented 4 years ago

When I filed the ticket, the tool still worked, albeit very slowly. It stopped working entirely about a day ago. I have looked through Phabricator but did not see anything that would be a likely candidate to explain this.

arthurpsmith commented 4 years ago

@Daniel-Mietchen it's working now; the best guess I have right now is that it was swamped with requests for some reason yesterday. Today it's only been getting a few and hasn't frozen at all yet. I guess I'll need to look into how to make it a bit more robust...

Daniel-Mietchen commented 4 years ago

It seems down again.

arthurpsmith commented 4 years ago

I restarted it. I do have access logs now, so hopefully I can track down what's happening this time.

Daniel-Mietchen commented 4 years ago

Still down for me, or perhaps again.

Daniel-Mietchen commented 4 years ago

It's back up again.

arthurpsmith commented 4 years ago

It looks like the problem is a spider or bot (or maybe two of them) following the non-OAuth links from somewhere. I'm going to look at rate-limiting those pages.

arthurpsmith commented 4 years ago

Hmm - looks like the spider/bot stopped about 24 hours ago. It might be because I added a "robots.txt" file at the top level to block any robot indexing of the pages (I added that about 11 hours earlier). I'll leave things alone for now, but if it pops up again I'll definitely look at rate-limiting or other options...
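For reference, a top-level robots.txt that asks all crawlers to stay away is just two lines; this is the standard format, though the file actually deployed may differ:

```
User-agent: *
Disallow: /
```

Note that robots.txt is purely advisory - well-behaved crawlers honor it, but abusive ones often ignore it.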

Daniel-Mietchen commented 4 years ago

Still down.

arthurpsmith commented 4 years ago

Not quite down - if you wait long enough, you'll get a response now. I've added an initial rate limit, but I may have to make it stricter. The problem seems to be that these bots are generating complex query pages that take minutes to run, occupying all the available FastCGI processes; generally the server waits until one becomes free, but that can take a few minutes. I am trying to think of a way to better throttle these bot queries without bothering regular users much... I might need to turn off some features.

arthurpsmith commented 4 years ago

(obviously the robots.txt file didn't work though!)

arthurpsmith commented 4 years ago

I've added a severe throttle on any queries for more than 100 articles for a name on the non-OAuth pages. Basically you won't be able to get a useful response there unless the service is very quiet, and then only about once every five minutes. This seems to have at least calmed down the bot activity, and pages are responsive again. I can try adjusting these settings if they're causing trouble for people...
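A rough sketch of this kind of throttle, assuming a shared timestamp file; the file path, threshold, and time window below are illustrative placeholders, not the tool's actual implementation:

```php
<?php
// Sketch: throttle expensive (>100-article) non-OAuth queries to
// roughly one per five minutes, tracked via a shared timestamp file.
// All names and values here are illustrative assumptions.
define('THROTTLE_FILE', '/tmp/ad_big_query_throttle');
define('THROTTLE_WINDOW', 300);     // seconds between big queries
define('BIG_QUERY_THRESHOLD', 100); // articles

function throttle_big_query(int $article_count): void {
    if ($article_count <= BIG_QUERY_THRESHOLD) {
        return; // small queries are never throttled
    }
    $last = @filemtime(THROTTLE_FILE);
    if ($last !== false && time() - $last < THROTTLE_WINDOW) {
        http_response_code(429);
        echo "Too many requests\n";
        echo "Please wait before making another request of this service; " .
             "note that use of the OAuth option is not rate-limited.\n";
        exit;
    }
    touch(THROTTLE_FILE); // record that a big query is starting now
}
```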

Daniel-Mietchen commented 4 years ago

As things stand, the tool is almost unusable for my use cases, which usually involve coming from a Scholia /missing page to a page like https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Shandell%20Pahlen , which now usually states

Too many requests
Please wait before making another request of this service; note that use of the OAuth option is not rate-limited.

It would be useful to add a link to the corresponding OAuth option, i.e. https://tools.wmflabs.org/author-disambiguator/names_oauth.php?name=Shandell%20Pahlen for the above example. Of note, even these OAuth pages take longer these days than they did about a month ago.

Once the OAuth option is stable, we can switch all the links from Scholia to it directly.

Daniel-Mietchen commented 4 years ago

Another issue is that the above error occurs even on item pages like https://tools.wmflabs.org/author-disambiguator/author_item.php?id=Q47503982 , which makes it hard to verify whether that item is the one to which papers with a certain string are being matched.

arthurpsmith commented 4 years ago

Yeah, I can imagine this is making things hard. I'll look into adding those links, that should be straightforward. What does Scholia do about web crawler/bot downloads? It must have similar issues surely?

Daniel-Mietchen commented 4 years ago

We are currently setting nofollow links (https://github.com/fnielsen/scholia/blob/b67d35d19a51a728e9b21a40b342c9eb61ab9981/scholia/app/templates/base.html#L6) but are looking into caching, as per https://github.com/fnielsen/scholia/labels/caching .

arthurpsmith commented 4 years ago

All the author disambiguator pages have a similar "meta" tag, but in uppercase. I've added the lowercase version in case that makes a difference. I guess adding rel="nofollow" links might be helpful too. Do you know if there's something that can be done at the top tools.wmflabs.org level to block bots in some way? I don't think the lower-level robots.txt is getting accessed...
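For reference, the two mechanisms under discussion look something like this (the href is a placeholder):

```html
<!-- robots meta tag (lowercase form) asking crawlers not to index or follow -->
<meta name="robots" content="noindex,nofollow">

<!-- per-link form: compliant crawlers should not follow this particular link -->
<a href="..." rel="nofollow">Look for author</a>
```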

Daniel-Mietchen commented 4 years ago

I don't know about that top level stuff, but now even pages like https://tools.wmflabs.org/author-disambiguator/ or https://tools.wmflabs.org/author-disambiguator/index.php give me the "Too many requests" error.

arthurpsmith commented 4 years ago

I've added links to the OAuth pages from the overloaded message; I hope that helps.

The "bot" downloads seem to be coming from China - the user-agents in the requests look like:

or

or

I could try blacklisting a few of these user agents and see if that helps?

arthurpsmith commented 4 years ago

I've now added a blacklist for several patterns in these user agents; matching requests should just be getting 403 responses now. Hopefully things will clear up shortly.
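A minimal sketch of such a blacklist; the patterns shown are generic placeholders, not the actual strings blocked on the live service:

```php
<?php
// Sketch: return 403 for requests whose User-Agent matches a blocked
// pattern. The patterns are placeholders, not the real blocked agents.
$blacklist = ['ExampleSpider', 'BadBot'];

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
foreach ($blacklist as $pattern) {
    if (stripos($ua, $pattern) !== false) { // case-insensitive substring match
        http_response_code(403);
        exit('Forbidden');
    }
}
```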

Daniel-Mietchen commented 4 years ago

Looks like the blacklisting has helped.

However, the slowness is still worse than what we had about a month ago, I think. For instance, the example https://tools.wmflabs.org/author-disambiguator/?doit=Look+for+author&name=Jin-Woo%20Jung&filter=wdt%3AP50+wd%3AQ56487396 from the first comment in this thread still took about a minute, when I would expect less than half of that.

Of course, the corpus is growing rapidly, which may slow things down, but I am wondering whether some of the functionality could be provided only on demand rather than by default, in order to save time and also reduce the energy footprint.

Daniel-Mietchen commented 4 years ago

We're working on a caching mechanism for Scholia and hope to have something up and running in a week or so from now.

That could potentially involve some caching of Author Disambiguator pages, e.g. at the point that a Scholia /missing page is rendered.

arthurpsmith commented 4 years ago

I've made a number of changes to the behavior of the interactive pages under "maxlag" conditions - in particular, they no longer hang indefinitely waiting for the maxlag to drop, but return quickly with a note about the problem and a link to retry the edits. I think this may have been responsible for some of the slowness: with only 4 PHP server processes available to respond, if all 4 were sleeping while waiting on maxlag, the pages wouldn't respond at all until that cleared. I'm not really aware of anything else that can be done on this right now, though.
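A sketch of the fail-fast behavior described above. The maxlag parameter and the "maxlag" error code are the standard MediaWiki API convention; the function name and the retry link are illustrative assumptions:

```php
<?php
// Sketch: make one Wikidata API call with maxlag set, and fail fast
// instead of sleeping and retrying inside a scarce FastCGI process.
function wikidata_edit_with_maxlag(array $params): array {
    $params['maxlag'] = 5; // standard MediaWiki maxlag parameter
    $params['format'] = 'json';

    $ch = curl_init('https://www.wikidata.org/w/api.php');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($params));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $result = json_decode(curl_exec($ch), true);
    curl_close($ch);

    if (isset($result['error']) && $result['error']['code'] === 'maxlag') {
        // Previously the page would effectively sleep and retry here,
        // tying up one of the few PHP processes; now it reports the
        // problem and lets the user retry.
        echo 'Wikidata is lagged; your edits were not made. ';
        echo '<a href="?retry=1">Retry the edits</a>'; // illustrative link
        exit;
    }
    return $result;
}
```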

arthurpsmith commented 4 years ago

One more issue here that may be key - I'm loading way too much data on authors via the Wikidata API in cases where the matching papers have a lot of authors. I think I can greatly simplify that; I'm looking into it now.
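One way to cut the data loaded per author is to request only labels from wbgetentities rather than full entities with all their claims; whether this matches the exact change made in the tool is an assumption, though the API parameters themselves are standard:

```php
<?php
// Sketch: fetch just English labels for a batch of author items,
// instead of full entities. wbgetentities accepts up to 50 ids per call.
function fetch_author_labels(array $qids): array {
    $labels = [];
    foreach (array_chunk($qids, 50) as $chunk) {
        $params = [
            'action'    => 'wbgetentities',
            'ids'       => implode('|', $chunk),
            'props'     => 'labels', // skip claims, sitelinks, descriptions
            'languages' => 'en',
            'format'    => 'json',
        ];
        $url = 'https://www.wikidata.org/w/api.php?' . http_build_query($params);
        $data = json_decode(file_get_contents($url), true);
        foreach ($data['entities'] ?? [] as $qid => $entity) {
            $labels[$qid] = $entity['labels']['en']['value'] ?? null;
        }
    }
    return $labels;
}
```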

arthurpsmith commented 4 years ago

Hmm, I made this change, but the speed improvement seems to be marginal in many cases. Maybe it helps some of the worst offenders, though (authors that match to papers with thousand-author lists). I'll experiment a bit more; there may be similar places where I don't need to load everything via the API.

arthurpsmith commented 4 years ago

The new author-lists feature speeds things up considerably for matching if you have a limited set of authors you're looking to match against. If there are further things to do on this, let's open a new issue with specifics on what to do.