OBOFoundry / purl.obolibrary.org

A system for managing OBO PURLs
BSD 3-Clause "New" or "Revised" License

Logs and Analysis #63

Closed: jamesaoverton closed this issue 3 years ago

jamesaoverton commented 8 years ago

This new system presents an opportunity to log and analyze OBO traffic.

  1. We should log failures to look for gaps in the new configuration, and for other problems and strange behaviour.
  2. We can provide metrics on OBO usage that could be very valuable, e.g. for demonstrating the use of our ontologies.

My experience, though, is that log analysis systems are either a real pain to set up or kind of expensive (or both). We don't have a lot of resources, in time or money, for this.

I've built ELK stacks (Elasticsearch, Logstash, and Kibana) from open-source parts. I found them a pain to set up, fragile, and pretty heavy: at least a medium-sized server is required.

I don't have experience with commercial offerings, but there are many options, including hosted ELK.

Suggestions? @cmungall @alanruttenberg @kltm

cmungall commented 8 years ago

I'm not familiar with the different systems. I think this is very important, though. If you and @kltm scope out the ideal system, we can look at finding the resources to do this, possibly at OHSU.

jamesaoverton commented 8 years ago

I don't have any time left for this.

It doesn't have to get done tomorrow -- we'll just store our Apache logs for now. But I think it's important to the wider community, and it would be good to do it soon. I'll send an email about this to OFOC.

jamesaoverton commented 8 years ago

Some thoughts on user privacy:

The system I have in mind would only analyze our Apache web server logs. It would not use cookies, and there would not be much user-specific information: just an IP address, the browser's User-Agent string, and the user's HTTP Referer (if provided). That's information that the user has explicitly sent to us in their HTTP request. It's helpful for counting "unique" users and doing some crude geolocation, but not much else. We would not be sharing this information or trying to track users across multiple sites.

alanruttenberg commented 8 years ago

I'll do some poking around. I agree that this is a great opportunity, as long as privacy and respect for our clients remain at the forefront. I haven't looked at the logs, but some simple scripting to analyze them may be useful. Logs are usually a pain to analyze because of volume, but we may not have so much that we can't handle it with ordinary programming.

jamesaoverton commented 8 years ago

Even without a fancy logging system, I can make a Google Spreadsheet with a selection of failing paths, and we can discuss them on the OFOC call and obo-discuss list.
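
A quick sketch of how such a selection can be pulled from the logs, assuming Common Log Format (the request path lands in field 7 and the status code in field 9 when splitting on whitespace); the log path and the choice of 404 as the failure status are illustrative:

    # Count the 20 most frequent failing (404) paths in an Apache access log.
    awk '$9 == 404 {print $7}' /var/log/apache2/access.log \
      | sort | uniq -c | sort -rn | head -20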

jamesaoverton commented 8 years ago

I'm currently logging IP addresses. The first part of each log line is in Common Log Format, which is compatible with the majority of logging tools.
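
For reference, a Common Log Format line looks like this (the values here are illustrative, not real traffic):

    # host ident user [timestamp] "request" status bytes
    127.0.0.1 - - [25/Nov/2015:10:27:10 +0000] "GET /obo/GO_0008150 HTTP/1.1" 200 4523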

In a comment on commit https://github.com/OBOFoundry/purl.obolibrary.org/commit/06521b1134be2f0380a803fd64cb2932ee93d110#commitcomment-14597187, @alanruttenberg suggests using Apache's HostnameLookups directive to facilitate log analysis. The documentation says:

The default is Off in order to save the network traffic for those sites that don't truly need the reverse lookups done. It is also better for the end users because they don't have to suffer the extra latency that a lookup entails. Heavily loaded sites should leave this directive Off, since DNS lookups can take considerable amounts of time. The utility logresolve, compiled by default to the bin subdirectory of your installation directory, can be used to look up host names from logged IP addresses offline.

I think that minimizing latency is a good reason to leave HostnameLookups Off. We can do offline hostname resolution using logresolve, as suggested.
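
Usage is a one-liner; a sketch with illustrative file names (logresolve reads a log on stdin and writes it to stdout with IP addresses replaced by hostnames; -s records resolution statistics):

    logresolve -s resolve-stats.txt < /var/log/apache2/access.log > access.resolved.log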

jamesaoverton commented 8 years ago

Here is a quick-and-dirty analysis of unhandled requests from November 25 to 30:

https://docs.google.com/spreadsheets/d/171m26KaTkxJx_3Kf4nMiNTisINshdeUf9GgxTmW0Rcg/edit

kltm commented 8 years ago

Wow. In about a week there are on the order of 200k misses? Do you know where those top two are coming from? Because that's a lot of wrong happening incredibly fast.

jamesaoverton commented 8 years ago

@kltm: Yes, half of all traffic is due to #90.

cmungall commented 8 years ago

I suspect most of these come from some kind of systematic rewrite of IDs or URLs. Not sure it is a major worry.

cmungall commented 6 years ago

We should return to this; we said we'd do it as part of the services grant.

We need a sustainable system in the long run, but in the short term I'm OK with paying for a service that will relieve us of doing any work ourselves. We can then get a better handle on things and decide on a long-term course of action (which may be something low-cost, like static sites built from the Apache logs).

@kltm has some indirect familiarity with Splunk (mentioned above). The people who use it seem happy with it.

kltm commented 6 years ago

I would note that, institutionally, we have little knowledge of Splunk, beyond the fact that some of the MODs seem to like it. @cmungall commented on just dropping logs into S3. I think that, combined with one of the many static Apache log analyzer/analytics site builders, might make for a very hands-off and robust open system.

kltm commented 6 years ago

As examples: https://goaccess.io/, AWStats, and Webalizer (the last two are rather old, but the packages remain in Debian/Ubuntu).

jamesaoverton commented 6 years ago

My experience amounts to a lot of time wasted on Elasticsearch + Logstash + Kibana a few years ago. After that I took a quick look at GoAccess and liked what I saw.

I'm fine with either option, Splunk or S3 plus GoAccess; I don't have a lot of time to mess around. I don't think we need live results at this point.

The PURL server is a t2.micro instance running Debian. The logs should be pushed somewhere else for analysis (or pulled, I guess). @kltm Can you suggest a simple and robust solution?

kltm commented 6 years ago

@jamesaoverton Looking a bit more at GoAccess (https://goaccess.io/faq#html-report) and playing with it for ten minutes (it's a standard package in Ubuntu universe and has an upstream repo for newer versions), I cannot overstate how nice and easy to use it is.

My recommendation would be one of the following:

  1. All in

    • Install from the official repo (https://goaccess.io/download#official-repo); it might work on Debian, and the worst case is compiling from source, which is also very easy
    • run some variation of: goaccess /var/log/apache2/access.log --log-format=COMBINED -a -o /var/www/report.html
    • add the above command to cron (see the sketch after this comment)
    • declare victory
  2. AWS-lite

    • Drop logs into S3
    • Have a job somewhere download all the logs and do the above
    • Push the report to an S3 static site

The advantage of the latter is that the logs could better survive issues with the VM and with access.log.* getting too large (and it would keep things from being rotated out of existence). That said, getting 1 done would only take a few minutes and give people time to think about what they want.
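
For the record, option 1 amounts to something like this (a sketch; the schedule, paths, and file name are illustrative, and it assumes the goaccess package is installed):

    # /etc/cron.d/goaccess-report (hypothetical): rebuild the HTML report nightly at 03:00.
    0 3 * * * root goaccess /var/log/apache2/access.log --log-format=COMBINED -a -o /var/www/report.html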

jamesaoverton commented 6 years ago

I would rather do 2, because I don't want any unpredictable loads on the PURL server itself, even in the short term.

This is on my To Do list; I just haven't had time. If anyone else wants to take a stab at it, I can authorize SSH access to the server for you.
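
Roughly, option 2 could look like this (a sketch; the bucket names are hypothetical, and it assumes the AWS CLI and goaccess are available on whatever machine runs the job):

    # Pull all rotated logs from S3, feed them to goaccess on stdin ("-"),
    # and publish the report to an S3 static site.
    aws s3 sync s3://example-obo-purl-logs/ ./logs/
    zcat -f ./logs/access.log* | goaccess --log-format=COMBINED -a -o report.html -
    aws s3 cp report.html s3://example-obo-purl-reports/index.html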

kltm commented 6 years ago

@jamesaoverton We will be implementing 1 or 2 in the next week or so as we get ready to go live with our new production site. It's not all that hard, but if you're interested you may be able to reuse some of the bits that we put together.

kltm commented 6 years ago

@jamesaoverton Just an update on what we're doing (I've invited you to a currently private repo where we're doing some operations work for GO). We now have a nice system that uses the system's logrotate to push all Apache logs to S3 for later analysis; we also have a separate log-getter script, which would be used to run goaccess.io. It looks easy to us. The only thing we don't have is something to report when logs are not delivered.
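
The logrotate half of that could look roughly like the following (a sketch, not GO's actual configuration; the bucket name is hypothetical, and the stock apache2 postrotate reload is omitted for brevity):

    # /etc/logrotate.d/apache2-s3 (illustrative): after each rotation, copy the
    # just-rotated log to S3; with delaycompress, access.log.1 is still uncompressed.
    /var/log/apache2/access.log {
        daily
        rotate 14
        compress
        delaycompress
        postrotate
            aws s3 cp /var/log/apache2/access.log.1 s3://example-obo-purl-logs/access.log.$(date +%F)
        endscript
    }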

kltm commented 5 years ago

To make the steps concrete:

matentzn commented 4 years ago

Is there a timeline for this, or has it been postponed indefinitely? (Just for my information, so I can communicate this accurately.) Thanks all!

jamesaoverton commented 4 years ago

We promised to do this as part of the OBO Technical Services Grant, which runs out in a month or two. I'll try again to push it to the top of my To Do list.

kltm commented 3 years ago

Okay, I wanted to warm this ticket up again and introduce @abessiari, who will be helping us out with this. From https://github.com/OBOFoundry/purl.obolibrary.org/issues/63#issuecomment-457365282, the first thing we probably want to do is make sure that the logs are getting rotated out to S3. For that, we'll need to make sure there's an S3 bucket available, apply it to the PR, accept the PR, then roll out.

kltm commented 3 years ago

I think the open question here is: is it best to 1) apply this manually to the running instance as a one-off, or 2) understand the current deployment system and add to that (i.e. https://github.com/OBOFoundry/purl.obolibrary.org/pull/496)?

jamesaoverton commented 3 years ago

I'd be content with a one-off. I'd be delighted with an update to the deployment system.

kltm commented 3 years ago

Okay, I think the easiest and most productive thing would be to test the PR out using the instructions at https://github.com/OBOFoundry/purl.obolibrary.org#deployment on a mock deployment and a mock location. When we are able to make that work, we'll ping you.

kltm commented 3 years ago

@abessiari I think the next steps probably look like:

After that, we can maybe look a little at analysis of the logs.

abessiari commented 3 years ago

Sounds like a plan. Thanks. aes

kltm commented 3 years ago

Current work/convo here: https://github.com/OBOFoundry/purl.obolibrary.org/pull/747

kltm commented 3 years ago

Noting the existence of some work on (apache) log analysis here: https://github.com/abessiari/apache-logs (in tracker at https://github.com/berkeleybop/bbops/issues/15)

kltm commented 3 years ago

While this will still be referred to, I'm closing this in favor of the broader #754.