ERDDAP / erddap

ERDDAP is a scientific data server that gives users a simple, consistent way to download subsets of gridded and tabular scientific datasets in common file formats and make graphs and maps. ERDDAP is a Free and Open Source (Apache and Apache-like) Java Servlet from NOAA NMFS SWFSC Environmental Research Division (ERD).
Creative Commons Zero v1.0 Universal

Record requests in a structured format #118

Open SarahSidders opened 8 months ago

SarahSidders commented 8 months ago

ERDDAP currently records requests in a plain text format. We propose changing ERDDAP to record these details in a structured format, so that the ERDDAP community can more easily capture this data for reporting purposes.

We have created a branch that details what the changes could be to achieve this structure.

• RequestDetails.java - This new class is used to create an object for the request data
• UsageMetrics.java - This new class is used to process the request details object into JSON format
• Erddap.java - Updated this class to include the RequestDetails object and call the UsageMetrics sendUsageMetrics method
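
For illustration only, a request-details object that serialises itself to one JSON object per line might look like the sketch below. This is not the code in the pull request: the class name `RequestDetailsSketch`, the field names (`time`, `ip`, `url`, `status`) and the hand-rolled JSON writer are all assumptions made to keep the example dependency-free.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; names and fields are assumptions, not those of the pull request. */
final class RequestDetailsSketch {
    private final Instant time;
    private final String ipAddress;
    private final String requestUrl;
    private final int status;

    RequestDetailsSketch(Instant time, String ipAddress, String requestUrl, int status) {
        this.time = time;
        this.ipAddress = ipAddress;
        this.requestUrl = requestUrl;
        this.status = status;
    }

    /** Serialises the record as a single-line JSON object with a fixed key order. */
    String toJson() {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("time", time.toString()); // ISO-8601 timestamp in UTC
        fields.put("ip", ipAddress);
        fields.put("url", requestUrl);
        fields.put("status", status);

        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (sb.length() > 1) {
                sb.append(',');
            }
            sb.append('"').append(e.getKey()).append("\":");
            Object value = e.getValue();
            if (value instanceof Number) {
                sb.append(value); // numbers are written unquoted
            } else {
                sb.append('"').append(escape(String.valueOf(value))).append('"');
            }
        }
        return sb.append('}').toString();
    }

    /** Escapes quotes, backslashes and control characters so the value is a valid JSON string. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }
}
```

Writing one JSON object per line keeps the log easy to grep while still letting reporting tools parse each line independently. A real implementation would more likely use an existing JSON library than the manual escaping shown here.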

This is a proposed idea, and we're happy for any different approaches/changes that can be made to improve this for community use.

Please see the pull request: https://github.com/ERDDAP/erddap/pull/117

ChrisJohnNOAA commented 8 months ago

Hi, thanks for the issue and pull request. Could you explain how you want to use this so I can make sure I understand the intentions?

After a quick look at the pull request, I have a few questions.

  1. Why are you only logging status===200 requests?
  2. Should this new log be an option that can be enabled by a flag?
  3. Is there a reason to log both the old and new logs?
  4. Any reason "loggedInAs" is not included in the new log?
  5. Especially since a structured log format implies there will be dependencies on these logs, can you add tests to verify the logs are in the proper format?

BobSimons commented 8 months ago

Just curious: how is this different/better than the Tomcat request log which is also in a standard format?

thogar-computer commented 8 months ago

A little background: we needed to see what had been downloaded, going back as far as we could. Our email setup hadn't been working due to some internal IT changes, so we fell back on ERDDAP's logging system. A grep did fetch the information required, but only back about a year.

All of that work led us to realise we needed a good way to capture what was being requested. Currently we are investigating a way to duplicate traffic (i.e. monitor requests before they reach ERDDAP) - this is more of a challenge once we add in HTTPS connections rather than HTTP, and response tracking.

Given this is being driven from our remit to deliver data as a national data center (plus funders want to know), we thought that other institutes might want a simple way to see what erddap was delivering out to their communities.

@BobSimons I see this as better because it isn't dependent on Tomcat's setup. Without knowing every institute's setup, it seems plausible that some places have access to the ERDDAP directories but not to Tomcat's logging.

@ChrisJohnNOAA These are all very good points, and most of my answers come down to "because we need that info". More details below:

As stated, this work is very much something we wanted to discuss with the community to see if it is of interest to anyone else. As we require this information for reporting purposes, we will continue investigating the traffic routing. If the idea in this issue is of use to the wider community, then we'll switch to this way of collecting the information in the future.

ChrisJohnNOAA commented 8 months ago

Even if you only want successfully downloaded/accessed data, you probably want to include status codes besides 200. Specifically, 206 (partial content) and redirects (300-308) might be useful. On a broader note, I'd be interested in feedback from others on whether things like error requests would be useful here.
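
As a rough sketch of that filter (the class and method names here are hypothetical, not part of ERDDAP or the pull request), the check could accept 200, 206 and the 300-308 range instead of only 200:

```java
/** Hypothetical helper; the name and its placement are assumptions for illustration. */
final class LoggableStatus {
    private LoggableStatus() {}

    /** Returns true for 200 (OK), 206 (partial content) and the 300-308 redirect range. */
    static boolean shouldLog(int status) {
        return status == 200
                || status == 206
                || (status >= 300 && status <= 308);
    }
}
```

If error requests turn out to be useful too, the same method is the single place to widen the accepted range.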

As for changes to the existing logging, my recommendation would be to have a setting that controls which logs are generated. Most likely it should default to the current logging implementation, with options for the new structured format and for both together. Though I'd also appreciate feedback from others running ERDDAP servers on what they'd like to see here.
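
One possible shape for such a setting, sketched under assumptions: the enum name `RequestLogMode` and the string values are made up for this example, and no such setting exists in ERDDAP today. Missing or unrecognised values fall back to the current plain-text behaviour.

```java
import java.util.Locale;

/** Hypothetical setting; the name and values are assumptions, not an existing ERDDAP option. */
enum RequestLogMode {
    PLAIN(true, false),      // current behaviour: plain-text log only
    STRUCTURED(false, true), // new structured log only
    BOTH(true, true);        // write both logs

    final boolean writesPlain;
    final boolean writesStructured;

    RequestLogMode(boolean writesPlain, boolean writesStructured) {
        this.writesPlain = writesPlain;
        this.writesStructured = writesStructured;
    }

    /** Parses a configuration value, defaulting to PLAIN when it is missing or unrecognised. */
    static RequestLogMode fromSetting(String value) {
        if (value == null || value.trim().isEmpty()) {
            return PLAIN;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return PLAIN;
        }
    }
}
```

Defaulting to `PLAIN` means existing installations would see no change unless an administrator opts in.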

As for testing, my preference would be a JUnit-style unit test that specifically tests the logging capabilities. However, since I don't currently have an example JUnit test, I'd be happy with any test.