edgi-govdata-archiving / web-monitoring

Documentation and project-wide issues for the Website Monitoring project (a.k.a. "Scanner")
Creative Commons Attribution Share Alike 4.0 International

Automatically create annotations in database from old analyst spreadsheets #141

Closed Mr0grog closed 4 years ago

Mr0grog commented 5 years ago

While the web-monitoring-db project has the ability to store annotations (mostly free-form information from a human or bot about what exactly has changed between two versions of a page), the analyst team doesn’t currently make use of it. We’d like to start surfacing annotation information in the UI, and one way to do so is to import the annotations they currently make in spreadsheets into the database.

(There was some previous discussion on this in edgi-govdata-archiving/web-monitoring-db#61, but I’ve made this issue to be a bit fresher and more concise.)

Analysts currently have spreadsheets formatted with the following columns:

The biggest thing to note here is that, because the way we calculate the value for a lot of fields has changed over time, you should probably use either the “This Period - Side by Side” or “Latest to Base - Side by Side” columns to determine the page and version IDs to annotate. They will always be in the format:

https://monitoring.envirodatagov.org/page/ABC/DEF..GHI

Where ABC is the Page’s UUID and DEF..GHI is the change ID. To break that down a bit more, a change ID DEF..GHI indicates the change between the version with UUID DEF and the version with UUID GHI. Sometimes DEF will be missing, so a change ID could be ..GHI. That means the change between the version immediately preceding GHI and the version with UUID GHI.
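To make the URL format above concrete, here is a minimal sketch of pulling the page and version IDs out of a side-by-side link. The names `ChangeRef` and `parse_change_url` are hypothetical (they are not part of web-monitoring-db or web-monitoring-processing), and the sketch assumes the IDs are standard hyphenated UUIDs and that the URL carries no query string.

```python
import re
from typing import NamedTuple, Optional

# Assumption: page and version IDs are standard 8-4-4-4-12 hex UUIDs.
UUID = r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"
CHANGE_URL = re.compile(
    rf"/page/(?P<page>{UUID})/(?P<from_version>{UUID})?\.\.(?P<to_version>{UUID})/?$"
)


class ChangeRef(NamedTuple):
    page_id: str
    # None means "the version immediately preceding to_version_id".
    from_version_id: Optional[str]
    to_version_id: str


def parse_change_url(url: str) -> ChangeRef:
    """Split a side-by-side URL into its page UUID and change ID parts."""
    match = CHANGE_URL.search(url.strip())
    if match is None:
        raise ValueError(f"Not a side-by-side change URL: {url!r}")
    return ChangeRef(match["page"], match["from_version"], match["to_version"])
```

A change ID of the form `..GHI` comes back with `from_version_id` set to `None`, so the caller can decide how to resolve the preceding version.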

Over on the database/API side of things, we have:

We want to take each row from an analyst spreadsheet, look up the relevant Versions (or really, the relevant Change), and create an annotation with all the data from the last several columns of the sheet.
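As a rough sketch of that per-row step, assuming the sheet has been exported to CSV: the function below pairs each row's side-by-side URL with a dictionary of its non-empty annotation cells. `rows_to_annotations` is a hypothetical helper, and both the URL column and the annotation column names are passed in by the caller rather than guessed here. Looking up the Change and posting the annotation are left out.

```python
import csv
import io
from typing import Dict, Iterable, Iterator, Tuple


def rows_to_annotations(
    csv_text: str, url_column: str, annotation_columns: Iterable[str]
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (change_url, annotation) for rows that have both a URL and data."""
    columns = list(annotation_columns)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Cells can be missing (None) when a row is shorter than the header.
        url = (row.get(url_column) or "").strip()
        annotation = {}
        for column in columns:
            value = (row.get(column) or "").strip()
            if value:
                annotation[column] = value
        # Skip rows with nothing to annotate or no change to attach it to.
        if url and annotation:
            yield url, annotation
```

Rows with no URL or no filled-in annotation cells are skipped rather than raised as errors. That is a design choice for a bulk import and may not suit every sheet.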

I can see this written either as a Rake task inside web-monitoring-db or as a Python script using the tools in web-monitoring-processing’s db.py module:

Other notes and caveats:

mrotondo commented 5 years ago

Couple questions:

  1. The significance and priority fields mentioned in the issue description look like they're currently on Change rather than Annotation. Should I move them with a migration? And should I add accessors to Change so that existing reads/writes continue to work, or start reaching through to the change's annotation? I suppose that this is somewhat contingent on whether we expect a change to ever have multiple annotations, e.g. by multiple analysts writing up the same change?

  2. It sounds like this code would be a more natural fit in the ruby/wm-db codebase, given the author-setting issue you described as well as the likelihood that things like the annotation-schema-version you mentioned should probably not be API-visible (I imagine we will want to unify the public API version of the annotation). If you agree, I'm happy to do this on the ruby side, just want to confirm that I'm not missing some counterargument (since I know next to nothing about this codebase yet :)

Mr0grog commented 5 years ago
> 1. The significance and priority fields mentioned in the issue description look like they're currently on Change rather than Annotation. Should I move them with a migration? And should I add accessors to Change so that existing reads/writes continue to work, or start reaching through to the change's annotation? I suppose that this is somewhat contingent on whether we expect a change to ever have multiple annotations, e.g. by multiple analysts writing up the same change?

Ah! This might need some explanation. Changes aren’t meant to be writable, and the only information they store is a denormalized version of what’s in the annotations that are attached to them (i.e. it’s just a shortcut for easier access or database indexing). We do expect changes to have multiple annotations.

Changes have a current_annotation, which should never be written to by any object other than the change itself. It’s basically what you would get by merging all the annotations for that change together — the idea here is that this is a convenient summary of everything that anybody’s had to say about the change, even though you might still want to dig into individual annotations if you want to explore who said which things, or if anybody disagreed. See also this old discussion: edgi-govdata-archiving/web-monitoring-db#375
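A minimal sketch of that merging idea, assuming a shallow merge in which annotations are applied oldest first and later values win for any shared key. The real merge rules in web-monitoring-db may differ, and `merge_annotations` is a hypothetical name used only for illustration.

```python
from typing import Any, Dict, Iterable, Mapping


def merge_annotations(annotations: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Combine annotations into one summary dict (assumed oldest first)."""
    merged: Dict[str, Any] = {}
    for annotation in annotations:
        # Shallow merge: a later annotation overrides earlier keys wholesale.
        merged.update(annotation)
    return merged
```

The individual annotations stay untouched, so it is still possible to go back and see who said what.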

They also have significance and priority fields, which are, again, just shortcuts extracted from the annotations. If you post an annotation with one or both of those fields, they’re treated specially: each is validated to ensure it is a number between 0 and 1, and the change object that owns the annotation extracts it into a top-level field so that Postgres can index it and we can search on it later (e.g. give me all the changes with priority > 0.5). We didn’t think it was worth denormalizing those into special fields on annotation objects, too, since it was less likely they’d be needed for searching that way. In all cases, the canonical significance or priority value is the one in a particular annotation’s annotation field. Everything else is just a shortcut. Does that make sense?
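To illustrate the special handling, here is a sketch in Python of the two steps: validating the fields and copying them up to the change. The actual logic lives in the Ruby codebase and may differ. `validate_annotation` and `denormalized_fields` are hypothetical names, and treating the 0 to 1 range as inclusive is an assumption.

```python
from numbers import Real
from typing import Any, Dict, Mapping, Optional

SPECIAL_FIELDS = ("significance", "priority")


def validate_annotation(annotation: Mapping[str, Any]) -> None:
    """Raise ValueError if a special field is present but not a number in [0, 1]."""
    for field in SPECIAL_FIELDS:
        if field not in annotation:
            continue  # Both fields are optional; everything else is free-form.
        value = annotation[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        is_number = isinstance(value, Real) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            raise ValueError(f"{field} must be a number between 0 and 1, got {value!r}")


def denormalized_fields(current_annotation: Mapping[str, Any]) -> Dict[str, Optional[float]]:
    """Pick out the values a change would copy into its top-level fields."""
    return {field: current_annotation.get(field) for field in SPECIAL_FIELDS}
```

The canonical values remain in each annotation. The extracted copies exist only so a query like "priority > 0.5" can use an indexed column.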

> things like the annotation-schema-version you mentioned should probably not be API-visible (I imagine we will want to unify the public API version of the annotation)

Actually, I think that info should be API-visible. An annotation is meant to be any old JSON object (with the exception that priority and significance are treated specially), so any application we might write can feel free to stuff whatever info might be relevant in there. The database and API are mostly agnostic to its content. So the idea with that version field is that it would just be a signal to any other application reading an annotation back out that the annotation might best be displayed in a particular way.

For example, the older annotations from 2017 have totally different fields and formats, and a UI for exploring our annotations/changes would want to know how best to display a given annotation or how to present it for editing. A given field name might be best displayed with a dropdown or radio button list, so it might be helpful to have something like the type or version of the annotation so the UI knows how to treat it.
