Automatically create annotations in database from old analyst spreadsheets

Mr0grog commented 5 years ago

While the web-monitoring-db project has the ability to store annotations (mostly free-form information from a human or bot about what exactly has changed between two versions of a page), the analyst team doesn’t currently make use of it. We’d like to start surfacing annotation information in the UI, and one way to do so is to import the annotations they currently make in spreadsheets into the database.

(There was some previous discussion on this in edgi-govdata-archiving/web-monitoring-db#61, but I’ve made this issue to be a bit fresher and more concise.)

Analysts currently have spreadsheets formatted with the following columns:

Index: You should ignore this column, it’s just an incrementing number that some analysts use to keep track of their work.
Unique ID: The way this is generated has changed over time. In older rows, it’s a totally random identifier, and in newer rows it’s the UUID of the latest version in the change. Since it’s indicated different things over time, it may not be great to rely on.
Output Date/Time: The date the spreadsheet row was generated from our database.
Agency: The government agency that manages the page in question
Site Name: The name of the site that the page belongs to
Page Name: The title of the page
URL: The original URL of the page, e.g. https://epa.gov/statelocalenergy
Page View URL: URL to view the page in either our UI or the source system. (In versions imported from Versionista over a year ago, this was the URL of the source page in Versionista instead of in our UI)
This Period - Side by Side: URL to the view the change between the version at end of the previous analysis period (usually one week ago) and the latest version when the row was generated.
Latest to Base - Side by Side: URL to the view the change between the very first version we recorded of this page and the latest version when the row was generated.
Date Found - Latest: Date the latest version was captured (i.e. the “latest” in the “This Period” or “Latest to Base” column).
Date Found - Base: Date the very first version we recorded was captured.
Diff length: Number of characters in page’s source that changed.
Diff hash: SHA-256 of source code diff.
Text diff length: Number of characters in page’s readable text that changed.
Text diff hash: SHA-256 of readable text.
# of Changes this Week: Number of versions between the previous period and the end of this period (i.e. between the two versions in “This Period - Side by Side” including the latest)
Priority (algorithm): Automatically calculated priority for the change between the two versions in “This Period - Side by Side”
Who Found This?: The analyst who judged this change
Importance?: low/medium/high. This should translate to significance in the database annotation, which is a number between 0-1. There’s a lot of room for interpretation here, but I’m thinking low = 0.5, medium = 0.75, high = 1.0. (This column is only in the important changes sheet, hence starting at 0.5 even a low importance is still somewhat significant just by virtue of being in this particular spreadsheet.)
The rest of these fields should just be part of the annotation:
- Language alteration:
- Content change/addition/removal:
- Link change/addition/removal:
- Repeated Change across many pages or a domain:
- Alteration within sections of a webpage:
- Alteration, removal, or addition of entire section(s) of a webpage:
- Alteration, removal, or addition of an entire webpage or document:
- Overhaul, removal, or addition of an entire website:
- Alteration, removal, or addition of datasets:
- Is this primarily a content or access change (or both)?:
- Brief Description:
- Topic 1:
- Subtopic 1a:
- Subtopic 1b:
- Topic 2:
- Subtopic 2a:
- Subtopic 2b:
- Topic 3:
- Subtopic 3a:
- Subtopic 3b:
- Any keywords to monitor (e.g. for term analyses)?:
- Further Notes:
- Ask/tell other working groups?:

The biggest thing to note here is that, because the way we calculate the value for a lot of fields has changed over time, you should probably use either the “This Period - Side by Side” or “Latest to Base - Side by Side” columns to determine the page and version IDs to annotate. They will always be in the format:

https://monitoring.envirodatagov.org/page/ABC/DEF..GHI

Where ABC is the Page’s UUID and DEF..GHI is the change ID. To break that down a bit more, a change ID DEF..GHI indicates the change between the version with UUID DEF and the version with UUID GHI. Sometimes DEF will be missing, so a change ID could be ..GHI. That means the change between the version immediately preceding GHI and the version with UUID GHI.

Over on the database/API side of things, we have:

Pages, which own many…
Versions
Changes, which tie together any two Versions, and which own many…
Annotations, which have an author, a priority (number from 0-1), a significance (number from 0-1), and a JSON blob of arbitrary data.

We want to take each row from an analyst spreadsheet, look up the relevant Versions (or really, the relevant Change), and create an annotation with all the data from the last several rows of the sheet.

I can see this written either as a Rake task inside web-monitoring-db or as a Python script using the tools in web-monitoring-processing’s db.py module:

If written inside web-monitoring-db, it has direct access to the database. You can look up a change with Change.find_by_api_id('DEF..GHI'). It’ll throw a RecordNotFound error if there is a problem with the IDs.

Then you can create the annotation with:
```
change_instance.annotate({
  hash: 'with',
  any: 'data',
  you: 'want'
}, User.find(user_id))
```

If written as a Python script, you’ll have to use the public API to create the annotation:

POST /api/v0/pages/ABC/changes/DEF..GHI/annotations

Or using the Python DB wrapper:

db_client = db.Client.from_env()
db_client.add_annotation(
    annotation={'dict': 'with', 'any': 'data', 'you': 'want'},
    page_id='ABC',
    from_version_id='DEF',
    to_version_id='GHI')

Other notes and caveats:

Dramatically older sheets have a different series of columns at the end, so:
- We’ll eventually need parsing logic for a spreadsheet with the old columns. For now, though, keep it simple and start with the current formatting that is listed in this description.
- We should include some sort of annotation_version field or something in the annotation’s data so tools reading the data back out later know how to treat it.
The significance field in the annotation will have to be treated differently for individual analyst sheets vs. the “important changes” sheet:
- For individual sheets, probably just don’t set it.
- For the important changes sheet, set it based on the “importance?” column (see notes above).
The important changes sheet has a “who found this” column that individual sheets do not. Ideally, we’d find a way to set the correct author for the annotation, but that’s a bit complicated so we should skip it for the moment.
- It’s is easier done from a Ruby script that has direct access to the database (the API’s business logic sets the author based on the user who is POSTing to the API).
- For an external script, we’ll need to create a new permission that allows certain users (i.e. the one running this script) to set arbitrary authors on annotations, and we’ll need a way to look up the correct author to use. All that might be too complicated to be worthwhile.

mrotondo commented 5 years ago

Couple questions:

The significance and priority fields mentioned in the issue description look like they're currently on Change rather than Annotation. Should I move them with a migration? And should I add accessors to Change so that existing reads/writes continue to work, or start reaching through to the change's annotation? I suppose that this is somewhat contingent on whether we expect a change to ever have multiple annotations, e.g. by multiple analysts writing up the same change?
It sounds like this code would be a more natural fit in the ruby/wm-db codebase, given the author-setting issue you described as well as the likelihood that things like the annotation-schema-version you mentioned should probably not be API-visible (I imagine we will want to unify the public API version of the annotation). If you agree, I'm happy to do this on the ruby side, just want to confirm that I'm not missing some counterargument (since I know next to nothing about this codebase yet :)

Mr0grog commented 5 years ago

The significance and priority fields mentioned in the issue description look like they're currently on Change rather than Annotation. Should I move them with a migration? And should I add accessors to Change so that existing reads/writes continue to work, or start reaching through to the change's annotation? I suppose that this is somewhat contingent on whether we expect a change to ever have multiple annotations, e.g. by multiple analysts writing up the same change?

Ah! This might need some explanation. Changes aren’t meant to be writable, and the only information they store is a denormalized version of what’s in the annotations that are attached to them (i.e. it’s just a shortcut for easier access or database indexing). We do expect changes to have multiple annotations.

Changes have a current_annotation, which should never be written to by any object other than the change itself. It’s basically what you would get by merging all the annotations for that change together — the idea here is that this is a convenient summary of everything that anybody’s had to say about the change, even though you might still want to dig into individual annotations if you want to explore who said which things, or if anybody disagreed. See also this old discussion: edgi-govdata-archiving/web-monitoring-db#375

They also have significance and priority fields, which are, again, just shortcuts extracted from the annotations. If you post an annotation with one or both of those fields, it’s treated specially: they are validated to ensure they are a number between 0 and 1, and the change object that owns the annotation extracts it into a top level field so that it can be easily indexed by Postgres so we can search on it later (e.g. give me all the changes with priority > 0.5). We didn’t think it was worth denormalizing those into special fields on annotation objects, too, since it was less likely they’d be needed for searching that way. In all cases, the canonical significance or priority value is the one in a particular annotation’s annotation field. Everything else is just a shortcut. Does that make sense?

things like the annotation-schema-version you mentioned should probably not be API-visible (I imagine we will want to unify the public API version of the annotation)

Actually, I think that info should be API-visible. An annotation is meant to be any old pile of JSON object (with the exception that priority and significance are treated specially), so any application we might write can feel free to stuff whatever info might be relevant in there. The database and API are mostly agnostic to its content. So the idea with that version field is that it would just be a signal to any other application reading an annotation back out that might best be displayed in a particular way.

For example, the older annotations from 2017 have totally different fields and formats, and a UI for exploring our annotations/changes would want to know how best to display a given annotation or how to present it for editing. A given field name might be best displayed with a dropdown or radio button list, so it might be helpful to have something like the type or version of the annotation so the UI knows how to treat it.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in seven days if no further activity occurs. If it should not be closed, please comment! Thank you for your contributions.

edgi-govdata-archiving / web-monitoring

Automatically create annotations in database from old analyst spreadsheets #141