Closed Mr0grog closed 4 years ago
Couple questions:
- The `significance` and `priority` fields mentioned in the issue description look like they're currently on Change rather than Annotation. Should I move them with a migration? And should I add accessors to Change so that existing reads/writes continue to work, or start reaching through to the change's annotation? I suppose that this is somewhat contingent on whether we expect a change to ever have multiple annotations, e.g. by multiple analysts writing up the same change?
- It sounds like this code would be a more natural fit in the ruby/wm-db codebase, given the author-setting issue you described as well as the likelihood that things like the `annotation-schema-version` you mentioned should probably not be API-visible (I imagine we will want to unify the public API version of the annotation). If you agree, I'm happy to do this on the ruby side, just want to confirm that I'm not missing some counterargument (since I know next to nothing about this codebase yet :)
> The `significance` and `priority` fields mentioned in the issue description look like they're currently on Change rather than Annotation. Should I move them with a migration? And should I add accessors to Change so that existing reads/writes continue to work, or start reaching through to the change's annotation? I suppose that this is somewhat contingent on whether we expect a change to ever have multiple annotations, e.g. by multiple analysts writing up the same change?
Ah! This might need some explanation. Changes aren’t meant to be writable, and the only information they store is a denormalized version of what’s in the annotations that are attached to them (i.e. it’s just a shortcut for easier access or database indexing). We do expect changes to have multiple annotations.
Changes have a `current_annotation`, which should never be written to by any object other than the change itself. It’s basically what you would get by merging all the annotations for that change together. The idea here is that this is a convenient summary of everything that anybody’s had to say about the change, even though you might still want to dig into individual annotations if you want to explore who said which things, or if anybody disagreed. See also this old discussion: edgi-govdata-archiving/web-monitoring-db#375
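To make the "merging all the annotations together" idea concrete, here is a minimal sketch in Python. It is not the actual web-monitoring-db code: the shallow, oldest-first merge where later annotations win is an assumption, and `merge_annotations` plus the sample field values are hypothetical.

```python
def merge_annotations(annotations):
    """Shallow-merge annotation dicts in order; later values win.

    Assumption: annotations are passed oldest-first, and a plain
    shallow merge is enough for illustration.
    """
    merged = {}
    for annotation in annotations:
        merged.update(annotation)
    return merged


# Hypothetical annotations from two analysts on the same change.
annotations = [
    {"significance": 0.5, "notes": "Removed a paragraph"},
    {"significance": 0.75, "priority": 0.5},
]
current_annotation = merge_annotations(annotations)
```

The individual annotations stay untouched, so you can still go back and see who said what.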
They also have `significance` and `priority` fields, which are, again, just shortcuts extracted from the annotations. If you post an annotation with one or both of those fields, it’s treated specially: they are validated to ensure they are a number between 0 and 1, and the change object that owns the annotation extracts it into a top level field so that it can be easily indexed by Postgres so we can search on it later (e.g. give me all the changes with `priority > 0.5`). We didn’t think it was worth denormalizing those into special fields on annotation objects, too, since it was less likely they’d be needed for searching that way. In all cases, the canonical significance or priority value is the one in a particular annotation’s `annotation` field. Everything else is just a shortcut. Does that make sense?
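Roughly, the special handling described above could be sketched like this in Python. This is only an illustration of the rule (a number between 0 and 1, copied up to a top-level shortcut), not the real Ruby implementation; `validate_annotation` and `extract_shortcuts` are hypothetical names.

```python
from numbers import Real

SPECIAL_FIELDS = ("significance", "priority")


def validate_annotation(annotation):
    """Raise ValueError if a special field is not a number in [0, 1]."""
    for field in SPECIAL_FIELDS:
        if field not in annotation:
            continue
        value = annotation[field]
        # bool is technically a number in Python, so reject it explicitly.
        is_number = isinstance(value, Real) and not isinstance(value, bool)
        if not is_number or not 0 <= value <= 1:
            raise ValueError(
                f"{field} must be a number between 0 and 1, got {value!r}"
            )


def extract_shortcuts(annotation):
    """Return the denormalized values a change would store for indexing."""
    validate_annotation(annotation)
    return {field: annotation.get(field) for field in SPECIAL_FIELDS}
```

Everything else in the annotation is passed through untouched; only these two fields get checked and copied.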
> things like the `annotation-schema-version` you mentioned should probably not be API-visible (I imagine we will want to unify the public API version of the annotation)
Actually, I think that info should be API-visible. An annotation is meant to be any old JSON object (with the exception that `priority` and `significance` are treated specially), so any application we might write can feel free to stuff whatever info might be relevant in there. The database and API are mostly agnostic to its content. So the idea with that version field is that it would just be a signal to any other application reading an annotation back out that it might best be displayed in a particular way.
For example, the older annotations from 2017 have totally different fields and formats, and a UI for exploring our annotations/changes would want to know how best to display a given annotation or how to present it for editing. A given field name might be best displayed with a dropdown or radio button list, so it might be helpful to have something like the type or version of the annotation so the UI knows how to treat it.
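As a sketch of how a UI might use such a version field, here is some Python that picks a renderer based on it. Everything specific here is an assumption for illustration: the `annotation_version` key, the version value `"1"`, the `notes` field, and the renderer functions are all hypothetical.

```python
def render_legacy(annotation):
    # Assumption: annotations without a known version are shown
    # as plain key/value pairs, since we can't guess their format.
    return [f"{key}: {value}" for key, value in sorted(annotation.items())]


def render_v1(annotation):
    # Assumption: a hypothetical "version 1" annotation has fields
    # we know how to present nicely.
    lines = []
    if "significance" in annotation:
        lines.append(f"Significance: {annotation['significance']:.0%}")
    if "notes" in annotation:
        lines.append(f"Notes: {annotation['notes']}")
    return lines


RENDERERS = {"1": render_v1}


def render(annotation):
    """Choose a renderer from the annotation's version, if it has one."""
    version = str(annotation.get("annotation_version", ""))
    return RENDERERS.get(version, render_legacy)(annotation)
```

The database and API never need to look at the version; only the application reading the annotation back out does.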
While the web-monitoring-db project has the ability to store annotations (mostly free-form information from a human or bot about what exactly has changed between two versions of a page), the analyst team doesn’t currently make use of it. We’d like to start surfacing annotation information in the UI, and one way to do so is to import the annotations they currently make in spreadsheets into the database.
(There was some previous discussion on this in edgi-govdata-archiving/web-monitoring-db#61, but I’ve made this issue to be a bit fresher and more concise.)
Analysts currently have spreadsheets formatted with the following columns:
`significance` in the database annotation, which is a number between 0-1. There’s a lot of room for interpretation here, but I’m thinking `low = 0.5, medium = 0.75, high = 1.0`. (This column is only in the important changes sheet, hence starting at `0.5`: even a low importance is still somewhat significant just by virtue of being in this particular spreadsheet.)
The biggest thing to note here is that, because the way we calculate the value for a lot of fields has changed over time, you should probably use either the “This Period - Side by Side” or “Latest to Base - Side by Side” columns to determine the page and version IDs to annotate. They will always be in the format:
Where `ABC` is the Page’s UUID and `DEF..GHI` is the change ID. To break that down a bit more, a change ID `DEF..GHI` indicates the change between the version with UUID `DEF` and the version with UUID `GHI`. Sometimes `DEF` will be missing, so a change ID could be `..GHI`. That means the change between the version immediately preceding `GHI` and the version with UUID `GHI`.
Over on the database/API side of things, we have:
We want to take each row from an analyst spreadsheet, look up the relevant Versions (or really, the relevant Change), and create an annotation with all the data from the last several rows of the sheet.
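The two pure-data steps here (splitting a change ID and mapping importance to `significance`) can be sketched in Python without touching the database or API. The `low = 0.5, medium = 0.75, high = 1.0` mapping is the one suggested above for the important changes sheet; the `importance` column key and both function names are hypothetical.

```python
# Mapping suggested above for the "important changes" sheet.
IMPORTANCE_TO_SIGNIFICANCE = {"low": 0.5, "medium": 0.75, "high": 1.0}


def parse_change_id(change_id):
    """Split 'DEF..GHI' into (from_version_id, to_version_id).

    For '..GHI' the first element is None, meaning "the version
    immediately preceding GHI".
    """
    from_id, separator, to_id = change_id.partition("..")
    if not separator or not to_id:
        raise ValueError(f"Not a change ID: {change_id!r}")
    return (from_id or None, to_id)


def row_to_annotation(row):
    """Build annotation data from one spreadsheet row (a dict).

    Assumption: the row has an 'importance' key holding
    'low', 'medium', or 'high' (hypothetical column name).
    """
    annotation = {}
    importance = row.get("importance", "").strip().lower()
    if importance:
        annotation["significance"] = IMPORTANCE_TO_SIGNIFICANCE[importance]
    return annotation
```

For example, `parse_change_id("..GHI")` gives `(None, "GHI")`, which leaves it to the database or API to resolve the preceding version.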
I can see this written either as a Rake task inside web-monitoring-db or as a Python script using the tools in web-monitoring-processing’s db.py module:
If written inside web-monitoring-db, it has direct access to the database. You can look up a change with `Change.find_by_api_id('DEF..GHI')`. It’ll throw a `RecordNotFound` error if there is a problem with the IDs.
Then you can create the annotation with:
If written as a Python script, you’ll have to use the public API to create the annotation:
Or using the Python DB wrapper:
Other notes and caveats:
- `annotation_version` field or something in the annotation’s data so tools reading the data back out later know how to treat it.
- `significance` field in the annotation will have to be treated differently for individual analyst sheets vs. the “important changes” sheet:
- `author` for the annotation, but that’s a bit complicated so we should skip it for the moment.