datamade / dossier

Machine assisted dossiers
MIT License

Start from the evidence? #1

Open herdingbats opened 7 years ago

herdingbats commented 7 years ago

Thanks for throwing this out there. I've jotted down some notes, approaching this problem in two ways, the first from my disciplinary perspective as a historian. The other was to consider who's built this infrastructure already and what we can repurpose (the answer, not to be coy, is Facebook). But it makes sense to start from the beginning.

What's the fundamental unit of analysis we're dealing with? It isn't the judgement or the opinion but rather the piece of evidence. That piece of evidence is what we build our opinions and analyses on, and what we return to when we have new evidence or reason to revisit our opinions. That evidence needs to be preserved. We also need to preserve what we know about the evidence: what are its origins and provenance?

From that evidence, what facts can we infer? No, what sorts of facts can we infer? Events, entities, and relationships come to mind—but we can (can't we?) define events as time-delimited relationships, meaning that the facts we infer exist as a graph database over time. That graph is in turn mapped onto the evidence via another graph database; every node of this (meta)graph is an interpretation.

Here's the model:

**Evidence**
- Recorded
- Connected to other assertions via interpretation
- Metadata: source/provenance, time, type, etc.
- Examples: documents, databases, recordings, immediate notes (minimally interpreted), photos, video

**Interpretations**
- Explicit
- Machine- or human-made
- Confidence level?
- Timestamp
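A rough sketch of these two record types in Python (field names are my guesses at the shape above, not a settled schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class Evidence:
    """A preserved primary source; never edited, only interpreted."""
    id: str
    kind: str                    # document, database, recording, note, photo, video
    source: str                  # origin / provenance
    collected_at: datetime
    content_uri: str             # where the raw artifact is stored

@dataclass
class Interpretation:
    """An explicit assertion connecting evidence to inferred facts."""
    id: str
    evidence_ids: List[str]      # the evidence this reading rests on
    asserted_fact: str           # e.g. "John Doe gave $500 on 2009-10-11"
    made_by: str                 # machine or human interpreter
    confidence: Optional[float]  # None if not quantified
    timestamp: datetime
```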

**Entities** have characteristics, all of which are either:
- membership in classes
- relationships to other entities

What about scalar attributes, like income? Should income (for example) be recorded as a relationship ("A received $60,000 from B" or "A received $60,000 from unknown sources")? "Unknown" would be a special type of entity (see the sketch below).
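Here's what the "income as relationship" option might look like, with Unknown as a distinguished entity (all names are illustrative, not a proposal):

```python
# A distinguished entity standing in for any unidentified counterparty.
UNKNOWN = {"id": "entity:unknown", "kind": "special"}

def income_relationship(recipient_id, amount_usd, payer_id=None):
    """Record income as a relationship rather than a scalar attribute.
    An unattributed payment points at the special Unknown entity."""
    return {
        "kind": "received_payment",
        "source": payer_id or UNKNOWN["id"],
        "target": recipient_id,
        "amount_usd": amount_usd,
    }

# "A received $60,000 from B"
attributed = income_relationship("entity:A", 60_000, payer_id="entity:B")
# "A received $60,000 from unknown sources"
unattributed = income_relationship("entity:A", 60_000)
```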

Time is important: these memberships and relationships can have either known or unknown beginnings and endings. Even unknown ones carry implied termini ante/post quem: by asserting an existing fact from evidence at a given time, you're asserting that the fact came to exist before that time.
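That inference can be made mechanical. A minimal sketch, assuming each observation carries the date of the evidence it comes from (function and field names are illustrative):

```python
from datetime import date
from typing import Optional

def bound_start(asserted_start: Optional[date], observed_on: date) -> dict:
    """If the true beginning of a fact is unknown, the date we first
    observed it in evidence still bounds it: the fact must have come
    to exist on or before that date (terminus ante quem)."""
    if asserted_start is not None:
        return {"start": asserted_start, "exact": True}
    return {"start_no_later_than": observed_on, "exact": False}

# "A is paying off B ... and I learned of it on March 5, 2017"
print(bound_start(None, date(2017, 3, 5)))
# {'start_no_later_than': datetime.date(2017, 3, 5), 'exact': False}
```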

**Classes** are another type of entity, so they also have characteristics, which are also:
- membership in classes
- relationships to other entities

Generalities about classes are worth recording; they imply (in a fuzzy or probabilistic sense) relationships between entities (think status).

**Relationships** Everything is a relationship! (This is a graph-database way of thinking.)
- "A pays B $5,000 on March 5, 2017" is a time-delimited relationship.
- "A is paying off B" is a relationship of unknown time duration ("...and I learned of it on March 5, 2017" supplies a terminus ante quem for its beginning).
- "A was born on March 5, 1955" is the creation of an entity; it is also the moment A joins classes ("family", "social class", "gender", "race", "nationality", "religion", etc.).
- "Oceania is at war with Eastasia" is a relationship between two classes.
- Some relationships are well-defined (marriage), others are illegal (murdered, bribed), others are implicit. Still others are mechanical but important ("payment to account number #028495782348988").
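To make those examples concrete, here's one possible uniform encoding as edge records (a sketch, not a schema proposal):

```python
from datetime import date

def edge(source, target, kind, start=None, end=None, observed_on=None):
    """A relationship edge; None for start/end means unknown,
    observed_on supplies an implied terminus ante quem."""
    return {"source": source, "target": target, "kind": kind,
            "start": start, "end": end, "observed_on": observed_on}

edges = [
    # a time-delimited relationship
    edge("A", "B", "paid_5000_usd",
         start=date(2017, 3, 5), end=date(2017, 3, 5)),
    # unknown duration, but bounded by when we learned of it
    edge("A", "B", "paying_off", observed_on=date(2017, 3, 5)),
    # the creation of an entity is also a set of class memberships
    edge("A", "class:family-doe", "member_of", start=date(1955, 3, 5)),
    # a relationship between two classes
    edge("class:oceania", "class:eastasia", "at_war_with"),
]
```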

Wait, you mentioned Facebook? Let's not reinvent the wheel; who has done this work already? This is, essentially, the knowledge model of Facebook. It infers relationships and characteristics of entities from the primary sources (posts/shares) and the interactions among the entities around them. Here's a decent map: https://labs.rs/en/facebook-algorithmic-factory-immaterial-labour-and-data-harvesting/

But I don't want to build Facebook! (Or do I?) OK, what's essential here? We could map all of human knowledge this way, or we could start with what we really want to do. At first blush, it seems like there are a few main things that a reporter could use this sort of dossier to discover and prove:
- Unusual relationships ("A pays B, but no one else in A's classes pays anyone else in B's classes")
- Undiscovered patterns ("Everyone in class A pays everyone in class B")
- Timing of relationships and events (this is key to asserting causality)
- Pervasiveness of relationships (e.g. another place where structural racism crops up)
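The first of these is already a computable query over the edges-and-memberships sketch above. A rough pure-Python illustration of the "unusual relationships" test (a real system would push this into the graph store; all names are mine):

```python
def members(memberships, cls):
    """All entities recorded as members of a class."""
    return {ent for ent, classes in memberships.items() if cls in classes}

def is_unusual_payment(edges, memberships, payer, payee):
    """Flag 'A pays B' as unusual when no one else in A's classes
    pays anyone else in B's classes."""
    payer_peers = set().union(*(members(memberships, c)
                                for c in memberships[payer])) - {payer}
    payee_peers = set().union(*(members(memberships, c)
                                for c in memberships[payee])) - {payee}
    return not any(e["kind"] == "paid"
                   and e["source"] in payer_peers
                   and e["target"] in payee_peers
                   for e in edges)

memberships = {"A": {"lobbyists"}, "C": {"lobbyists"},
               "B": {"aldermen"}, "D": {"aldermen"}}
edges = [{"kind": "paid", "source": "A", "target": "B"}]
print(is_unusual_payment(edges, memberships, "A", "B"))  # True: no peer pays a peer
```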

And what's the interface? Imagine a tool that would have helped with some representatively complex stories...

fgregg commented 7 years ago

Thank you.

I think you are making the right type of move to separate "evidence" and "interpretations." I put them in quotes because it's often near impossible to separate the two. That said, I think there is often the hierarchical dependency you talk about. Right now, I like the idea of "warrants".

Using a campaign finance example:

Claim 1: A clerk wrote that a person with the name "John Doe" has an address at "1600 Pennsylvania Avenue" and gave "$500" on "October 11, 2009".
Warrant: This is based on our interpretation of the meaning of the column names and the entries in a file with this name.

Claim 2: The person referenced in Claim 1 is this person.
Warrant: This is based on our machine learning algorithm and clerical review.

Claim 3: This person has an unusual pattern of giving.
Warrant: This is based upon the giving attributed to this person in Claim 2, and on other claims at the same level for this and other persons.
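A sketch of how these orders of claims could stay separate but linked, with each claim carrying its warrant and explicit references to the claims it depends on (the structure follows the example above; names are illustrative):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Claim:
    id: str
    text: str
    warrant: str                  # why we believe this claim
    supported_by: List[str] = field(default_factory=list)  # upstream claim ids

claims = [
    Claim("c1",
          'A clerk recorded "John Doe", "1600 Pennsylvania Avenue", '
          '"$500", "October 11, 2009".',
          warrant="Our interpretation of the column names and entries "
                  "in the filed report."),
    Claim("c2",
          "The person referenced in c1 is this specific person.",
          warrant="Machine-learning match plus clerical review.",
          supported_by=["c1"]),
    Claim("c3",
          "This person has an unusual pattern of giving.",
          warrant="Aggregation over the giving attributed in c2 and "
                  "parallel claims about comparable donors.",
          supported_by=["c2"]),
]
# Because dependencies are explicit, revising c2 (say, a bad match)
# automatically casts doubt on c3 but leaves c1 intact.
```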

I think there can be a lot of advantage in keeping these orders of claims separate.

herdingbats commented 7 years ago

"Warrant" is a great description of that type of dependency.

Thinking down to building useful tools, is it more helpful to start from the network of entities and relationships (the whiteboard/conspiracy wall) or from the evidence-storage (notebooks and file folders)? Which end is the MVP? (Here's where we need to talk to journalists; they've got the pain point + deadline.)