Vision
My vision for this project is to create a lightweight, hostable knowledge repository on the history of women in Berlin and the places named for them. With this in mind, I have a few key goals:
That it should be easily and inexpensively hosted;
That people can contribute reviewable knowledge to it;
That we can use data to quantify the comparative presence of men vs. not-men in public history;
That we can incorporate mapping tools to make it compelling.
Given this, the first objective is to figure out how to distill Berlin's 8000+ public place names into a workable form and to design a flexible data model on which everything else can be built. Some of the key factors bearing on these goals are discussed below.
Lightweight, hostable solution
My vision is to use a static site generator (SSG) to create a lightweight, easily clonable solution. SSG hosting is inexpensive and easy to set up. The general SSG workflow looks something like this: generate source material > compile it with SSG templates > host the compiled output as plain HTML.
I have built SSG sites that can be hosted for roughly 1 euro a month on Amazon S3, and I would like to replicate that here. Most SSG sites use Markdown (md) files as source material. These often contain YAML frontmatter; the SSG uses templates to convert each md file into HTML, building the necessary path references for images and other assets along the way.
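As a purely illustrative sketch, the source-material step could emit one Markdown file per person, with structured fields in YAML frontmatter and narrative content below it. The field names here (name, born, died, places) are placeholder assumptions, not a settled schema:

```python
# Illustrative source-material writer; the schema is an assumption.
from pathlib import Path

def write_entry(person: dict, out_dir: Path) -> Path:
    """Render one person record as Markdown with YAML frontmatter."""
    out_dir.mkdir(parents=True, exist_ok=True)
    places = "\n".join(f"  - {p}" for p in person["places"])
    frontmatter = (
        "---\n"
        f"name: {person['name']}\n"
        f"born: {person['born']}\n"
        f"died: {person['died']}\n"
        f"places:\n{places}\n"
        "---\n"
    )
    path = out_dir / f"{person['slug']}.md"
    path.write_text(frontmatter + person["body"], encoding="utf-8")
    return path

write_entry(
    {
        "slug": "melli-beese",
        "name": "Melli Beese",
        "born": 1886,
        "died": 1925,
        "places": ["Melli-Beese-Straße", "Amelie-Beese-Zeile"],
        "body": "\nGermany's first licensed female pilot.\n",
    },
    Path("content/people"),
)
```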
For this project, we will need steps upstream ("to the left") of generating the source material: namely, extracting street names and sanitizing the data before any source material can be generated. This is discussed below. One drawback of static sites is that they are, by definition, not dynamic, which means a build pipeline will have to be integrated. This is not difficult to accomplish using AWS Lambda and webhooks.
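For instance, a minimal sketch, assuming the site is compiled by an AWS CodeBuild project: a Lambda function behind a webhook endpoint only needs to start the build. The project name below is hypothetical.

```python
# Minimal Lambda handler: any incoming webhook triggers a site rebuild.
# "berlin-places-site" is a hypothetical CodeBuild project name.
import json
import boto3

codebuild = boto3.client("codebuild")

def handler(event, context):
    build = codebuild.start_build(projectName="berlin-places-site")
    return {
        "statusCode": 202,
        "body": json.dumps({"build_id": build["build"]["id"]}),
    }
```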
User-contributable
Ideally, this knowledge base will be contributed to by many people in a peer-reviewable way. The process of extracting place names and sanitizing the data will be largely manual. I estimate that 80% or more of the place names can be extracted automatically using basic pattern-matching techniques. The remaining names will require either manual intervention or more complex tooling. Because the data is of a human-digestible size, it makes more sense to use manual intervention than a complex scaffold of statistical tools.
By way of example, many streets named after people take the form of, e.g., Hannah-Karminski-Straße. However, this is not universal. It is simple enough to write a more comprehensive set of rules to capture, e.g., Kopernikusstraße, but at some point identifying all the heuristics becomes more work than simply doing the task manually to begin with. A first pass at these heuristics is sketched below.
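A hedged sketch of that first pass; the patterns are assumptions, and anything they miss falls through to manual review:

```python
# First-pass extraction heuristics for person names in street names.
import re

# Hyphenated person names: "Hannah-Karminski-Straße", "Amelie-Beese-Zeile".
HYPHENATED = re.compile(r"^(?P<name>.+)-(Straße|Platz|Weg|Zeile|Ufer|Allee)$")
# Single-surname compounds: "Kopernikusstraße".
COMPOUND = re.compile(r"^(?P<name>[A-ZÄÖÜ][a-zäöüß]+)(straße|platz|weg|allee)$")

def extract_candidate(street: str) -> str | None:
    """Return a candidate person name, or None if manual review is needed."""
    if m := HYPHENATED.match(street):
        return m.group("name").replace("-", " ")  # "Hannah Karminski"
    if m := COMPOUND.match(street):
        return m.group("name")                    # "Kopernikus"
    return None

for s in ["Hannah-Karminski-Straße", "Kopernikusstraße", "Unter den Linden"]:
    print(s, "->", extract_candidate(s))
```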
In any case, extraction of place names is only one step. The data will need to be sanitized and this will be a largely human-centered process. For instance, Melli-Beese-Straße and Amelie-Beese-Zeile are (probably?) both named for the same person, but this would not be clear unless we further incorporate a nickname correlation dictionary. This is not worth it for a few hundred data points. Instead, we should rely on human-centered knowledge to be able to merge these entries as two place names referencing the same historical person.
The ideal model will allow users to merge the two entries for these two places into one entry representing a person with two place names linked to that identity. This may change over time, as not every such relationship may be immediately clear.
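A minimal sketch of that model, assuming person-centric entries (the Entry fields are placeholders): merging is just a union of linked places plus a recorded alias.

```python
# Sketch of a human-confirmed merge under an assumed person-centric model.
from dataclasses import dataclass, field

@dataclass
class Entry:
    canonical_name: str
    aliases: set[str] = field(default_factory=set)
    places: set[str] = field(default_factory=set)

def merge(a: Entry, b: Entry) -> Entry:
    """Combine two entries once a human confirms they are the same person."""
    return Entry(
        canonical_name=a.canonical_name,
        aliases=a.aliases | b.aliases | {b.canonical_name},
        places=a.places | b.places,
    )

melli = Entry("Melli Beese", places={"Melli-Beese-Straße"})
amelie = Entry("Amelie Beese", places={"Amelie-Beese-Zeile"})
merged = merge(melli, amelie)
print(merged.places)  # both place names now link to one person
```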
Beyond the data extraction/sanitization issues, my vision for this project includes the ability to contribute additional material and knowledge and to edit information as needed in the future. The ideal flow will use a GitHub-like (or even GitHub-actual) PR model, where peer review is required before content is published. An overarching goal of mine is to make GitHub-style PR flows more familiar to academics outside of the technology industry. This has a number of benefits over the wiki model (see: Why not just use a wiki? below).
Upon a merged PR, the content generation pipeline should kick off automatically.
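With GitHub webhooks this is straightforward: a pull_request event arrives with action "closed" and a true merged flag when a PR is merged. A minimal sketch, where trigger_build is a hypothetical stand-in for the Lambda/CodeBuild step above:

```python
# Detect a merged PR in a GitHub pull_request webhook payload.
import json

def trigger_build() -> None:
    """Placeholder: in practice, call the Lambda/CodeBuild step above."""
    print("starting site build")

def on_pull_request(event_body: str) -> None:
    """Kick off content generation only when a PR is actually merged."""
    payload = json.loads(event_body)
    pr = payload.get("pull_request", {})
    if payload.get("action") == "closed" and pr.get("merged"):
        trigger_build()
```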
Metrizable
Ideally, this project will be able to expose how women are treated in the historical record vs. men. (Because we are talking about modern European history, I am using a binary gender dichotomy, false as it is, since binary gender will overwhelmingly dominate the historical record.) Our data should therefore let us quantify the history of women memorialized in the names of Berlin's public spaces: the number of people, the number of places per person, the centrality of those locations, the lifespans and lifetimes of those people, the searchable record (e.g. the number of words per Wikipedia article), and so on. The underlying data should make this analysis easy and replicable.
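A hypothetical illustration of how simple the analysis can stay if the data is structured this way (the gender field and the sample records are assumptions about the eventual schema):

```python
# Illustrative metrics over assumed person-centric entries.
from collections import Counter

entries = [
    {"name": "Melli Beese", "gender": "f",
     "places": ["Melli-Beese-Straße", "Amelie-Beese-Zeile"]},
    {"name": "Nikolaus Kopernikus", "gender": "m",
     "places": ["Kopernikusstraße"]},
]

people_by_gender = Counter(e["gender"] for e in entries)
places_per_person = {e["name"]: len(e["places"]) for e in entries}

print(people_by_gender)    # Counter({'f': 1, 'm': 1})
print(places_per_person)   # {'Melli Beese': 2, 'Nikolaus Kopernikus': 1}
```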
Map Integration
Mapping data is often highly specific and not always easy to present alongside other content (such as extensive text). We should therefore make sure our data incorporates mapping information in a form that can be easily extracted or converted for use with various mapping software (e.g. OpenStreetMap).
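One low-friction option is to store coordinates per place and export standard GeoJSON, which OpenStreetMap-based tools (Leaflet, uMap, etc.) consume directly. A sketch, with illustrative rather than verified coordinates:

```python
# Export place/person links as a standard GeoJSON FeatureCollection.
import json

def to_geojson_feature(place: str, person: str, lon: float, lat: float) -> dict:
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},  # GeoJSON is lon, lat
        "properties": {"place": place, "person": person},
    }

feature = to_geojson_feature("Melli-Beese-Straße", "Melli Beese", 13.52, 52.43)
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, ensure_ascii=False))
```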
Approach
This epic will capture tickets to discuss the approach using the four key points described above. Roughly outlined, the work will proceed according to the following vague roadmap:
Build the ETL pipeline.
Build the revision cycle pipeline.
Build tools to convert persisted information to suitable SSG format -- an SSG pre-generator of sorts.
I reserve the right to change this last step somewhat, as it has the potential to become a self-defeating rabbit hole.
Why not just use a wiki?
The wiki model has a number of flaws, in my opinion. First, it is much more difficult to host as a static site. Second, the edit-by-anyone model has led to quite a bit of corruption of information, trolling, and academic dishonesty. Wikipedia isn't considered a reliable source for many reasons, but publish-before-review is a major one. Also, wikis are terribly old-school in terms of basic formatting and usability. I dislike the model, and I believe the peer-review/pull-request model is superior and will allow for better content moderation and accuracy.