markwhiting opened 1 year ago
As a related question: what is a column? Or, more specifically, what specifies a column?
Speaking with @linneagandhi, we agreed that at least a few things are required:
- name
- unit
- description, with examples etc.
- data options, type, and validation

Some columns are dependent on others, or are functionally derivable from others.
The goal is a "description that leads to a reliable response", which means descriptions are often iterated upon and require some measure of reliability. Reliability is tricky, as there is not a good standard that works across all data types. Additionally, when developing columns, a free-text response appears to be a needed first step so that we can gain an understanding of the scope of the column. This further challenges reliability and motivates a human review cycle before making higher-level determinations about how data can be validated at a unit level or in aggregate.
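To make the reliability point concrete, here is a minimal sketch of the crudest possible agreement measure between raters coding a draft column. This is only percent agreement; chance-corrected measures (e.g., Cohen's kappa, Krippendorff's alpha) differ by data type, which is exactly why no single standard works everywhere. The function name and the example values are illustrative, not part of any existing tool.

```python
def percent_agreement(ratings_by_rater):
    """Fraction of items on which all raters gave the same value.

    ratings_by_rater: list of equal-length lists, one per rater.
    """
    n_items = len(ratings_by_rater[0])
    agree = sum(
        1 for i in range(n_items)
        if len({rater[i] for rater in ratings_by_rater}) == 1
    )
    return agree / n_items

# Two raters coding the same 4 papers for a draft "setting" column:
score = percent_agreement([
    ["lab", "field", "field", "online"],
    ["lab", "field", "online", "online"],
])
print(score)  # 0.75: the raters disagree on one of four papers
```

Even this trivial measure only applies to categorical responses; free-text first passes would need a different notion of agreement entirely, which is part of why the human review cycle is needed.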
However, this does suggest that columns could be summarized in a tidy format such as the following:
name | unit | description | data
--- | --- | --- | ---
doi | paper | What is the DOI of the paper? | DOI (a subset of URI?)
conditions | experiment | What conditions did the experiment have? | free_text list
(Of course, descriptions are probably much more sophisticated than this example)
Further, we may have more aspects to this specification around validation, aggregation, conceptual source, rating mechanism etc. And I could imagine those would all be features in this tidy specification of the set of columns (more columns about columns).
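One way to picture such a "columns about columns" specification is as a typed record per column, with the extra validation and aggregation aspects as additional fields. Everything below is illustrative (the field names and the DOI regex are assumptions, not a settled schema):

```python
from dataclasses import dataclass

@dataclass
class ColumnSpec:
    name: str
    unit: str             # what one row of this column describes
    description: str      # the question that should elicit a reliable response
    data: str             # expected data type / format
    validation: str = ""  # e.g., a regex or an allowed-value list
    aggregation: str = "" # how unit-level values roll up, if at all

specs = [
    ColumnSpec("doi", "paper", "What is the DOI of the paper?",
               "DOI (a subset of URI?)",
               validation=r"^10\.\d{4,9}/\S+$"),  # hypothetical DOI pattern
    ColumnSpec("conditions", "experiment",
               "What conditions did the experiment have?", "free_text list"),
]
```

Because each spec is itself a flat record, the set of specs stays tidy: it can be serialized as one more table and versioned alongside the data.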
Because our specification encompasses evolution and iterative improvement, we would want to store version information, perhaps as a GitHub blob or something else that formally identifies the current column among all columns.
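The GitHub-blob idea can be sketched directly: Git identifies a blob by the SHA-1 of `b"blob <length>\0" + content`, so hashing a column's spec text the same way gives a stable, content-derived version ID with no separate numbering scheme to maintain.

```python
import hashlib

def blob_id(spec_text: str) -> str:
    """Content-addressed version id, computed the way Git hashes a blob."""
    data = spec_text.encode()
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()

v1 = blob_id("doi | paper | What is the DOI of the paper? | DOI")
v2 = blob_id("doi | paper | What is the DOI of this paper? | DOI")
assert v1 != v2  # any wording change yields a new version id
```

A nice side effect is that the same spec text always hashes to the same ID, so independently drafted duplicates are detectable for free.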
A further note from discussion with @linneagandhi is that columns are often created in groups, or in relation to other columns. For example, you might have a set of quite intertwined columns about how results are reported, i.e., if one is true the others are by definition NA or have a required value. That kind of relationship is a little tricky to express in a tidy way, especially as things evolve, so I need to think more about whether column clustering should be formal or informal, or formalized in a higher-level abstraction, e.g., concepts.
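Even if the tidy spec cannot express these dependencies, they can at least be checked at the row level. A minimal sketch, with hypothetical column names (`results_reported`, `effect_size`, `effect_size_unit` are invented for illustration):

```python
def check_row(row: dict) -> list:
    """Return a list of dependency violations for one coded row."""
    errors = []
    # If results were not reported, the effect-size columns must be NA (None).
    if row.get("results_reported") is False:
        for col in ("effect_size", "effect_size_unit"):
            if row.get(col) is not None:
                errors.append(
                    f"{col} must be NA when results_reported is False")
    return errors

bad = {"results_reported": False, "effect_size": 0.3, "effect_size_unit": "d"}
ok = {"results_reported": True, "effect_size": 0.3, "effect_size_unit": "d"}
print(check_row(bad))  # two violations
print(check_row(ok))   # []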
Another question that came up in discussion with @xehu is something like "which columns matter when?"
In her case, she is taking in free-text versions of several columns that are not required to be machine readable at this time, e.g., `context`, where the response might be something like:

> 34 teams of 4 people based in a bike based tulip delivery startup in the Netherlands
This is in contrast to some of our other mapping efforts, where the goal of commensurability has driven us to exhaustively decompose columns like `context`, e.g., into `team_count`, `team_size`, `type_of_flower`, etc.

Of course, at a later time, the `context` column might turn into a series of more specific ones, but having this less formal column makes encoding easier and makes aspects of the final column design more effectively asynchronous; we hopefully have the data to make more formal columns from the informal one.
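The later formalization step could start as simple pattern mining over the informal `context` column; a sketch, with illustrative patterns that would still need the human review cycle described above before being trusted:

```python
import re

def decompose_context(text: str) -> dict:
    """Extract candidate formal columns from a free-text context entry."""
    out = {}
    # Hypothetical pattern: "<N> teams of <M> ..." -> team_count, team_size
    m = re.search(r"(\d+)\s+teams?\s+of\s+(\d+)", text)
    if m:
        out["team_count"] = int(m.group(1))
        out["team_size"] = int(m.group(2))
    return out

print(decompose_context(
    "34 teams of 4 people based in a bike based tulip delivery "
    "startup in the Netherlands"
))  # {'team_count': 34, 'team_size': 4}
```

Entries that match no pattern stay as free text, which is the point: the informal column keeps collecting data while the formal columns are still being designed.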
One design pressure or consideration here might be to make refining a column, e.g., promoting detail from `context` into `team_size`, a straightforward extension, so that the captured data can quickly be upscaled either at the time of recording or when columns are further detailed.

A related aspect reflected in this conversation was that a map may not be the desired output of a mapping process. In this case, the output is closer to a list of theory operationalizations within a certain domain. This is interesting because it effectively looks at only one dimension of the map at a time, which is not a view we have previously engaged with deeply (perhaps also relevant to the views discussion #4).
Column design is one of the most important aspects of research cartography; in short, we need to find good ways to establish measures we care about, to score them, and to validate them. The process of designing columns is heavily iterative, and interconnected with other columns and their performance, as well as with the research direction that is motivating the cartographic effort (and any evolution of that research direction during the mapping process).
So, how is this done now?
It appears each project has had somewhat different approaches to designing and refining columns. At a high level:
Each step has opportunities for researcher degrees of freedom, and we would ideally like to make it possible to reduce those as much as possible. The validation and finalization steps are most critical here, because they determine when something will become a core part of our data. If we do those badly, we get bad data.
A related challenge is that even when validating correctly, we can suffer from overfitting to the sample that was tested. So another consideration in this process is ensuring that the sampled stimuli (e.g., papers) are sufficiently distributed over the domain of candidates to be a good test of the measure.
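One guard against that kind of overfitting is to stratify the validation sample over whatever dimension of the domain matters for coverage. A minimal sketch; the `subfield` stratum and the paper records are invented for illustration:

```python
import random

def stratified_sample(papers, key, per_stratum, seed=0):
    """Draw up to per_stratum papers from each stratum defined by key."""
    rng = random.Random(seed)  # seeded so the validation set is reproducible
    by_stratum = {}
    for p in papers:
        by_stratum.setdefault(key(p), []).append(p)
    sample = []
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

papers = [{"doi": f"10.0/{i}", "subfield": s}
          for i, s in enumerate(["teams", "teams", "negotiation", "norms"])]
picked = stratified_sample(papers, key=lambda p: p["subfield"], per_stratum=1)
# Every subfield is represented, instead of the sample clustering in "teams".
```

The same idea applies to any stratification that approximates "distributed over the domain of candidates"; the hard part, as above, is choosing the dimension to stratify on.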
Many open questions remain in this process, but any thoughts from those actively mapping now would be very helpful. Also, if it's unclear what this is about, reading #1 might help, and asking questions in the comments below is always encouraged!