Open lemontheme opened 6 years ago
In practice, what we usually do is put data in raw unless there is a clear use case for external. We generally don't restrict raw to one dataset, which means we could put everything in raw. That said, often we're asked to look at a "primary" dataset. Over the course of the project, we find other data sources that are relevant or that we want to consider including. Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.
Also, with respect to the processing pipeline, I've seen both A and B in practice. It just depends on how much processing external needs for your particular analysis.
For example, if we want to add geographic regions to a dataset, we need shapefiles for those regions. Often these come from third-party sources. We usually put these in external, and they get used early on in the pipeline to augment the raw data (B). We then may select only specific regions as part of interim -> processed, and these feed into the analysis.
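As a rough illustration of that flow, here is a minimal sketch in plain Python. The records, field names, and both helper functions are hypothetical, and a simple district-to-region lookup stands in for a real shapefile join (which would normally need a geospatial library).

```python
# Hypothetical stand-ins for files under data/raw and data/external.
raw_households = [
    {"household_id": 1, "district": "D01", "income": 120.0},
    {"household_id": 2, "district": "D02", "income": 80.0},
    {"household_id": 3, "district": "D01", "income": 95.0},
]
# Assumption: a lookup table replaces the shapefile-based region assignment.
external_regions = {"D01": "North", "D02": "South"}


def augment_with_regions(rows, region_by_district):
    """raw + external -> interim: attach a region to every raw record."""
    return [
        {**row, "region": region_by_district.get(row["district"])}
        for row in rows
    ]


def select_regions(rows, keep):
    """interim -> processed: keep only the regions the analysis needs."""
    return [row for row in rows if row["region"] in keep]


interim = augment_with_regions(raw_households, external_regions)
processed = select_regions(interim, keep={"North"})
```

Here the external data is consumed at the very first step, so everything downstream only ever sees the augmented records.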
On the other hand, we've done projects where we do something like comparing published country-level poverty rates to those calculated from a survey. In this case, the raw data in the survey gets aggregated to country-level estimates during interim -> processed. We then directly compare these to the external datasets that feed into the analysis.
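This second scenario could look roughly like the sketch below, where the external data skips the early pipeline steps and is only touched at comparison time. The survey rows, the poverty line, the published rates, and both function names are made up for illustration.

```python
def poverty_rate_by_country(survey_rows, poverty_line):
    """interim -> processed: share of respondents below the line, per country."""
    totals, poor = {}, {}
    for row in survey_rows:
        country = row["country"]
        totals[country] = totals.get(country, 0) + 1
        poor[country] = poor.get(country, 0) + (row["income"] < poverty_line)
    return {country: poor[country] / totals[country] for country in totals}


def compare_to_published(estimated, published):
    """Analysis step: estimated minus published rate for shared countries."""
    return {
        country: estimated[country] - published[country]
        for country in estimated
        if country in published
    }


# Hypothetical stand-in for survey data under data/raw.
survey = [
    {"country": "A", "income": 1.0},
    {"country": "A", "income": 3.0},
    {"country": "B", "income": 1.0},
    {"country": "B", "income": 1.5},
    {"country": "B", "income": 4.0},
    {"country": "B", "income": 5.0},
]
# Hypothetical stand-in for published rates under data/external.
published_rates = {"A": 0.25, "B": 0.5}

estimated_rates = poverty_rate_by_country(survey, poverty_line=2.0)
differences = compare_to_published(estimated_rates, published_rates)
```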
TL;DR: I would recommend everything in raw unless there is a clear internal or "primary" dataset.
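Written as code, that rule of thumb might look like the following sketch. The function and parameter names are hypothetical, and every use case is different, so treat it as one reading of the convention rather than a fixed policy.

```python
from pathlib import Path


def target_dir(provided, has_primary_dataset, data_root=Path("data")):
    """Suggest a directory for a newly obtained, unmodified dataset.

    provided: True if the dataset was handed to us as part of the project,
        False if we found it elsewhere.
    has_primary_dataset: True if the project has a clear "primary" dataset.
    """
    if not has_primary_dataset:
        # No clear primary dataset: everything goes in raw.
        return data_root / "raw"
    # Otherwise separate what was provided from what we found ourselves.
    return data_root / ("raw" if provided else "external")
```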
Does that help answer your question?
Thanks for the detailed response! That certainly clears up a lot.
As I understand it, then, the difference between external and raw doesn't relate so much to the question of where the data comes from but more to a (variable) combination of the data's 'function' within and 'specificity' to the project, in addition to the data's origin. Here's a truly humble attempt at visualizing what I mean: =p
              Function?
+-----------------------------------+
|     Central    |    Supporting    |
+----------------+------------------+-----+
|      raw              raw         | Yes |
|                                   |-----|  Project-specific?
|    external        external       | No  |
+-----------------------------------+-----+
Like I said: humble. (It would appear – in this analysis at least – that 'project-specificity' is the winning dimension.)
In any case, I found the following sentence to be particularly helpful, since it really struck a chord.
Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.
It's such a simple idea, but I've lost count of how many times I've done the exact opposite, only to be confronted with the ugly consequences when revisiting the project months later.
While every use case is different, I would almost suggest incorporating it into the documentation somehow.
Changed title and clarified that this is an easy doc fix.
Based on the <- descriptions in the 'Directory structure' section of the documentation, there doesn't seem to be a clear-cut criterion for choosing between data/external and data/raw in those cases where the original data dump originates from a third-party source, i.e., fulfills the conditions for inclusion in either directory. What sort of criteria do you apply in such ambiguous cases?
On a related note, where does data/external fit in your 'mental model' of the preprocessing pipeline? (Pick one)