[DOCS] Clarify difference in function between data/external and data/raw

lemontheme commented 6 years ago

Based on the <- descriptions in the 'Directory structure' section of the documentation, there doesn't seem to be a clear-cut criterion for choosing between data/external and data/raw in those cases where the original data dump originates from a third-party source, i.e., fulfills the conditions for inclusion in either directory.

What sort of criteria do you apply in such ambiguous cases?

On a related note, where does data/external fit in your 'mental model' of the preprocessing pipeline? (Pick one)

A

raw ---> interim ---> processed ---+
                                   |
                                   +---> [ analysis ]
                                   |
                      external ----+

B

raw -------+
           |
           +--> interim ---> processed ---> [ analysis ]
           |
external --+

pjbull commented 6 years ago

In practice, we usually put data in raw unless there is a clear use case for external. We generally don't restrict raw to one dataset, which means we could put everything in raw. That said, often we're asked to look at a "primary" dataset. Over the course of the project, we find other data sources that are relevant or that we want to look at including. Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.

Also, WRT the processing pipeline, I've seen both A and B in practice. It just depends on how much processing external needs for your particular analysis.

For example, if we want to add geographic regions to a dataset, we need shapefiles for those regions. Often these come from third-party sources. We usually put these in external, and they get used early on in the pipeline to augment the raw data (B). We then may select only specific regions as part of interim -> processed, and these feed into the analysis.
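As a rough sketch of flow B, with plain Python dicts standing in for real data: the record fields, district codes, and region names below are made up for illustration, and a real project would more likely read actual shapefiles with a geospatial library.

```python
# Hypothetical stand-ins: in a real project these would be loaded from
# data/raw/ and data/external/ respectively.
raw_records = [
    {"id": 1, "district": "D1", "value": 10},
    {"id": 2, "district": "D2", "value": 20},
    {"id": 3, "district": "D3", "value": 30},
]
external_regions = {"D1": "North", "D2": "South", "D3": "North"}


def augment_with_regions(records, region_lookup):
    """raw + external -> interim: attach a region to every raw record."""
    return [{**r, "region": region_lookup.get(r["district"])} for r in records]


def select_regions(records, keep):
    """interim -> processed: keep only the regions of interest."""
    return [r for r in records if r["region"] in keep]


interim = augment_with_regions(raw_records, external_regions)
processed = select_regions(interim, keep={"North"})
```

Here external enters at the very first step, so everything downstream of interim already carries the region information.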

On the other hand, we've done projects where we do something like comparing published country-level poverty rates to those calculated from a survey. In this case, the raw data in the survey gets aggregated to country-level estimates during interim -> processed. We then directly compare these to the external datasets that feed into the analysis (A).
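A comparable sketch of flow A, again with invented values: the survey rows, country codes, and published rates are placeholders, not real figures. The point is only that external is untouched until the comparison step.

```python
from collections import defaultdict

# Hypothetical stand-ins for data/raw/ (survey rows) and
# data/external/ (published country-level rates).
raw_survey = [
    {"country": "A", "poor": True},
    {"country": "A", "poor": False},
    {"country": "B", "poor": True},
    {"country": "B", "poor": True},
]
external_published = {"A": 0.40, "B": 0.90}


def poverty_rates(rows):
    """raw -> processed: aggregate survey rows to a rate per country."""
    counts = defaultdict(lambda: [0, 0])  # country -> [poor, total]
    for row in rows:
        counts[row["country"]][0] += int(row["poor"])
        counts[row["country"]][1] += 1
    return {c: poor / total for c, (poor, total) in counts.items()}


def compare(estimated, published):
    """analysis: difference between survey estimate and published rate."""
    return {c: estimated[c] - published[c] for c in estimated if c in published}


processed = poverty_rates(raw_survey)
differences = compare(processed, external_published)
```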

TL;DR: I would recommend everything in raw unless there is a clear internal or "primary" dataset.
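If it helps, that rule of thumb could be written down as a tiny helper. This is only a sketch: the function name and both parameters are hypothetical and not part of the cookiecutter template.

```python
from pathlib import Path


def data_dir_for(provided_with_project: bool, has_primary_dataset: bool = True) -> Path:
    """Pick a data directory following the rule of thumb above.

    Everything goes in data/raw unless the project has a clear "primary"
    dataset and this source was found elsewhere rather than provided.
    """
    if has_primary_dataset and not provided_with_project:
        return Path("data") / "external"
    return Path("data") / "raw"
```

For example, `data_dir_for(provided_with_project=False)` points at data/external, while any source in a project without a primary dataset lands in data/raw.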

Does that help answer your question?

lemontheme commented 6 years ago

Thanks for the detailed response! That certainly clears up a lot.

As I understand it, then, the difference between external and raw doesn't relate so much to the question of where the data comes from, but more to a (variable) combination of the data's 'function' within and 'specificity' to the project, in addition to the data's origin. Here's a truly humble attempt at visualizing what I mean: =p

             Function?
+-----------------------------------+
|    Central     |    Supporting    |
+----------------+------------------+-----+
|  raw                raw           | Yes |
|                                   |-----|  Project-specific?
|  external           external      | No  |
+-----------------------------------+-----+

Like I said: humble. (It would appear – in this analysis at least – that 'project-specificity' is the winning dimension.)

In any case, I found the following sentence to be particularly helpful, since it really struck a chord.

Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.

It's such a simple idea, but I've lost count of how many times I've done the exact opposite, only to be confronted with the ugly consequences when revisiting the project months later.

While every use case is different, I would almost suggest incorporating it into the documentation somehow.

isms commented 5 years ago

Changed title and clarified that this is an easy doc fix.

drivendataorg / cookiecutter-data-science

[DOCS] Clarify difference in function between data/external and data/raw #136