Closed adyork closed 5 years ago
@BCODMO/frictionlessdata-pilot-project, does this make sense to you? Am I missing a way to configure join step to do the equivalent of "outer?"
The above example is just one case we would want outer for catching errors in the key, but there are plenty of times you just need to do an outer join, not just for catching errors.
I added a new usecase which would need a full outer join. Hopefully this clarifies one need I described above. This is a complicated one, but skip to the notes in the README for data version 2, processing needs: https://github.com/BCODMO/frictionless-usecases/blob/master/usecases/water_column_particle_profiles/README.md#data-version-2
The implementation at a review. To test it now:
pip install --upgrade git+https://github.com/roll/dataflows.git@full-outer-join
pip install --upgrade git+https://github.com/okfn/datapackage-pipelines.git@full-outer-join
E.g.
join:
title: join
description: "test join"
pipeline:
- run: load
parameters:
from: 'population.csv'
name: population
format: csv
- run: load
parameters:
from: 'cities.csv'
name: cities
format: csv
- run: join
parameters:
mode: full-outer
source:
name: population
key: [id]
target:
name: cities
key: [id]
fields:
population:
name: population
- run: dump.to_path
parameters:
resources: [cities]
out-path: 'output'
pretty-descriptor: true
cc @cschloer
OH great! We will test this out.
@roll Are you waiting for us to review before merging into master? If so, could you pull master into the branch? I don't want to lose other changes that you've made to dataflows since creating the branch.
@cschloer I'm going to make changes today/tomorrow asked by Adam (the DPP maintainer) and then it will be added/realeased
It's released:
pip install --upgrade datapackage-pipelines==2.1.9
join:
title: join
description: "test join"
pipeline:
- run: load
parameters:
from: 'population.csv'
name: population
format: csv
- run: load
parameters:
from: 'cities.csv'
name: cities
format: csv
- run: join
parameters:
mode: full-outer
source:
name: population
key: [id]
target:
name: cities
key: [id]
fields:
population:
name: population
- run: dump.to_path
parameters:
resources: [cities]
out-path: 'output'
pretty-descriptor: true
Thanks again for implementing full outer joins. I used it today to join species identifiers into a dataset and am grateful!
I was able to see where there were umatched keys in both source and target now!
Hi @roll, thanks for adding the modes. We've been using them. However, it seems the modes are not in the documentation yet. https://github.com/frictionlessdata/datapackage-pipelines#join
@cschloer you can chime in but I think we had some questions about how this works.
Great! Thanks.
I have some suggestions for clarifying if you think it would be helpful. I tried to keep to the source/target model language.
Essentially what I am trying to clarify is when there are key mismatches, whether the rows are dropped or added to the result with null values in fields that originated from either the source or target.
Also, it really matters when translating the LEFT/RIGHT join model to the target/source model to clarify which is considered the "target." Essentially, you can flip flop what you can consider a left or right join if you switch which resource is the source or target.
Also, if people seem to be still confused about joins and modes, what I have found useful when explaining this to people is to show them these types of circle diagrams. You could adapt these join-type circle diagrams from the SQL model to use the source/target model terms by replacing the table1&2 with source&target.
The way pipelines/dataflows architectured joins is pretty inflexible compared with join tools we were familiar with due to the "source" "target" logic and inability to achieve a full outer join.
This is my summary of the core of the issue as best I can tell. Please let me know if there is an existing join configuration other method I am missing here.
What we really want is the join types available in any other join utility I have ever used (e.g. sql, pandas, tableau, etc) with join types:
https://www.w3schools.com/sql/sql_join.asp
pipelines/dataflows has a Full: True|False option which from what I can tell gets us "inner", and "right" joins.
Left is impossible in the target/source model, but that isn't a big limitation since we can achieve the same results and with being careful about what we call the "source" and "target." Though I have run into workflow issue with the order dependency in general, see #11.
The biggest issue I can see is that I can't do a full outer join (meaning return all rows from either source regardless of whether there was a match).
Here is an example of why we want full outer.
You have a dataset with bottle samples and nutrients (resource: nutrients). You have an event log with the bottle_ids that you can join to get lat/lon/depth columns into your nutrient data.
In a perfect world, all bottle IDs would be in the event log data. In reality often what you are using as keys might have issues that cause no match in the "source."
Perfect world: All bottle_ids in nutrients data are also in event log
Reality: Missing key values in event log, or mismatches between key values (e.g. "statin 12" vs "station 12" or "12_a" vs "12a."
If we could do full outer join we could easily see where there are no matches in either data resource. With the existing join options you loose any unmatched values in the key of the event log (source) so can't tell what is going on, whether it is inconsistent naming, misspelling, or just missing values