BCODMO / frictionless-usecases

Example datasets submitted to BCO-DMO
MIT License
2 stars 0 forks source link

No way to do equivalent of a "Full Outer" join? #12

Closed adyork closed 5 years ago

adyork commented 5 years ago

The way pipelines/dataflows architectured joins is pretty inflexible compared with join tools we were familiar with due to the "source" "target" logic and inability to achieve a full outer join.

This is my summary of the core of the issue as best I can tell. Please let me know if there is an existing join configuration other method I am missing here.

What we really want is the join types available in any other join utility I have ever used (e.g. sql, pandas, tableau, etc) with join types:

image image image image https://www.w3schools.com/sql/sql_join.asp

pipelines/dataflows has a Full: True|False option which from what I can tell gets us "inner", and "right" joins.

Left is impossible in the target/source model, but that isn't a big limitation since we can achieve the same results and with being careful about what we call the "source" and "target." Though I have run into workflow issue with the order dependency in general, see #11.

The biggest issue I can see is that I can't do a full outer join (meaning return all rows from either source regardless of whether there was a match).

Here is an example of why we want full outer.

You have a dataset with bottle samples and nutrients (resource: nutrients). You have an event log with the bottle_ids that you can join to get lat/lon/depth columns into your nutrient data.

In a perfect world, all bottle IDs would be in the event log data. In reality often what you are using as keys might have issues that cause no match in the "source."

Perfect world: All bottle_ids in nutrients data are also in event log

Reality: Missing key values in event log, or mismatches between key values (e.g. "statin 12" vs "station 12" or "12_a" vs "12a."

If we could do full outer join we could easily see where there are no matches in either data resource. With the existing join options you loose any unmatched values in the key of the event log (source) so can't tell what is going on, whether it is inconsistent naming, misspelling, or just missing values

adyork commented 5 years ago

@BCODMO/frictionlessdata-pilot-project, does this make sense to you? Am I missing a way to configure join step to do the equivalent of "outer?"

The above example is just one case we would want outer for catching errors in the key, but there are plenty of times you just need to do an outer join, not just for catching errors.

adyork commented 5 years ago

I added a new usecase which would need a full outer join. Hopefully this clarifies one need I described above. This is a complicated one, but skip to the notes in the README for data version 2, processing needs: https://github.com/BCODMO/frictionless-usecases/blob/master/usecases/water_column_particle_profiles/README.md#data-version-2

roll commented 5 years ago

The implementation at a review. To test it now:

pip install --upgrade git+https://github.com/roll/dataflows.git@full-outer-join
pip install --upgrade git+https://github.com/okfn/datapackage-pipelines.git@full-outer-join

E.g.

join:
  title: join
  description: "test join"
  pipeline:

  - run: load
    parameters:
      from: 'population.csv'
      name: population
      format: csv

  - run: load
    parameters:
      from: 'cities.csv'
      name: cities
      format: csv

  - run: join
    parameters:
      mode: full-outer
      source:
        name: population
        key: [id]
      target:
        name: cities
        key: [id]
      fields:
        population:
          name: population

  - run: dump.to_path
    parameters:
      resources: [cities]
      out-path: 'output'
      pretty-descriptor: true

cc @cschloer

adyork commented 5 years ago

OH great! We will test this out.

cschloer commented 5 years ago

@roll Are you waiting for us to review before merging into master? If so, could you pull master into the branch? I don't want to lose other changes that you've made to dataflows since creating the branch.

roll commented 5 years ago

@cschloer I'm going to make changes today/tomorrow asked by Adam (the DPP maintainer) and then it will be added/realeased

roll commented 5 years ago

It's released:

pip install --upgrade datapackage-pipelines==2.1.9
join:
  title: join
  description: "test join"
  pipeline:

  - run: load
    parameters:
      from: 'population.csv'
      name: population
      format: csv

  - run: load
    parameters:
      from: 'cities.csv'
      name: cities
      format: csv

  - run: join
    parameters:
      mode: full-outer
      source:
        name: population
        key: [id]
      target:
        name: cities
        key: [id]
      fields:
        population:
          name: population

  - run: dump.to_path
    parameters:
      resources: [cities]
      out-path: 'output'
      pretty-descriptor: true
adyork commented 4 years ago

Thanks again for implementing full outer joins. I used it today to join species identifiers into a dataset and am grateful!

I was able to see where there were umatched keys in both source and target now!

adyork commented 4 years ago

Hi @roll, thanks for adding the modes. We've been using them. However, it seems the modes are not in the documentation yet. https://github.com/frictionlessdata/datapackage-pipelines#join

@cschloer you can chime in but I think we had some questions about how this works.

roll commented 4 years ago

Hi @adyork,

https://github.com/datahq/dataflows/blob/master/PROCESSORS.md#joinpy

adyork commented 4 years ago

Great! Thanks.
image

I have some suggestions for clarifying if you think it would be helpful. I tried to keep to the source/target model language.

Essentially what I am trying to clarify is when there are key mismatches, whether the rows are dropped or added to the result with null values in fields that originated from either the source or target.

Also, it really matters when translating the LEFT/RIGHT join model to the target/source model to clarify which is considered the "target." Essentially, you can flip flop what you can consider a left or right join if you switch which resource is the source or target.

adyork commented 4 years ago

Also, if people seem to be still confused about joins and modes, what I have found useful when explaining this to people is to show them these types of circle diagrams. You could adapt these join-type circle diagrams from the SQL model to use the source/target model terms by replacing the table1&2 with source&target. image

https://www.w3schools.com/sql/sql_join.asp