frictionlessdata / tableschema-js

A JavaScript library for working with Table Schema.
http://frictionlessdata.io/
MIT License

How to collect all cast errors including unique constraint violations? #122

Closed · ghost closed this issue 6 years ago

ghost commented 6 years ago

Validation for unique constraints doesn't appear to be implemented? Could this be added to an upcoming release, or is the validation performed elsewhere?

roll commented 6 years ago

It's implemented at the Table level (test example):

Could you provide a data example where it doesn't work?

Stephen-Gates commented 6 years ago

Using the library via Data Curator v0.8.0 I get...

[screenshot from Data Curator]

roll commented 6 years ago

Could you share the descriptor created by Data Curator?

  const {Table} = require('tableschema')

  // Column "a" is declared unique below, but the first two data rows both use the value 1.
  const data = [
    ['a', 'b', 'c'],
    ['1', '2', '3'],
    ['1', '3', '2'],
    ['2', '4', '1'],
  ]
  const schema = {
    fields: [
      {name: 'a', type: 'integer', constraints: {unique: true}},
      {name: 'b', type: 'integer'},
      {name: 'c', type: 'integer'},
    ]
  }
  const table = await Table.load(data, {schema})
  await table.read()
 $ node tmp/unique.js
TableSchemaError: Row 3 has an unique constraint violation in column "a"
Stephen-Gates commented 6 years ago

Sorry got to run for the bus but here's an ugly version

{"profile":"tabular-data-package","resources":[{"profile":"tabular-data-resource","encoding":"utf-8","schema":{"fields":[{"name":"a","type":"integer","format":"default","constraints":{"unique":true}},{"name":"b","type":"integer","format":"default"},{"name":"c","type":"integer","format":"default"}],"missingValues":[""]},"format":"csv","mediatype":"text/csv","sources":[{"title":"","path":"","email":""}],"name":"test","path":"data/test.csv"}],"sources":[{"title":"","path":"","email":""}],"name":"test"}

roll commented 6 years ago

I can't reproduce:

  const {Package} = require('datapackage')

  const descriptor = {
    "profile": "tabular-data-package",
    "resources": [
      {
        "profile": "tabular-data-resource",
        "encoding": "utf-8",
        "schema": {
          "fields": [
            {
              "name": "a",
              "type": "integer",
              "format": "default",
              "constraints": {
                "unique": true
              }
            },
            {
              "name": "b",
              "type": "integer",
              "format": "default"
            },
            {
              "name": "c",
              "type": "integer",
              "format": "default"
            }
          ],
          "missingValues": [
            ""
          ]
        },
        "format": "csv",
        "mediatype": "text/csv",
        "name": "test",
        "path": "tmp/unique.csv"
      }
    ],
    "name": "test"
  }
  const dp = await Package.load(descriptor, {basePath: '.'})
  await dp.getResource('test').read()
$ node tmp/unique.js
TableSchemaError: Row 3 has an unique constraint violation in column "a"
Stephen-Gates commented 6 years ago

@mattredbox can you look at the Data Curator side to make sure we're not dropping the returned error?

Thanks @roll

ghost commented 6 years ago

Hi @Stephen-Gates and @roll. I noticed, @roll, that in your example from the unit test you use await table.read(). Similarly, when I use this in Data Curator, I can see the unique constraint violation.

However, in order to interrogate errors in rows and build an error message for each, following your docs, we're creating a stream iterator: const stream = await table.iter({extended: true, stream: true, cast: false}) and then using the castRow function to interrogate each row: schema.castRow(row). Can you confirm whether or not we should be able to get unique constraint violations from this (as we can for other constraints such as minimum value or required)? Perhaps we're not using these methods correctly, or is there a reason unique constraint violations shouldn't be available to us here?
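
For reference, the pattern described here looks roughly like the sketch below. The error-collecting shape, the for await consumption of the stream, and the [rowNumber, headers, row] destructuring of extended rows are assumptions, not documented behaviour:

  const {Table} = require('tableschema')

  async function collectCastErrors(source, schema) {
    const table = await Table.load(source, {schema})
    // Ask for raw, uncast rows so we can cast them ourselves and catch errors per row.
    const stream = await table.iter({extended: true, stream: true, cast: false})
    const problems = []
    for await (const [rowNumber, headers, row] of stream) {
      try {
        table.schema.castRow(row)
      } catch (error) {
        // Unique constraint violations never show up here: castRow() sees one
        // row at a time and keeps no state across rows (see roll's reply below).
        problems.push({rowNumber, message: error.message})
      }
    }
    return problems
  }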

roll commented 6 years ago

@mattRedBox I see - that's the gotcha: schema.castRow(row) knows nothing about previously cast rows. So there is one constraint that exists only at the Table level - the unique constraint.

We need to think about how to resolve it. At a high level, the right way is https://github.com/frictionlessdata/tableschema-js/issues/112, but maybe there is a quick fix.

pwalsh commented 6 years ago

I am not sure this is a gotcha exactly - unique validation requires reading the table to check uniqueness.

@roll I guess one quick option would be to allow some configuration where the UniqueFieldsCache can be a global variable, with the gotcha that the user of the library needs to manage the state of that variable - roughly along the lines of the sketch below.
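
As an illustration of that idea (not library code; the function name and the keyed-row shape are assumptions), the caller could keep a cache of previously seen values for each unique field and check rows against it alongside castRow():

  // Build a checker that remembers values already seen in unique columns.
  function makeUniqueChecker(schemaDescriptor) {
    const uniqueFields = schemaDescriptor.fields
      .filter(field => field.constraints && field.constraints.unique)
      .map(field => field.name)
    const seen = new Map(uniqueFields.map(name => [name, new Set()]))

    return function check(keyedRow, rowNumber) {
      const violations = []
      for (const name of uniqueFields) {
        const value = keyedRow[name]
        if (seen.get(name).has(value)) {
          violations.push(
            `Row ${rowNumber} has a unique constraint violation in column "${name}"`)
        } else {
          seen.get(name).add(value)
        }
      }
      return violations
    }
  }

The caveat above applies: the caller has to create a fresh checker for each table read, otherwise stale state leaks between runs.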

ghost commented 6 years ago

Thanks @roll and @pwalsh. Understood - and yep, it all makes sense. We first started casting rows ourselves when we couldn't see how to collect all errors from table.read(). I like the idea of something like a 'force parse' option 👍 Obviously, for us, we can't use both table.read() and schema.castRow(row). While we could collect unique errors as a quick workaround for now (the messages seem to be consistent across both methods), I'm not sure how to guarantee that we always get the unique constraint violation, and not some other error, from table.read()
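
A hedged sketch of the stopgap being discussed: run table.read() purely to surface the table-level error and match on the message text shown earlier in this thread. This is brittle, and the exact error shape is an assumption:

  try {
    await table.read()
  } catch (error) {
    if (error.message.includes('unique constraint violation')) {
      // Report the duplicate alongside the per-row errors collected via castRow().
    } else {
      // Some other cast error was thrown first, so the unique status is unknown.
    }
  }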

roll commented 6 years ago

@pwalsh I thought about it a little, and for now I don't think trying to hack it somehow is the best strategy. It seems the proper solution could take about the same amount of time.

Here - https://github.com/frictionlessdata/tableschema-js/blob/master/src/table.js#L62-L168 - we just need an option so that, instead of throwing on the first cast error, iteration continues and yields an Error in place of each failed row.

In this mode, a client should just check the type of each returned row. If it's not an array but an Error, it means there are cast errors in that row. To get them, the client just has to access error.errors. And that's exactly what Data Curator needs.
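
Hypothetically, consuming such a mode could look like the sketch below; the option name and the exact shape of error.errors are assumptions about the proposed change, not the current API:

  // forceCast is a placeholder name for the proposed option.
  const stream = await table.iter({stream: true, forceCast: true})
  for await (const row of stream) {
    if (row instanceof Error) {
      // A row that failed casting: its individual cast errors would be on row.errors.
      for (const castError of row.errors) {
        console.log(castError.message)
      }
    } else {
      // A successfully cast row (an array of values).
    }
  }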

tabulator returns an empty row - https://github.com/frictionlessdata/tabulator-py#force-parse - but it's because there is no need to collect errors.

@mattRedBox Would you like to try to PR this? It seems it's just moving a small chunk of the Data Curator code to the tableschema lib.

ghost commented 6 years ago

Thanks @roll, yes, happy to. Just need to check with @Stephen-Gates first where we fit this into current development.

Stephen-Gates commented 6 years ago

@mattRedBox Data Curator Issue #311 is in milestone v0.9.0