datasets / s-and-p-500-companies

List of companies in the S&P 500 together with associated financials
https://datahub.io/core/s-and-p-500-companies

Simplify data validation done in data_test.py #21

Closed: noahg closed this issue 6 years ago

noahg commented 6 years ago

The data_test.py step is failing because the pipeline class is no longer part of the goodtables package. I found the datapackage-pipelines project, which appears to target a different use case than one-off, simple validation.

I simply created https://github.com/noahg/s-and-p-500-csv/blob/master/scripts/validate.py as a quick way to check that the newly generated CSV conforms to datapackage.json.
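
Roughly, the idea is to load datapackage.json and read each resource so that every row is cast against the schema. A minimal sketch of that approach (not the linked script itself) using the datapackage-py library, with illustrative relative paths:

import sys

from datapackage import Package, exceptions

# Load the descriptor and check that datapackage.json itself is valid
package = Package('../datapackage.json')
if not package.valid:
    print('Invalid descriptor: {}'.format(package.errors))
    sys.exit(1)

# Reading a resource casts every row against its Table Schema,
# so a CastError here means the CSV does not conform
try:
    for resource in package.resources:
        resource.read(keyed=True)
except exceptions.CastError as error:
    print('Validation failed: {}'.format(error.errors))
    sys.exit(1)

print('CSV conforms to datapackage.json')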

My question, perhaps for @zelima: would my validate.py script suffice for this project going forward?

It's not clear to me what the organization's preference would be, as I'm finding varying validation steps (or none at all) across other, more recently updated datasets. Thanks!

rufuspollock commented 6 years ago

@Mikanebu can you check in with @zelima and get an answer here?

zelima commented 6 years ago

@noahg Thanks for raising this. Generally speaking, goodtables released a new version recently and has changed significantly - we have to review all our datasets and see whether there are failing tests like this elsewhere.

To answer your question - I'm not aware of any preferred approach to testing the validity of the datasets at the moment, so yes - your validate.py script would suffice.

A small comment about your code - why not use the goodtables validate method instead and make the code even simpler?

from goodtables import validate

# Validate the whole package: each resource's data is checked against its schema
validation_report = validate(
    '../datapackage.json', preset='datapackage')
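
The returned report is a plain dict, so a test step can key off it directly; a small sketch, assuming the top-level 'valid' flag of the goodtables 1.x report format:

import sys

from goodtables import validate

report = validate('../datapackage.json', preset='datapackage')

# Fail the step if any table in the package has errors
if not report['valid']:
    print(report)
    sys.exit(1)
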
noahg commented 6 years ago

Great, I'll proceed by following your example validation then. Thanks for the tip on using the validate method. Not sure how I overlooked that!