datasets / awesome-data

Curated list of quality open datasets
https://datahub.io/collections

Add unit tests and continuous integration #114

Open femtotrader opened 8 years ago

femtotrader commented 8 years ago

Hello,

Once datasets/registry is a DataPackage, it would be a good idea to check that every URL is reachable and that a request to it returns HTTP status code 200.

Such a test could be written with Python and requests (see the sample code in https://github.com/datasets/registry/issues/112).

A more rigorous approach (perhaps as a second step) would be to check that they are "valid" DataPackages.

This would prevent bad DataPackage URLs from being added to this repository.
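A minimal sketch of such an availability check, not the code from the linked issue. The names `head_status` and `find_unavailable` are made up for illustration, and it assumes a HEAD request is enough (some servers only answer GET correctly). The status lookup is passed in as a parameter so the logic can be tried without network access.

```python
from typing import Callable, Dict, Iterable


def head_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, following redirects."""
    # Imported lazily so find_unavailable() can be exercised offline.
    import requests

    response = requests.head(url, allow_redirects=True, timeout=timeout)
    return response.status_code


def find_unavailable(
    urls: Iterable[str],
    get_status: Callable[[str], int] = head_status,
) -> Dict[str, object]:
    """Map every URL that did not answer with 200 to its status code or error."""
    failures: Dict[str, object] = {}
    for url in urls:
        try:
            status = get_status(url)
        except Exception as exc:  # connection errors count as failures too
            failures[url] = repr(exc)
            continue
        if status != 200:
            failures[url] = status
    return failures
```

A test would then assert that `find_unavailable(urls)` returns an empty dict for the URLs listed in the registry.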

Kind regards

rufuspollock commented 8 years ago

Great suggestion!

danfowler commented 8 years ago

👍

Lots of datasets in this organization are not quite valid for one reason or another. It would be good to get some validation in place.

Going even further: https://github.com/frictionlessdata/ex-continuous-data-integration

femtotrader commented 8 years ago

The "problem" here is that each repository is responsible for testing whether its own data is valid.

You have to visit each repository to see if it's valid or not.

If the validator code or the datapackage spec changes, you have to rerun CI in EACH repository, which may take quite a long time.

I think we should be able to download a lot of datapackages locally (for example everything in https://github.com/datasets/registry), so a cache mechanism is very important (see https://github.com/frictionlessdata/datapackage-py/issues/72), and then run validation against the cached datapackages.

We would then only need CI in ONE repository, which would be responsible for testing whether the datapackages are valid (or not).
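A rough sketch of what such a cache could look like. It is only an illustration and not what datapackage-py provides: `cache_path`, `fetch_text` and `load_descriptor` are hypothetical names, the cache is just a directory of JSON files keyed by a hash of the URL, and nothing ever expires.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_path(url: str, cache_dir: Path) -> Path:
    """Derive a stable file name for url inside cache_dir."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return cache_dir / (digest + ".json")


def fetch_text(url: str) -> str:
    """Download url and return its body as text (standard library only)."""
    from urllib.request import urlopen

    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def load_descriptor(
    url: str,
    cache_dir: Path,
    fetch: Callable[[str], str] = fetch_text,
) -> dict:
    """Return the parsed datapackage.json for url, downloading it at most once."""
    path = cache_path(url, cache_dir)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        path.write_text(fetch(url), encoding="utf-8")
    return json.loads(path.read_text(encoding="utf-8"))
```

With something like this, the single CI job could fill the cache once and rerun validation on the local copies whenever the validator or the spec changes.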

rufuspollock commented 8 years ago

@danfowler I have ideas / plans as to how to do this. However, I want to do this as part of the systematic infrastructure upgrade we are planning here ;-)

femtotrader commented 8 years ago

Some (quick and dirty) code that might help.

from requests import Session
from unittest import TestCase

import re
import datapackage

# Matches a GitHub repository URL and captures (owner, repository).
pattern = re.compile(r"https://github\.com/(.*)/(.*)")

def fix_url(url, pattern):
    m = re.search(pattern, url)
    if m is not None:
        owner, repository = m.groups()
        return "https://raw.githubusercontent.com/%s/%s/master/datapackage.json" % (owner, repository)
    else:
        return url

class TestDatasets(TestCase):
    def setUp(self):
        self.session = Session()

    def test_datasets(self):
        url_registry = "https://github.com/datasets/registry"
        url_registry = fix_url(url_registry, pattern)
        dp_registry = datapackage.DataPackage(url_registry)
        print(url_registry)
        dp_registry.validate()

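        # Each registry row points at a dataset repository; validate every one.
        for resource in dp_registry.resources:
            for data in resource.data:
                url = data["url"]
                url = fix_url(url, pattern)
                # Load the dataset's own descriptor, not the registry's.
                dp = datapackage.DataPackage(url)
                print(url)
                dp.validate()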

This can be run using:

$ nosetests -s -v tests/test_dp.py

but there are 2 issues:

rufuspollock commented 8 years ago

@femtotrader that's amazing - thanks!