datopian / datahub-qa

:package: Bugs, issues and suggestions for datahub.io
https://datahub.io/

Publishing a large Linked Open Data project - how to? #147

Closed lambdamusic closed 6 years ago

lambdamusic commented 6 years ago

Hi,

I'm part of the SciGraph project and would like to make available our data (~200G) via datahub.io.

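I have two questions:

1. Can I create a record for a dataset hosted elsewhere? If yes, how?
2. What is the relationship between datahub and the LOD cloud (http://lod-cloud.net/)?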

Thanks in advance!

AcckiyGerman commented 6 years ago

@lambdamusic

> can I create a record for a dataset hosted elsewhere? if yes how?

Yes, that is very easy: you just create a datapackage.json file that describes your data (this is a simple example):

{
  "name": "name-of-the-datapackage",
  "resources": [
    {
      "path": "https://yourdomain.com/yourdata.csv",
      "pathType": "remote",
      "name": "remote-data-about-something",
      "format": "csv",
      "schema": {
        "fields": [
          {
            "name": "number",
            "type": "integer",
            "format": "default"
          },
          {
            "name": "string",
            "type": "string",
            "format": "default"
          },
          {
            "name": "boolean",
            "type": "boolean",
            "format": "default"
          }
        ],
        "missingValues": [
          ""
        ]
      }
    }
  ]
}

Then run the command data push, and the dataset will be uploaded to datahub.io.

To avoid typing this file by hand, you can use our data-cli tool: run data init in the folder where you store the data to infer the data structure into that file, then make the needed fixes (in your case, replace the paths of the data files). You can read more about the datapackage structure and the data-cli tool at http://datahub.io/docs. Or I could help you if you ping me (@acckiygerman) in this channel: https://gitter.im/datahubio/chat
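If you have many remote files, writing the resources list by hand gets tedious. Here is a rough sketch of generating a descriptor like the one above from a list of URLs. The function names and the example URLs are made up for illustration and are not part of data-cli; the sketch does not infer a schema, so you would still add that yourself or let data init do it.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse


def remote_resource(url):
    """Build one resource entry pointing at a file hosted elsewhere.

    Hypothetical helper: the resource name and format are simply
    derived from the file name in the URL.
    """
    file_path = PurePosixPath(urlparse(url).path)
    return {
        "path": url,
        "pathType": "remote",
        # Lowercase and dash-separated, like the name in the example above.
        "name": file_path.stem.lower().replace("_", "-"),
        "format": file_path.suffix.lstrip(".").lower(),
    }


def build_datapackage(name, urls):
    """Assemble a minimal descriptor: a package name plus one resource per URL."""
    return {"name": name, "resources": [remote_resource(u) for u in urls]}


if __name__ == "__main__":
    # Placeholder URLs; replace them with wherever your data really lives.
    example_urls = [
        "https://yourdomain.com/files/articles.csv",
        "https://yourdomain.com/files/journals.csv",
    ]
    descriptor = build_datapackage("name-of-the-datapackage", example_urls)
    # Print the JSON; save it as datapackage.json before running data push.
    print(json.dumps(descriptor, indent=2))
```

Saving the printed output as datapackage.json in an empty folder and running data push from there should then follow the same route as the hand-written example.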

AcckiyGerman commented 6 years ago

> what is the relationship between datahub and the LOD cloud (http://lod-cloud.net/)?

Personally, I don't know. They probably used our old datahub.io site as a source of their data. Maybe @zelima could answer?

zelima commented 6 years ago

@AcckiyGerman @lambdamusic I don't have any information about this.

Looking at the diagram and playing around with it a bit: in most cases, URLs redirect to old.datahub.io. Also, it was last updated on 2017-08-22, and at that point the current datahub was not live yet. I assume they are using old.datahub.io as a source for their project. I don't think there is any further relation between them and datahub.io.

Anyway, I think the best place to get correct information about things like this is https://gitter.im/datahubio/chat

lambdamusic commented 6 years ago

Awesome guys. Thanks very much. I'll play with the CLI and see how far I can get.

zelima commented 6 years ago

@lambdamusic I just read @AcckiyGerman's instructions for publishing data, and while they are complete and quite accurate, you could alternatively just run this and it will get published:

data push path/or/url/to/my/file[.ext]

datapackage.json and data init (which creates datapackage.json) are for describing your data in the best way. E.g., you could include key metadata such as a description of your dataset, licence, contributors, encoding, views, etc. You can read more about the Data Package specification here: https://frictionlessdata.io/specs/data-package/#specification
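To make the metadata part a bit more concrete, here is a small sketch of adding a description, licence and contributors to an existing descriptor. The property names (description, licenses, contributors) are taken from the Data Package specification linked above; the helper name and all the values are placeholders I made up, so treat it as an illustration rather than the recommended workflow.

```python
import json


def add_metadata(descriptor, description, license_name, license_url, contributors):
    """Return a copy of the descriptor with some package-level metadata added.

    Hypothetical helper. `contributors` is a list of (title, role) pairs.
    """
    enriched = dict(descriptor)  # shallow copy, the input dict is not modified
    enriched["description"] = description
    enriched["licenses"] = [{"name": license_name, "path": license_url}]
    enriched["contributors"] = [
        {"title": title, "role": role} for title, role in contributors
    ]
    return enriched


if __name__ == "__main__":
    # Minimal descriptor, e.g. what data init might have produced for you.
    base = {"name": "name-of-the-datapackage", "resources": []}
    enriched = add_metadata(
        base,
        description="Placeholder description of the dataset.",
        license_name="CC-BY-4.0",  # placeholder, use your dataset's real licence
        license_url="https://creativecommons.org/licenses/by/4.0/",
        contributors=[("Jane Doe", "author")],  # placeholder contributor
    )
    print(json.dumps(enriched, indent=2))
```

Encoding is set per resource rather than on the package, so it would go inside the individual entries of resources instead.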