freme-project / technical-discussion

This repository is used for technical discussions.

Documentation of datasets #56

Closed: jnehring closed this issue 8 years ago

jnehring commented 9 years ago

I think we should provide more information about our datasets and about how they were created.

  1. It should contain some information about the structure and contents of the dataset. I believe it is hard to use a long list of RDF triples without any explanation. This will make our datasets easier to use.
  2. It should contain information about the conversion process. It should also reference any scripts / mapping files (e.g. in a dedicated GitHub repository) that were used during conversion. When the datasets are updated, they have to be converted again, and this information will save a lot of work.

There is some overlap with the documentation homepage and the description field in the dataset API. We could set up a dedicated GitHub repository with a folder for each dataset, containing a Readme.md file with the above information and also the scripts. In the description field of each dataset we would have a short text about the dataset, including a link to the Readme.md file.
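A hypothetical layout for such a repository might look like this (the dataset and file names are only illustrative, borrowed from datasets mentioned later in this thread):

```
dataset-documentation/
├── orcid/
│   ├── Readme.md        # structure of the dataset, conversion notes
│   └── convert.sh       # script used for the conversion
└── global-airports/
    ├── Readme.md
    └── mapping.ttl      # mapping file used during conversion
```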

@m1ci what do you think?

m1ci commented 9 years ago

Every dataset created within the project is going to be described and published at datahub.io under the freme-project tag. See http://datahub.io/dataset?tags=freme-project

We can have a dedicated page like http://www.freme-projects.eu/datasets but I think it is still too early for it. We are working on the conversion of CORDIS FP7, which is very relevant; once we have it, and maybe also some of the UNdata datasets, we can create such a dedicated page.

> It should contain some information about the structure and contents of the dataset. I believe it is hard to use a long list of RDF triples without any explanation. This will make our datasets easier to use.

See the descriptions at http://datahub.io/dataset?tags=freme-project and let us know if we should extend them.

> It should contain information about the conversion process. It should also reference any scripts / mapping files (e.g. in a dedicated GitHub repository) that were used during conversion. When the datasets are updated, they have to be converted again, and this information will save a lot of work.

> There is some overlap with the documentation homepage and the description field in the dataset API.

What do you mean by dataset API?

> We could set up a dedicated GitHub repository with a folder for each dataset, containing a Readme.md file with the above information and also the scripts.

Hm... it's difficult to maintain dataset descriptions in several places. Let's decide on one and maintain that. GitHub is fine, but it's not a visible place. I suggest we focus on datahub.io, since it is a widely used search place for datasets. In the FREME documentation and the FREME NER dataset API we can provide a brief description (a few sentences) for each dataset. For more information, users can visit the datahub.io page. FYI, we also plan to provide a machine-readable description with a DataID document for each dataset.
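A minimal sketch of what such a machine-readable DataID description could look like, built with rdflib in Python. The dataid namespace URI and the property choices here are assumptions based on DataID reusing DCAT and Dublin Core terms; the actual FREME DataID documents may differ:

```python
# Sketch only: build a minimal DataID-style description for one dataset.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

DATAID = Namespace("http://dataid.dbpedia.org/ns/core#")  # assumed DataID core namespace
DCAT = Namespace("http://www.w3.org/ns/dcat#")

g = Graph()
g.bind("dataid", DATAID)
g.bind("dcat", DCAT)
g.bind("dct", DCTERMS)

dataset = URIRef("http://datahub.io/dataset/orcid-dataset")
g.add((dataset, RDF.type, DATAID.Dataset))
g.add((dataset, DCTERMS.title, Literal("ORCID dataset")))
g.add((dataset, DCTERMS.description, Literal("RDF conversion of public ORCID researcher profiles.")))
g.add((dataset, DCAT.landingPage, URIRef("http://datahub.io/dataset/orcid-dataset")))

print(g.serialize(format="turtle"))
```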

jnehring commented 9 years ago

I think it is a good idea to keep the main documentation of datasets on datahub.io. Then we can also abandon the idea of a Readme file.

> What do you mean by dataset API?

We have the fields "label" and "description" for datasets. I think that in the "description" field we should at least link to datahub.

What I am still missing in your comments is how to repeat the conversion of a dataset. I think the process right now is that datasets are converted once, and when we need to do it again in a year we have to start from scratch.
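To make the concern concrete: a conversion script that records its own source URL and mapping logic can simply be re-run when the source is updated. This is a hypothetical sketch, not the actual FREME pipeline; all URLs and field names are illustrative:

```python
# Hypothetical repeatable conversion: fetch the source dump, apply the
# mapping, and write N-Triples. Re-running regenerates the dataset.
import csv
import urllib.request

SOURCE_URL = "http://example.org/airports.csv"  # illustrative source dump

def convert(row):
    """Map one CSV row to an N-Triples line; this mapping is the part worth documenting."""
    subject = f"<http://example.org/airport/{row['id']}>"
    label = row["name"].replace('"', '\\"')
    return f'{subject} <http://www.w3.org/2000/01/rdf-schema#label> "{label}" .'

def main():
    urllib.request.urlretrieve(SOURCE_URL, "source.csv")
    with open("source.csv", newline="", encoding="utf-8") as src, \
         open("dataset.nt", "w", encoding="utf-8") as out:
        for row in csv.DictReader(src):
            out.write(convert(row) + "\n")

if __name__ == "__main__":
    main()
```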

m1ci commented 9 years ago

> We have the fields "label" and "description" for datasets. I think that in the "description" field we should at least link to datahub.

We will include datahub.io links in the description fields. @nilesh-c can you please add the links to the FREME NER dataset descriptions?

  1. Link to http://datahub.io/dataset/orcid-dataset in the description of the ORCID dataset.
  2. Link to http://datahub.io/dataset/global-airports-in-rdf in the description of the global-airports dataset.

> What I am still missing in your comments is how to repeat the conversion of a dataset. I think the process right now is that datasets are converted once, and when we need to do it again in a year we have to start from scratch.

We will document the conversion process in a Google doc and provide this information for the ORCID, DBpedia abstracts, Statbel, and any subsequently converted datasets.

jnehring commented 8 years ago

Moved to https://github.com/freme-project/dataset-conversion-scripts/issues/1