diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

(Metric Impl) Check to what extent the dataset reuses reputable datasets/ontologies #21

Closed by jerdeb 10 years ago

jerdeb commented 10 years ago

This metric belongs to the Reputation dimension and is a ComplexQualityMetric.

For this metric we need to check if the dataset resources are hosted in a reputable source.

The list of reputable sources should be "loaded" in the before method.

The reputable sources are (links to owl files):

- Ontology for Biomedical Investigations: http://purl.obolibrary.org/obo/obi.owl
- Cell Type Ontology: http://purl.obolibrary.org/obo/cto.owl
- Gene Ontology: http://purl.obolibrary.org/obo/go.owl
- PATO: http://purl.obolibrary.org/obo/pato.owl
- ChEBI: http://purl.obolibrary.org/obo/chebi.owl
- ORDO: http://www.orphadata.org/data/ORDO/ordo_orphanet.owl.zip (note this one is zipped)
- IAO: http://purl.obolibrary.org/obo/iao.owl
- NCBI Taxon: http://purl.obolibrary.org/obo/ncbitaxon.owl (warning, this is a very big file!)
- Uberon: http://purl.obolibrary.org/obo/uberon.owl
- Unit Ontology: http://purl.obolibrary.org/obo/uo.owl
- Software Ontology: http://sourceforge.net/projects/theswo/files/SWO%20ontology%20release (sorry, zipped again)

clange commented 10 years ago

@muhammadaliqasmi please do not yet start implementing; @jerdeb could you please clarify the following things?

Some questions:

muhammadaliqasmi commented 10 years ago

The Reputation of Dataset metric is already implemented, and the source code has been pushed to the master branch.

For this metric, all the reputable resources must be downloaded manually and saved in the ../src/main/resources/reputable directory.

All URIs (with path) found in these reputable resources are loaded into a HashSet before the metric computation starts.

During metric computation, the URIs (with path) of the resources found in the given quad are compared against the URIs in the HashSet. Any resource whose URI is not found there is marked as not reputable.
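The load-then-compare steps described above can be sketched roughly as follows. This is only an illustration, not the code that was pushed to master: the class name ReputableLookupSketch, the method names, and the use of plain strings for URIs are all assumptions, and it further assumes that each distinct resource is counted once. Parsing the downloaded ontology files is left out; the caller is expected to pass in the URIs that were extracted from them.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; names and structure are assumptions, not the project's actual classes.
public class ReputableLookupSketch {
    // URIs extracted from the manually downloaded reputable ontology files.
    private final Set<String> reputableUris = new HashSet<>();
    // Every distinct resource URI seen so far in the assessed dataset (assumed: distinct counting).
    private final Set<String> allResources = new HashSet<>();
    // The subset of allResources that was not found in reputableUris.
    private final Set<String> notReputable = new HashSet<>();

    /** Loads the reference URIs once, before any quad is processed. */
    public void before(Collection<String> urisFromReputableFiles) {
        reputableUris.addAll(urisFromReputableFiles);
    }

    /** Checks the resource URIs of a single quad against the reference set. */
    public void compute(String... quadResourceUris) {
        for (String uri : quadResourceUris) {
            allResources.add(uri);
            if (!reputableUris.contains(uri)) {
                notReputable.add(uri); // not found in any reputable source
            }
        }
    }

    public Set<String> getAllResources() {
        return allResources;
    }

    public Set<String> getNotReputable() {
        return notReputable;
    }
}
```

A HashSet keeps each lookup at roughly constant time, which matters when the reference set comes from large files such as the NCBI Taxon ontology.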

Metric Value = (Number of NOT reputable resources) / (Total number of resources)

Metric Value Range = [0, 1]; Best Case = 0, Worst Case = 1
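As a worked example of the ratio, a dataset with 12 resources of which 3 are not found in any reputable source would score 3 / 12 = 0.25. A minimal sketch of the calculation is given below; the class and method names are hypothetical, and returning 0 for an empty dataset is an assumption made here to avoid a division by zero, not something stated in this thread.

```java
// Hypothetical helper; not part of the project's actual code.
public class ReputationMetricValue {

    /**
     * Returns the share of resources that are NOT reputable, in the range [0, 1].
     * 0 is the best case (everything is reputable), 1 is the worst case.
     */
    public static double metricValue(long notReputable, long total) {
        if (notReputable < 0 || total < 0 || notReputable > total) {
            throw new IllegalArgumentException("counts must satisfy 0 <= notReputable <= total");
        }
        if (total == 0) {
            return 0.0; // assumption: an empty dataset has nothing non-reputable in it
        }
        return (double) notReputable / total;
    }
}
```

Because lower is better for this metric, a consumer that expects "higher is better" scores would need to use 1 minus this value.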