diachron / quality

Dataset Quality Assessment (part of WP5 of the Diachron EU FP7 project)
MIT License

Consistent metric names #51

Open nfriesen opened 10 years ago

nfriesen commented 10 years ago

While looking over the metric implementations for cleanup purposes, I noticed some inconsistencies in metric names:

  1. In most cases the metric name reflects the corresponding quality problem, e.g. 'UndefinedClasses', 'DuplicateInstances' or 'ObsoleteConceptInOtology'. However, some metric names express something else, e.g. LowUsageOfBlankNodes. Even if it is not possible for all metrics, in some cases it would make more sense for the name to reflect the quality problem, e.g. BlankNodeUsage. Some metric names are also confusing because they don't reflect the metric definition. It would be helpful either to adapt their implementation (if exactly this metric is required for a use case) or to rename them. This is the list of such metrics:
    • The ShortURIs metric actually computes the average URI length, so maybe rename it to AverageURILength?
    • The LowBlankNodeUsage metric actually computes the ratio of 'good' entities, i.e. the current implementation computes NoBlankNodesRatio, but it would make more sense to define BlankNodesRatio.
    • The UnstructuredData metric should probably be split into two metrics: UnstructuredData and DeadURIs.
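
To make the naming point concrete, here is a rough Python sketch of what the proposed names would compute. This is not the project's implementation: the function names simply mirror the names suggested above, and representing blank nodes as strings with the `_:` prefix is an assumption made only for this illustration.

```python
from typing import Iterable


def average_uri_length(uris: Iterable[str]) -> float:
    """Mean number of characters per URI (what 'AverageURILength' would report)."""
    uris = list(uris)
    if not uris:
        return 0.0  # assumption: an empty dataset yields 0.0
    return sum(len(u) for u in uris) / len(uris)


def is_blank_node(node: str) -> bool:
    """Assumption for this sketch: blank nodes are labelled with a '_:' prefix."""
    return node.startswith("_:")


def blank_nodes_ratio(nodes: Iterable[str]) -> float:
    """Share of nodes that are blank nodes: the problem-oriented value."""
    nodes = list(nodes)
    if not nodes:
        return 0.0  # assumption: no nodes means no blank-node problem
    return sum(1 for n in nodes if is_blank_node(n)) / len(nodes)


def no_blank_nodes_ratio(nodes: Iterable[str]) -> float:
    """Share of 'good' (non-blank) nodes: the complement of blank_nodes_ratio."""
    return 1.0 - blank_nodes_ratio(nodes)


if __name__ == "__main__":
    sample = ["_:b0", "http://example.org/a", "http://example.org/b", "_:b1"]
    print(blank_nodes_ratio(sample))     # share of blank nodes
    print(no_blank_nodes_ratio(sample))  # share of non-blank nodes
    print(average_uri_length(["http://example.org/a", "http://example.org/bb"]))
```

The two ratios always sum to 1, so the question is only which of them the metric name promises: a name like LowBlankNodeUsage does not say whether a high value is good or bad, whereas BlankNodesRatio does.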

@jerdeb BTW the test for the Dereferencability metric fails.

jerdeb commented 10 years ago

Most of those are specific to EBI use cases and are therefore not generic. They are marked as such. Also, there are a number of metrics that need to be reviewed after the deliverable. Keep this ticket open.

(Sent by email in reply to Natalja Friesen's comment of 24 July 2014, 11:26.)

nfriesen commented 10 years ago

@jerdeb Please check the test for the Dereferencability metric; it fails.