NCEAS / metadig-checks

MetaDIG suites and checks for data and metadata improvement and guidance.
Apache License 2.0

Review check "SKIP" and "OPTIONAL" behaviour/policy #149

Closed gothub closed 4 years ago

gothub commented 5 years ago

After having a look at a couple of quality reports and reviewing the scores and checks for Interoperable for a recent FAIR suite run, it appears that a high score can be obtained as a result of the 'SKIP' policies and 'OPTIONAL' type checks.

A check can be skipped if it determines that it doesn't apply to the metadata. For example, if no attributes are present in the metadata, an attribute-based check will return 'SKIP' as its status, and the check will not be used in the score calculation.

If a check has the level 'OPTIONAL' and it fails, it will return a 'WARNING' status and not be used in the score calculation. For the v0.2.0 FAIR suite run, the pid 'cc2b7b7d3f13c564fbd952886f1e5d2d', from PANGAEA, got a score of '0.83' for Interoperable. For the Interoperable checks, 5 were successful, 4 were skipped and 3 failed, but 2 of the failed checks were optional, so only one of the failed checks contributes to the score.

So the score calculation was (checks passed) / (total scored checks), or 5/(5+1) ≈ 0.83, although the pid only passed 5 of 12 checks.
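The scoring behavior described above can be sketched as a small function. This is a simplified model for illustration, not the actual MetaDIG quality engine code; the statuses and levels follow the terms used in this issue:

```python
# Simplified model of the FAIR score calculation described above:
# SKIPped checks and failed OPTIONAL checks are excluded from the
# denominator, so only passes and required failures are scored.

def fair_score(results):
    """results: list of (status, level) tuples, with
    status in {"SUCCESS", "FAILURE", "SKIP"} and
    level in {"REQUIRED", "OPTIONAL"}."""
    passed = sum(1 for status, _ in results if status == "SUCCESS")
    required_failed = sum(
        1 for status, level in results
        if status == "FAILURE" and level == "REQUIRED"
    )
    total = passed + required_failed
    return passed / total if total else 0.0

# The PANGAEA example: 5 passed, 4 skipped, 3 failed (2 of them optional).
results = (
    [("SUCCESS", "REQUIRED")] * 5
    + [("SKIP", "REQUIRED")] * 4
    + [("FAILURE", "OPTIONAL")] * 2
    + [("FAILURE", "REQUIRED")] * 1
)
print(round(fair_score(results), 2))  # 0.83, despite passing 5 of 12 checks
```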

IIRC the original intent of the 'skip' and 'optional' mechanisms was focused on the user improving their metadata, especially for metadata-only datasets (no data entities), so that they wouldn't be marked down for not having entities or attributes.

We should evaluate this policy and determine if it is at odds with suites like FAIR, where the overall score can be compared between repositories, dialects and suite versions.

For FAIR, should the metadata be scored lower if no entities or attributes are entered?

See https://github.com/NCEAS/metadig-checks/issues/148 for a list of FAIR checks and their associated category and level.

jeanetteclark commented 5 years ago

I think you summed up the issue nicely. This is the scenario I think we are hoping to avoid, and which I think we might be seeing:

If you do have entity/attribute information, but fail one of the required attribute checks such as entity.attributeStorageType.present.1, I'd bet that you could easily end up with a lower score than a similar dataset which had no attributes at all.

--

gothub commented 5 years ago

So these high Interoperable scores may be inappropriately compared to 'truer' high scores. For example, the ADC pid doi:10.18739/A2BN9X32Q got a score of '0.80', but it had the following: 8 successful checks and 4 failed checks (2 of which are optional), so the score was 8/10 (0.80).

This pid scored lower than the first example ISO pid, although it passed 3 more checks.

gothub commented 5 years ago

The following FAIR checks have a 'SKIP' condition, where the check will be skipped, not scored, if the element type to be checked is not present. Notice that there are no 'Accessible' skip conditions:

Findable
dataset.keywords.controlled.xml
dataset.keywordType.present.xml
dataset.naturalLanguageKeywords.present.xml

Interoperable
entity.attributeName.differs.xml
entity.attributeNames.unique.xml
entity.attributeDefinition.present.xml
entity.attributeDefinition.sufficient.xml
entity.attributeStorageType.present.xml
entity.name.present.xml

Reusable
entity.attributeDomain.present.xml
entity.attributeUnits.present.xml
entity.attributeMeasurementScale.present.xml
entity.attributePrecision.present.xml
entity.description.present.xml
gothub commented 5 years ago

From @mbjones (on slack):

I think the score should be lower

SKIP in the arctic suite was first used to be sure we weren't flagging issues that users had no power over, mainly entering attribute metadata, which at the time was not possible in the editor. Now that it is, those should become required for ADC. And I don't think SKIP makes much sense for FAIR at all.

amoeba commented 5 years ago

This is super tough. The way I read it, we're struggling to know whether a metadata record should have an element. EML doesn't handle this well, whereas ISO19115 does via nilReason. Scoring issues like this are akin to fixing a broken test in a test suite by removing tests.

For checks that can be conditionally skipped when elements are not found (e.g., entity, attribute), we probably need to bring in more information. This sounds impossible or at least really hard so long as metadig is oriented towards a single-metadata-record design.

If we were scoring an entire Data Package (or some other container), we could know whether there should be entity and attribute descriptions in the metadata record because we could scan the data. This effectively makes a lot of these checks into congruence checks.

I think our model, at the moment, is that we don't want to create checks that penalize datasets for not having information they shouldn't have in the first place. For example, we don't want to mark a dataset with a PDF file down for not having attributes. Or we don't want to mark down a dataset that isn't experimental for not having methods.

But for the ADC, for example, maybe we could flip these on their heads: Assume all datasets should have at least one entity with at least one attribute, at least one keyword, with all the usual best-practice metadata for each. This will dramatically lower the ADC's average score but will, relatively, improve records our data team has curated.

jeanetteclark commented 5 years ago

Yeah, I think this issue will be very difficult to get around without knowing what the contents of the data package actually are. You make some good points Bryce.

Regarding the ADC - it's true that most of our datasets need attributes, but not all of them. We get lots of images that attributes aren't relevant for, conference reports, social science data that is just a metadata record, etc.

Honestly, I think that the best middle ground might just be to make all of the entity/attribute checks optional. That way they are counted towards your score if you hit them, but you aren't unfairly penalized for passing all but one.

mbjones commented 5 years ago

Totally see the arguments being made. But, not looking for attributes is weak. They should be required for any type of tabular data, or vector GIS data that has a dbf file. Anything less is really problematic. And we should be able to come up with an algorithm based on the entity type that helps determine whether attributes should be required.

gothub commented 4 years ago

@mbjones @jeanetteclark @dmullen17 @bryce I'm updating the FAIR suite this week and would like to determine if and when checks should be skipped. To review, some of the checks look for certain elements and, if they are not present, the check is marked as 'SKIP' and is not included in the scoring calculation. The checks that perform skips are listed earlier in this issue.

So, any additional thoughts on:

mbjones commented 4 years ago

In general, I don't think we should ever SKIP any checks. Either a check represents a viable concept, or it doesn't. Any check that is not represented in a given dialect should fail rather than being skipped, but the concepts should be general concepts. At a minimum, every skipped check should at least be changed to optional. Here's my initial take on the checks you list above, although I didn't dive into the details on some of these so certainly open to discussion.

Findable

Interoperable

Reusable

[ These changes were made in commit 4d71c3c07d0de2750bee33f242d649c9448f10c8 ]

mbjones commented 4 years ago

Oh, and for the different entity types, I think attributes should be required for dataTable, but optional for the other entity types.

jeanetteclark commented 4 years ago

I mostly agree with Matt's comments above. I do have one question though about keywords. If we have the following two checks that are required:

dataset.keywords.controlled.xml REQUIRED dataset.naturalLanguageKeywords.present.xml REQUIRED

Does that mean a dataset will get dinged if it doesn't have BOTH controlled and natural language keywords? I'm not sure that is what we want.

gothub commented 4 years ago

Currently the quality engine doesn't support comparing the results of multiple checks, as each check outcome is determined independently of any other check. So yes, if both the controlled and natural-language keyword checks are REQUIRED, a dataset that has only one kind of keyword will get dinged, because the other check will fail.

It seems that we need to re-evaluate the concept of 'OPTIONAL' checks for FAIR.

mbjones commented 4 years ago

For the keywords question, I agree it's odd to require both. Maybe those checks should be refactored into two new checks:

[ this change was made in commit b86210bc8e938152dd1a4552fad171dc107699a1 and 4aa331177892133602ab1599533cc7714d429287 ]

jeanetteclark commented 4 years ago

I like that solution for the keywords @mbjones.

Regarding the attribute checks for different entities - I agree that they should be required for dataTable, they should maybe also be required for spatialVector.

I'm guessing this might mean that the series of entity.attribute... checks will need to be refactored into something like entity.dataTable.attribute... for each of: dataTable, otherEntity, spatialVector, spatialRaster, view, and storedProcedure, so 12 checks are going to turn into 72!
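The expansion being described could be sketched as below. The check names here are hypothetical; as discussed in this thread, this refactor was not actually adopted for the FAIR suite:

```python
# Hypothetical expansion of the entity.attribute* checks into
# per-entity-type variants (illustrative names only; this refactor
# was proposed for ADC but does not map onto ISO for the FAIR suite).

ENTITY_TYPES = ["dataTable", "otherEntity", "spatialVector",
                "spatialRaster", "view", "storedProcedure"]
NUM_ATTRIBUTE_CHECKS = 12  # the existing entity.attribute... series

def expand(check_name, entity_type):
    """e.g. entity.attributeUnits.present.xml ->
    entity.dataTable.attributeUnits.present.xml"""
    prefix, rest = check_name.split(".", 1)
    return f"{prefix}.{entity_type}.{rest}"

print(expand("entity.attributeUnits.present.xml", "dataTable"))
# entity.dataTable.attributeUnits.present.xml
print(NUM_ATTRIBUTE_CHECKS * len(ENTITY_TYPES))  # 72
```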

gothub commented 4 years ago

@mbjones refactoring the keyword checks as described makes sense.

@jeanetteclark regarding the proposed new check names - this could certainly make sense for the ADC suite, but tying the checks to particular EML data entity types for the FAIR suite doesn't work for ISO.

jeanetteclark commented 4 years ago

Ah yeah, that makes sense. Is it possible to implement optional vs required based on entity type as the checks are written now?

gothub commented 4 years ago

The 'required' vs 'optional' attribute of a check is set in the suite XML file and simply controls whether the check result is included in the score calculation. It's significant that this is set in the suite XML, because that makes it really easy to change (by me or someone else). The required-or-optional attribute cannot be set inside a check based on conditions that the check evaluates.

The 'SKIP' mechanism is implemented within a check. If the check determines that certain conditions are met, the check status is marked as 'SKIP' and that check result is not included in the quality score. Since SKIP is implemented in the check, it's harder to change.

So, I could test entity types and SKIP the attribute checks for certain entity types. SKIPping has the same effect as an optional check that fails.
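The two mechanisms described in this comment can be summarized in a small sketch. The function and field names are illustrative, not the real quality engine's API; only the decision logic follows what's described above:

```python
# Sketch of the two exclusion mechanisms described above
# (names are illustrative, not the actual quality engine API).
#
# 1) OPTIONAL is declared in the suite XML, outside the check: a failed
#    OPTIONAL check is reported as WARNING and excluded from the score.
# 2) SKIP is decided inside the check's own code: if its preconditions
#    aren't met (e.g. no attributes present), it returns SKIP and is
#    likewise excluded. Both exclusions have the same effect on the score.

def counts_toward_score(status, level):
    if status == "SKIP":  # decided inside the check itself
        return False
    if status == "FAILURE" and level == "OPTIONAL":  # set in suite XML
        return False
    return True

print(counts_toward_score("SKIP", "REQUIRED"))       # False
print(counts_toward_score("FAILURE", "OPTIONAL"))    # False
print(counts_toward_score("FAILURE", "REQUIRED"))    # True
```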

gothub commented 4 years ago

No further comments on this issue for quite a while, so closing.