NASA-IMPACT / pyQuARC

The pyQuARC tool reads and evaluates metadata records with a focus on the consistency and robustness of the metadata. pyQuARC flags opportunities to improve or add to contextual metadata information in order to help the user connect to relevant data products. pyQuARC also ensures that information common to both the data product and the file-level metadata is consistent and compatible. pyQuARC frees up human evaluators to make more sophisticated assessments such as whether an abstract accurately describes the data and provides the correct contextual information. The base pyQuARC package assesses descriptive metadata used to catalog Earth observation data products and files. As open source software, pyQuARC can be adapted and customized by data providers to allow for quality checks that evolve with their needs, including checking metadata not covered in the base package.
Apache License 2.0

possible bugfix #170

Closed CarsonDavis closed 2 years ago

CarsonDavis commented 2 years ago

Original Bug

The master branch raises a spurious error when validating that the GCMD short name and long name given for an item match a valid GCMD short/long name pair.

You can reproduce the error by running python main.py --format dif10 --fake FAKE

>> DIF/Platform/Instrument/Short_Name: 
        Error: The provided instrument short name `MODIS` and long name `Moderate-Resolution Imaging Spectroradiometer` aren't consistent.
        Please supply the corresponding long name for the short name.

This appears to be a false positive, as MODIS is in fact the Moderate-Resolution Imaging Spectroradiometer.

Bugfix

Although this bug appears on master, it does not appear on a recent feature branch, subbranch_of_feature/ummc_support. According to Jenny and Shelby, this branch has some problems and can't be used in its entirety. I don't know anything about pyQuARC, so I tried to trace back all the relevant code from the working branch and port it over.

However, after moving over all the code that was relevant, a new error appeared.

  File "/home/carson/github/pyQuARC/pyQuARC/code/gcmd_validator.py", line 125, in _create_hierarchy_dict
    GcmdValidator.merge_dicts(hierarchy_dict, row_dict)
  File "/home/carson/github/pyQuARC/pyQuARC/code/gcmd_validator.py", line 200, in merge_dicts
    parent[key], _ = GcmdValidator.merge_dicts(parent[key], child[key])
  File "/home/carson/github/pyQuARC/pyQuARC/code/gcmd_validator.py", line 199, in merge_dicts
    if parent.get(key):
AttributeError: 'str' object has no attribute 'get'

We can trace back this error to the following bit of code: https://github.com/NASA-IMPACT/pyQuARC/blob/d3995b025ffe106169af2e91eb3de7ba8e3e0fda/pyQuARC/code/gcmd_validator.py#L178-L193

What's happening is that some of the parent values are equal to the string this_is_the_leaf_node rather than a dict. Here is the output from printing the values just before executing parent.get(key).

parent= 'this_is_the_leaf_node'
child= {'SOUNDER DETECTOR 2': 'this_is_the_leaf_node'}

In this pull request, I have worked around this error with a guard that I am not confident is correct. https://github.com/NASA-IMPACT/pyQuARC/blob/d3995b025ffe106169af2e91eb3de7ba8e3e0fda/pyQuARC/code/gcmd_validator.py#L185-L186 Basically, I check whether the parent is a leaf and, if it is, return the values directly instead of calling parent.get(key).
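To make the failure and the guard concrete, here is a minimal sketch in Python. It is not the code in gcmd_validator.py. The function names merge_dicts_unguarded and merge_dicts_guarded, the simplified return value (the real merge_dicts appears to return a tuple), and the choice to keep the child subtree when the parent is a leaf are all assumptions made for illustration.

```python
LEAF = "this_is_the_leaf_node"  # sentinel string seen in the printed output above


def merge_dicts_unguarded(parent, child):
    """Recursively merge child into parent, assuming both are always dicts."""
    for key in child:
        if parent.get(key):
            parent[key] = merge_dicts_unguarded(parent[key], child[key])
        else:
            parent[key] = child[key]
    return parent


def merge_dicts_guarded(parent, child):
    """Same merge, but tolerate the leaf sentinel on either side."""
    if parent == LEAF:
        # Assumption: a parent stored earlier as a leaf is replaced by the child subtree.
        return child
    if child == LEAF:
        # Assumption: an existing subtree wins over a later leaf marker.
        return parent
    for key in child:
        if key in parent:
            parent[key] = merge_dicts_guarded(parent[key], child[key])
        else:
            parent[key] = child[key]
    return parent


# Hypothetical data shaped like the printed parent/child values above.
row = {"SOUNDER DETECTOR": {"SOUNDER DETECTOR 2": LEAF}}

try:
    merge_dicts_unguarded({"SOUNDER DETECTOR": LEAF}, row)
except AttributeError as exc:
    print("unguarded merge failed:", exc)

merged = merge_dicts_guarded({"SOUNDER DETECTOR": LEAF}, row)
print(merged)
```

The guard stops the crash, but it also hides the fact that the same key was recorded once as a leaf and once as a parent, which is the inconsistency raised under Concern.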

Concern

The workaround feels hacky, because I don't understand why a parent would be a leaf to begin with. Something is presumably wrong upstream, either in the logic that builds the hierarchy or in the input csv.

GNSS RECEIVER and SOUNDER DETECTOR both throw this error. I went into the csv file and discovered that I could make the errors disappear by updating the csv in certain places. https://github.com/NASA-IMPACT/pyQuARC/blob/d3995b025ffe106169af2e91eb3de7ba8e3e0fda/pyQuARC/schemas/instruments.csv#L508-L515 If you replace the empty quotes in line 507 with the long name Sounder Detector 1, this value is no longer flagged as a faulty leaf.
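If the csv is the culprit, a quick scan could list every entry that is both a leaf and a parent. The sketch below is only a diagnostic idea under stated assumptions: it assumes each row lists hierarchy levels left to right and that trailing empty cells mean the row stops early, which may not match how _create_hierarchy_dict actually reads instruments.csv. The function name find_leaf_parent_conflicts and the sample rows are made up for illustration.

```python
import csv
import io


def find_leaf_parent_conflicts(rows):
    """Return paths that end one row but are also a proper prefix of another row."""
    paths = set()
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Assumption: trailing empty cells mean the hierarchy stops early.
        while cells and not cells[-1]:
            cells.pop()
        if cells:
            paths.add(tuple(cells))

    conflicts = set()
    for path in paths:
        for depth in range(1, len(path)):
            if path[:depth] in paths:
                conflicts.add(path[:depth])
    return sorted(conflicts)


# Made-up sample rows, not copied from instruments.csv.
sample = io.StringIO(
    '"Sounders","SOUNDER DETECTOR",""\n'
    '"Sounders","SOUNDER DETECTOR","SOUNDER DETECTOR 2"\n'
    '"Imagers","MODIS","Moderate-Resolution Imaging Spectroradiometer"\n'
)
conflicts = find_leaf_parent_conflicts(csv.reader(sample))
print(conflicts)
```

Running something like this against the real file (with the column layout adjusted to match it) would show whether GNSS RECEIVER and SOUNDER DETECTOR are the only affected entries.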