NASA-IMPACT / pyQuARC

The pyQuARC tool reads and evaluates metadata records with a focus on the consistency and robustness of the metadata. pyQuARC flags opportunities to improve or add to contextual metadata information in order to help the user connect to relevant data products. pyQuARC also ensures that information common to both the data product and the file-level metadata is consistent and compatible. pyQuARC frees up human evaluators to make more sophisticated assessments, such as whether an abstract accurately describes the data and provides the correct contextual information. The base pyQuARC package assesses descriptive metadata used to catalog Earth observation data products and files. As open source software, pyQuARC can be adapted and customized by data providers to allow for quality checks that evolve with their needs, including checking metadata not covered in the base package.
Apache License 2.0

Refactor URL extraction logic in url_validator.py #278

Closed rajeshpandey2053 closed 5 months ago

rajeshpandey2053 commented 6 months ago

Description:

This pull request addresses the discrepancy in URL recommendations between pyQuARC and QuARC.

Issue:

We are running pyQuARC in AWS Lambda functions to build the QuARC API. Lambda's file system is read-only (only /tmp is writable), so any attempt to write elsewhere raises an error. In our case, pyQuARC uses the urlextract package, which tries to save files to local storage for caching purposes, resulting in that error. The docstring of the URLExtract initializer describes the behavior: "Initialize function for URLExtract class. Tries to get cached TLDs, if the cached file does not exist it will try to download the new list from IANA and save it to cache file."

Process:

To resolve this, the dependency on the urlextract package has been replaced with regular expressions for URL extraction, eliminating the file system dependency.
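As a rough illustration of the approach described above, the sketch below extracts URLs with the standard library only, so nothing is written to disk. The pattern, the `TRAILING_PUNCTUATION` constant, and the `extract_urls` name are hypothetical and chosen for this example; the expression actually used in url_validator.py may differ.

```python
import re

# Hypothetical pattern for illustration only; not taken from the PR.
# Matches http:// or https:// followed by characters that are not
# whitespace, quotes, angle brackets, or other brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"'()\[\]{}]+", re.IGNORECASE)

# Punctuation that often follows a URL in prose but is rarely part of it.
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text):
    """Return the unique http(s) URLs found in text, in order of appearance."""
    urls = []
    for match in URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url and url not in urls:
            urls.append(url)
    return urls
```

Unlike urlextract, a pattern of this kind only finds URLs that carry an explicit scheme and it cuts off URLs containing parentheses, so results can differ on metadata that lists bare domains or such URLs.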

Testing:

Testing was done with the following list of concept IDs and their respective formats, ensuring that the regular expressions and the urlextract package extract the same list of URLs from a given text.
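A comparison of this kind could be sketched as below. The `compare_extractions` helper and its result keys are hypothetical names for this example, not part of pyQuARC; it accepts any two callables that take a text and return a list of URLs.

```python
def compare_extractions(text, extract_a, extract_b):
    """Run two URL extractors on the same text and report where they differ."""
    found_a = set(extract_a(text))
    found_b = set(extract_b(text))
    return {
        "only_a": sorted(found_a - found_b),  # URLs found only by the first extractor
        "only_b": sorted(found_b - found_a),  # URLs found only by the second extractor
        "match": found_a == found_b,
    }

# Possible usage, assuming urlextract is installed and regex_extractor is
# the regex-based function under test:
#   from urlextract import URLExtract
#   compare_extractions(text, regex_extractor, URLExtract().find_urls)
```

Comparing sets ignores ordering and duplicates, which keeps the check focused on whether both approaches surface the same URLs.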

For further details: https://github.com/NASA-IMPACT/pyQuARC/issues/273

jenny-m-wood commented 6 months ago

Thanks for your changes. I noticed the following during testing:

xhagrg commented 5 months ago

Did you check https://github.com/lipoja/URLExtract/issues/61 @rajeshpandey2053? We should be using packages when possible.

rajeshpandey2053 commented 5 months ago

Thanks for your changes. I noticed the following during testing:

  • When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? (Screenshot of QuARC dev output: Screenshot 2024-04-08 at 1 37 24 PM)
  • When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. (Screenshot of QuARC dev output: Screenshot 2024-04-08 at 2 22 26 PM; screenshot of pyQuARC fix_check_url output: Screenshot 2024-04-08 at 2 23 17 PM)
xhagrg commented 5 months ago

@rajeshpandey2053 has this been tested? If yes, LGTM.

rajeshpandey2053 commented 5 months ago

Yes, it has been tested as well. Thank you, I will merge it then.