NASA-IMPACT / pyQuARC

The pyQuARC tool reads and evaluates metadata records with a focus on the consistency and robustness of the metadata. pyQuARC flags opportunities to improve or add to contextual metadata information in order to help the user connect to relevant data products. pyQuARC also ensures that information common to both the data product and the file-level metadata is consistent and compatible. pyQuARC frees up human evaluators to make more sophisticated assessments, such as whether an abstract accurately describes the data and provides the correct contextual information. The base pyQuARC package assesses descriptive metadata used to catalog Earth observation data products and files. As open source software, pyQuARC can be adapted and customized by data providers to allow for quality checks that evolve with their needs, including checking metadata not covered in the base package.
Apache License 2.0

QuARC does not pull in current master branch from pyQuARC repo #273

Closed jenny-m-wood closed 1 month ago

jenny-m-wood commented 5 months ago

Describe the bug

QuARC does not pull in the current master branch of the pyQuARC repo. It is pulling in an outdated version, so the recommendations do not align.

To Reproduce

Steps to reproduce the behavior:

  1. Go to QuARC and test using C1577484501-LARC_ASDC (umm-c)
  2. Run pyQuARC master branch on the record above
  3. See the discrepancies between the recommendations that pyQuARC's master branch provides and those that QuARC provides

Expected behavior

Expect the output from pyQuARC's master branch and the output from QuARC to be identical.

Additional context

A brief investigation suggested that this may be an issue with the Lambda function. An issue in QuARC was created for this in 2023 and may still exist.

jenny-m-wood commented 4 months ago

@xhagrg @slesaad ESDIS is promoting the use of QuARC for new missions such as PACE and NISAR, so please prioritize this ticket during pyQuARC development. Thanks!

jenny-m-wood commented 3 months ago

Alternative example record: C2068391958-LARC_ASDC (format: umm-c)

jenny-m-wood commented 3 months ago

I compared pyQuARC output and QuARC output, and it looks much better! Thank you so much for making those updates @rajeshpandey2053

I did notice that the URL recommendations from pyQuARC were missing from QuARC. When testing on C2103888967-LARC (dif10), this recommendation was provided by pyQuARC: [screenshot, 2024-04-03]

QuARC, however, did not provide that recommendation, and this error message was shown: [screenshot, 2024-04-03]

I noticed something similar when testing with G1001367981-LARC (echo-g). Any thoughts on why this may be happening or the next steps for resolving?

slesaad commented 3 months ago

@jenny-m-wood, @rajeshpandey2053 is working on identifying the cause and then fixing it.

rajeshpandey2053 commented 3 months ago

Issue

We are running pyQuARC in AWS Lambda functions to build the QuARC API. Lambda's file system is read-only apart from the /tmp directory, so an attempt to write anywhere else throws an error. In our case, pyQuARC uses the urlextract package, which attempts to save some files in local storage for caching purposes, resulting in the error. From the urlextract docstring for the URLExtract initializer:

"Initialize function for URLExtract class. Tries to get cached TLDs, if cached file does not exist it will try to download new list from IANA and save it to cache file."
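One possible stopgap, sketched here under assumptions rather than taken from the pyQuARC code, is to point the cache at a directory that is known to accept writes. Lambda provides a writable /tmp, which is normally what `tempfile.gettempdir()` resolves to there. The helper name `pick_writable_cache_dir` is hypothetical, and whether urlextract can be told to use the returned directory (for example through a `cache_dir` keyword argument) is an assumption that should be checked against the installed version.

```python
import tempfile


def pick_writable_cache_dir(candidates=None):
    """Return the first directory that accepts a test write, or None.

    Hypothetical helper: `candidates` defaults to the system temp directory.
    """
    if candidates is None:
        candidates = [tempfile.gettempdir()]
    for directory in candidates:
        try:
            # Create and immediately delete a real file; this is a more
            # reliable check than inspecting permission bits.
            with tempfile.NamedTemporaryFile(dir=directory):
                pass
        except OSError:
            # Read-only, missing, or otherwise unusable directory.
            continue
        return directory
    return None


if __name__ == "__main__":
    cache_dir = pick_writable_cache_dir()
    print("Writable cache directory:", cache_dir)
    # Assumption, not verified here: the result could then be handed to the
    # URL extraction library if it exposes a cache-directory setting.
```

Files in a temporary directory do not persist reliably between invocations, so the cached list may be downloaded again on a cold start.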

Solution

We need to find an alternative to the urlextract package that does not rely on writing to the file system.
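As an illustration of what such an alternative could look like, here is a minimal sketch that uses only the standard library `re` module and never touches the file system. It is not the fix that was adopted. The function name `extract_urls` and the pattern are assumptions made for this example. Unlike urlextract, which recognizes URLs by their top-level domain, this version only finds URLs that start with http:// or https://.

```python
import re

# Match an explicit http(s) scheme followed by any run of characters that
# are not whitespace, angle brackets, or quotes.
_URL_PATTERN = re.compile(r"""https?://[^\s<>"']+""", re.IGNORECASE)

# Punctuation that commonly follows a URL in prose and is not part of it.
_TRAILING_PUNCTUATION = ".,;:!?)]}"


def extract_urls(text):
    """Return the unique http(s) URLs in `text`, in order of appearance."""
    urls = []
    for match in _URL_PATTERN.finditer(text):
        url = match.group(0).rstrip(_TRAILING_PUNCTUATION)
        if url not in urls:
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = "See https://example.com/data, and http://example.org/a?b=1."
    print(extract_urls(sample))
    # prints ['https://example.com/data', 'http://example.org/a?b=1']
```

The trade-off is coverage: scheme-less references such as "example.com/data" are missed, so the recommendations could differ from those produced with urlextract.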