Refactor URL extraction logic in url_validator.py

rajeshpandey2053 commented 6 months ago

Description:

This pull request addresses the discrepancy in URL recommendations between pyQuARC and QuARC.

Issue:

We are running pyQuARC in AWS Lambda functions to build the QuARC API. Lambda only supports a read-only file system. If someone attempts to write something to the Lambda, it throws an error. In our case, pyQuARC uses the urlextract package, which attempts to save some files in local storage for caching purposes, resulting in the error. The following is an extract from the class implemented in the pyQuARC. Initialize function for URLExtract class. Tries to get cached TLDs, if the cached file does not exist it will try to download the new list from IANA and save it to cache file.

Process:

To resolve this, the dependency on the urlextract package has been replaced with regex expressions for URL extraction, eliminating the file system dependency.

Testing :

Testing was done with the following list of concept ids with their respective formats ensuring we extract same list of URLs from a text using the regex expressions and urlextract package.

For further details : https://github.com/NASA-IMPACT/pyQuARC/issues/273

jenny-m-wood commented 6 months ago

Thanks for your changes. I noticed the following during testing:

When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? See screenshot of QuARC dev output:
When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. See screenshot of QuARC dev output: See screenshot of pyQuARC fix_check_url output:

xhagrg commented 5 months ago

Did you check https://github.com/lipoja/URLExtract/issues/61 @rajeshpandey2053? We should be using packages when possible.

rajeshpandey2053 commented 5 months ago

Thanks for your changes. I noticed the following during testing:

When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? See screenshot of QuARC dev output:

When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. See screenshot of QuARC dev output: See screenshot of pyQuARC fix_check_url output:

The first one is due to the dev environment of quARC having a master branch of pyQuARC. I have updated it to the dev branch.
Second discrepancy is due to we have used python 3.8 as a lambda runtime version in quARC. And it uses the old version of OpenSSL. Created new ticket to resolve this issue

xhagrg commented 5 months ago

@rajeshpandey2053 has this been tested? If yes, LGTM.

rajeshpandey2053 commented 5 months ago

@rajeshpandey2053 has this been tested? If yes, LGTM.

Yes, it has been tested as well. Thank you will merge it then

NASA-IMPACT / pyQuARC