Closed rajeshpandey2053 closed 5 months ago
Thanks for your changes. I noticed the following during testing:
- When testing C2433571719-CDDIS (umm-c), there is a data format error present in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. "SINEX" is a valid GCMD keyword, so no error should be present. Perhaps the GCMD keywords for QuARC are out of date? See screenshot of QuARC dev output:
- When testing C2103888967-LARC (dif10), there is an extra broken URL specified in the QuARC dev environment output that is not present in pyQuARC's fix_check_url branch output. It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. See screenshot of QuARC dev output: See screenshot of pyQuARC fix_check_url output:
Did you check https://github.com/lipoja/URLExtract/issues/61 @rajeshpandey2053? We should be using packages when possible.
@rajeshpandey2053 has this been tested? If yes, LGTM.
Yes, it has been tested as well. Thank you, I will merge it then.
Description:
This pull request addresses the discrepancy in URL recommendations between pyQuARC and QuARC.
Issue:
We are running pyQuARC in AWS Lambda functions to build the QuARC API. Lambda only supports a read-only file system, so any attempt to write to it throws an error. In our case, pyQuARC uses the `urlextract` package, which attempts to save some files in local storage for caching purposes, resulting in the error. The following is an extract from the documentation of the initialization of the URLExtract class used in pyQuARC: "Tries to get cached TLDs, if the cached file does not exist it will try to download the new list from IANA and save it to cache file."
Process:
To resolve this, the dependency on the `urlextract` package has been replaced with regular expressions for URL extraction, eliminating the file system dependency.
Testing:
Testing was done with a list of concept IDs and their respective formats, confirming that the regular expressions and the `urlextract` package extract the same list of URLs from a given text.
For further details: https://github.com/NASA-IMPACT/pyQuARC/issues/273
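For illustration, the regex-based extraction described under Process could look roughly like the sketch below. The pattern, the `extract_urls` helper, and the trailing-punctuation handling are assumptions made for this example and are not the code merged into pyQuARC. The sketch only recognizes URLs that start with an explicit scheme, and it never touches the file system.

```python
import re

# Hypothetical pattern for this sketch (not the one used in pyQuARC):
# an http, https, or ftp scheme followed by any run of characters that
# are not whitespace, angle brackets, or double quotes.
URL_PATTERN = re.compile(r'(?:https?|ftp)://[^\s<>"]+', re.IGNORECASE)

# Characters that usually belong to the surrounding sentence, not the URL.
TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def extract_urls(text):
    """Return the unique URLs found in text, in order of first appearance."""
    urls = []
    seen = set()
    for match in URL_PATTERN.finditer(text):
        # Strip sentence punctuation that the pattern picked up at the end.
        url = match.group(0).rstrip(TRAILING_PUNCTUATION)
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = (
        "It is https://www.atmosp.physics.utoronto.ca/MOPITT/home.html. "
        "Details: https://github.com/NASA-IMPACT/pyQuARC/issues/273"
    )
    for url in extract_urls(sample):
        print(url)
```

Because nothing is cached or downloaded, a function like this can run on a read-only file system. The trade-off in this sketch is that scheme-less links such as `www.example.com` are not matched, so a comparison against the previous extractor's output on real metadata records remains worthwhile.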