PASTAplus / seo

Generate schema.org metadata from PASTA+ data package metadata
Apache License 2.0

Remove URI string literals from schema_org.py #32

Closed. servilla closed this issue 1 month ago

servilla commented 2 months ago

URI string literals in schema_org.py assume a particular environment. For example:

    url = ("https://portal.edirepository.org/nis/mapbrowse?scope=" + scope +
           "&identifier=" + identifier + "&revision=" + revision)

or

    "contentUrl": (
        "https://pasta.lternet.edu/package/metadata/eml/" +
        file_name.split(".")[0] + "/" +
        file_name.split(".")[1] + "/" +
        file_name.split(".")[2]),

These URI string literals have to be edited by hand whenever SEO is deployed in any environment other than the production EDI repository. It would be better to identify the correct environment automatically or to set the proper values in the configuration file, config.py.
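As a rough illustration of the config-file option, here is a minimal sketch that builds both URLs from base values instead of literals. The names PORTAL_BASE, PASTA_BASE, landing_page_url, and eml_content_url are hypothetical and do not exist in the repository; the defaults simply mirror the production hosts quoted above.

```python
from urllib.parse import urlencode

# Hypothetical settings that would live in config.py (names are assumptions).
# The defaults mirror the production hosts currently hardcoded in schema_org.py.
PORTAL_BASE = "https://portal.edirepository.org"
PASTA_BASE = "https://pasta.lternet.edu"


def landing_page_url(scope, identifier, revision, portal_base=PORTAL_BASE):
    """Build the portal landing-page URL for a data package."""
    query = urlencode(
        {"scope": scope, "identifier": identifier, "revision": revision}
    )
    return f"{portal_base}/nis/mapbrowse?{query}"


def eml_content_url(file_name, pasta_base=PASTA_BASE):
    """Build the EML metadata URL from a name such as 'scope.identifier.revision.xml'."""
    # Split once rather than three times; raises ValueError if fewer than 3 parts.
    scope, identifier, revision = file_name.split(".")[:3]
    return f"{pasta_base}/package/metadata/eml/{scope}/{identifier}/{revision}"
```

Deploying to another tier would then only require changing the two base values (or passing different ones), not editing the URL-building code.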

clnsmth commented 2 months ago

Hi @servilla,

I've been looking into the issue of automatically identifying the correct environment and came up with a couple of ideas:

  1. Client-provided Environment Argument: We can check whether the client passes an env argument in the HTTP GET request. If it does, the value is available from the Flask request object in webapp/run.py at line 36: https://github.com/PASTAplus/seo/blob/a1a745cfeff2062588a409d9fdbf21e1faede501/webapp/run.py#L36 The value should match one of the recognized environment names in the if/else logic of webapp/schema_org.py starting at line 35: https://github.com/PASTAplus/seo/blob/a1a745cfeff2062588a409d9fdbf21e1faede501/webapp/schema_org.py#L35 Since the current logic isn't working, the env argument is probably missing. In that case the default value of None is used, which ultimately sets the environment to production.

  2. Server-side Environment Inference: We could infer the environment based on the request origin. This could involve:

    • Checking the Referer header in the request object.
    • Inspecting the IP address of the client and mapping it to the corresponding tier based on whitelisted IP addresses defined in the configuration file.
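The two ideas above could be combined roughly as follows: prefer an explicit env argument, then fall back to the Referer host. This is only a sketch under stated assumptions. The environment names, the non-production host names, and the function infer_environment are all hypothetical; only portal.edirepository.org comes from the issue text.

```python
from urllib.parse import urlparse

# Hypothetical environment names; the real ones are whatever schema_org.py recognizes.
KNOWN_ENVIRONMENTS = {"production", "staging", "development"}

# Hypothetical mapping from Referer host to environment.
# Only the production host is taken from the issue; the others are placeholders.
REFERER_HOSTS = {
    "portal.edirepository.org": "production",
    "portal-s.example.org": "staging",
    "portal-d.example.org": "development",
}


def infer_environment(args, headers, default="production"):
    """Return an environment name from the env argument or the Referer host."""
    # 1. An explicit, recognized env argument wins.
    env = args.get("env")
    if env in KNOWN_ENVIRONMENTS:
        return env
    # 2. Otherwise look up the Referer host; a missing header yields hostname None.
    host = urlparse(headers.get("Referer", "")).hostname
    return REFERER_HOSTS.get(host, default)
```

In a Flask view this could be called as infer_environment(request.args, request.headers), since both objects provide a dict-style get. The Referer header is optional and client-controlled, so it is best treated as a hint rather than a reliable signal.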

I'd appreciate your thoughts on these proposed solutions. Do you have any preferences or concerns?

clnsmth commented 2 months ago

After discussing with @servilla, we've clarified a misunderstanding about the scope of this issue.

The issue isn't that the client omits the env argument in the GET request; that argument is passed and used correctly. Instead, it concerns the portal and pasta environments hardcoded within the convert_eml_to_schema_org function, starting at lines #112 and #137, respectively.

To address this, we'll be making the following changes: