CDLUC3 / dmsp_api_prototype

The new DMPTool API (formerly the DMPHub)

COKI - create script to fetch all outputs found so far #15

Closed. briri closed this issue 1 month ago

briri commented 2 months ago

Jamie has requested that I pull the list of all research outputs we've found so far from the DataCite harvester.

I am going to create a script to do this as it seems like it will be useful long term.

briri commented 2 months ago

Need to set up a Lambda to generate this on a weekly basis.

Will need to place the resulting dataset into an S3 bucket so that it can be transferred to Google Cloud. Use the JSON Lines format.
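For illustration, a minimal sketch of that step in Python with boto3, writing the dataset as JSON Lines (one JSON object per line); the bucket name, key, and record fields here are hypothetical, not the actual harvester schema:

import json

import boto3

s3 = boto3.client("s3")

def write_outputs_to_s3(records, bucket, key):
    # Serialize each record as one JSON object per line (JSON Lines) and upload.
    body = "\n".join(json.dumps(record) for record in records) + "\n"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

# Hypothetical usage:
# write_outputs_to_s3(records, "dmptool-transfer-bucket", "datacite/outputs.jsonl")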

Need to investigate which is more cost-effective:

briri commented 2 months ago

We are going to explore using AWS pre-signed URLs for S3. We have other teams here that use this technology to allow external users/systems to fetch objects from (and push objects to) our S3 buckets.

@jdddog you mentioned that you are working within a python environment. Would you be able to call an API endpoint from within your code to retrieve these URLs?

I have an API that I can add an endpoint to that would allow you to fetch a presigned URL on demand. The presigned URL could then be used to write the DataCite, Crossref, and OpenAlex files to.

I could then also add an endpoint that would allow you to fetch the latest DMP metadata from us as well.

briri commented 2 months ago

Had some conversations here and it sounds like the presigned URL route will work well.

I will need to:

Example Lambda to generate a presigned URL on demand:

require 'aws-sdk-s3'

# Sinatra-style handler: verify the object exists, then redirect the caller
# to a time-limited presigned GET URL for it.
def get_file(key)
  @s3_client = Aws::S3::Client.new(region: ENV.fetch('AWS_REGION', nil))
  @presigner = Aws::S3::Presigner.new(client: @s3_client)
  bucket_name = ENV.fetch('BUCKET_NAME', nil)

  # Return a 404 if the requested key does not exist in the bucket.
  begin
    @s3_client.head_object(bucket: bucket_name, key: key)
  rescue Aws::S3::Errors::NotFound
    halt 404, "Object \"#{key}\" not found in S3 bucket \"#{bucket_name}\"\n"
  end

  # Generate the presigned URL and send the caller there via a 303 redirect.
  url, _headers = @presigner.presigned_request(:get_object, bucket: bucket_name, key: key)
  if url
    response.headers['Location'] = url
    status 303
    'success: redirecting'
  end
end

Python example of fetching a presigned URL from the endpoint above and then downloading from (or uploading to) it:
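Something along these lines should work (the base URL, endpoint path, and auth header are placeholders, not the actual API details):

import os

import requests

# Placeholders; the real values would come from the DMPTool API documentation.
API_BASE = os.environ.get("DMPTOOL_API_BASE", "https://api.example.org")
API_TOKEN = os.environ.get("DMPTOOL_API_TOKEN", "")

def fetch_presigned_url(key):
    # Ask the API for a presigned URL without following the 303 redirect,
    # then read the URL from the Location header.
    resp = requests.get(
        f"{API_BASE}/files/{key}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        allow_redirects=False,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.headers["Location"]

def download(key, local_path):
    # Stream the object behind the presigned URL to a local file.
    with requests.get(fetch_presigned_url(key), stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

def upload(local_path, presigned_put_url):
    # Write a local file (e.g. a datacite/crossref/openalex match file)
    # to a presigned PUT URL.
    with open(local_path, "rb") as f:
        requests.put(presigned_put_url, data=f, timeout=300).raise_for_status()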

jdddog commented 1 month ago

Hey @briri, we can use pre-signed URLs if you would prefer; it's no problem to call an API from our Apache Airflow environment.

Some thoughts:

I wonder if it would be simpler to read and write files from the bucket directly with the boto3 S3 Python package; it would require less setup and would be more flexible when uploading an unknown number of files for each dataset. The DMP export files could live under a read-only path and the dataset match files under a write-only path, each configured via an IAM role. The trade-off is that it wouldn't offer as much fine-grained access control as presigned URLs. A rough sketch of what I mean is below.
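Something like this, for example (the bucket name and prefixes are placeholders):

import boto3

s3 = boto3.client("s3")
BUCKET = "dmptool-transfer-bucket"  # placeholder name

# Read the latest DMP export from the read-only prefix.
obj = s3.get_object(Bucket=BUCKET, Key="dmps/dmps-latest.jsonl")
dmp_records = obj["Body"].read().decode("utf-8").splitlines()

# Write a dataset match file to the write-only prefix.
with open("matches-datacite.jsonl", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key="matches/matches-datacite.jsonl", Body=f)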

Let me know what you think.

briri commented 1 month ago

Hi @jdddog. From my conversations with my colleagues here, the presigned URL option can handle a file up to 5 GB in size before you need to break it apart, so we shouldn't have an issue there. I can add logic to process/combine the parts back together on my end, roughly as sketched below.
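Assuming the parts are JSON Lines files, recombining them is just concatenation; a rough sketch with hypothetical file names:

from pathlib import Path

# Hypothetical names: parts of one dataset that were split to stay under the size limit.
parts = sorted(Path("incoming").glob("matches-datacite-part*.jsonl"))

with open("matches-datacite.jsonl", "wb") as combined:
    for part in parts:
        combined.write(part.read_bytes())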

Direct read/write access to the S3 bucket is not a preferred approach for us here, so we will need to stick with the presigned URL route.