CDLUC3 / dmsp_api_prototype

The new DMPTool API (formerly the DMPHub)

COKI - create script to fetch all outputs found so far #15

Closed. briri closed this issue 1 month ago

briri commented 2 months ago

Jamie has requested that I pull the list of all research outputs we've found so far from the DataCite harvester.

I am going to create a script to do this as it seems like it will be useful long term.

briri commented 2 months ago

Need to set up a Lambda to generate this on a weekly basis.

Will need to place the resulting dataset into an S3 bucket so that it can be transferred to Google Cloud. Use the JSON Lines format.
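For illustration, a minimal sketch of that step in Python with boto3, writing the dataset as JSON Lines (one JSON object per line); the bucket name, key, and record fields here are hypothetical, not the actual harvester schema:

import json

import boto3

s3 = boto3.client("s3")

def write_outputs_to_s3(records, bucket, key):
    # Serialize each record as one JSON object per line (JSON Lines) and upload.
    body = "\n".join(json.dumps(record) for record in records) + "\n"
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))

# Hypothetical usage:
# write_outputs_to_s3(records, "dmptool-transfer-bucket", "datacite/outputs.jsonl")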

Need to investigate which is more cost-effective:

briri commented 2 months ago

We are going to explore using AWS pre-signed URLs for S3. We have other teams here that use this technology to allow external users/systems to fetch objects from (and push objects to) our S3 buckets.

@jdddog you mentioned that you are working within a python environment. Would you be able to call an API endpoint from within your code to retrieve these URLs?

I have an API that I can add an endpoint to that would allow you to fetch a presigned URL on demand. The presigned URL could then be used to write the DataCite, Crossref, and OpenAlex files to.

I could then also add an endpoint that would allow you to fetch the latest DMP metadata from us as well.

briri commented 2 months ago

Had some conversations here and it sounds like the presigned URL route will work well.

I will need to:

Example Lambda to generate a presigned URL on demand:

require 'aws-sdk-s3'

# Sinatra-style handler: verify the object exists, then redirect the caller
# to a time-limited presigned GET URL for it.
def get_file(key)
  @s3_client = Aws::S3::Client.new(region: ENV.fetch('AWS_REGION', nil))
  @presigner = Aws::S3::Presigner.new(client: @s3_client)
  bucket_name = ENV.fetch('BUCKET_NAME', nil)

  # Return a 404 if the requested key does not exist in the bucket.
  begin
    @s3_client.head_object(bucket: bucket_name, key: key)
  rescue Aws::S3::Errors::NotFound
    halt 404, "Object \"#{key}\" not found in S3 bucket \"#{bucket_name}\"\n"
  end

  # Generate the presigned URL and send the caller there via a 303 redirect.
  url, _headers = @presigner.presigned_request(:get_object, bucket: bucket_name, key: key)
  if url
    response.headers['Location'] = url
    status 303
    'success: redirecting'
  end
end

Python example of fetching a presigned URL from the endpoint above and then downloading from (or uploading to) it:
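Something along these lines should work (the base URL, endpoint path, and auth header are placeholders, not the actual API details):

import os

import requests

# Placeholders; the real values would come from the DMPTool API documentation.
API_BASE = os.environ.get("DMPTOOL_API_BASE", "https://api.example.org")
API_TOKEN = os.environ.get("DMPTOOL_API_TOKEN", "")

def fetch_presigned_url(key):
    # Ask the API for a presigned URL without following the 303 redirect,
    # then read the URL from the Location header.
    resp = requests.get(
        f"{API_BASE}/files/{key}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        allow_redirects=False,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.headers["Location"]

def download(key, local_path):
    # Stream the object behind the presigned URL to a local file.
    with requests.get(fetch_presigned_url(key), stream=True, timeout=300) as resp:
        resp.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

def upload(local_path, presigned_put_url):
    # Write a local file (e.g. a datacite/crossref/openalex match file)
    # to a presigned PUT URL.
    with open(local_path, "rb") as f:
        requests.put(presigned_put_url, data=f, timeout=300).raise_for_status()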

jdddog commented 1 month ago

Hey @briri, we can use pre-signed URLs if you would prefer; it's no problem to call an API from our Apache Airflow environment.

Some thoughts:

I wonder if it would be simpler to read and write files from the bucket directly with the boto3 S3 Python package; it would require less setup and would be more flexible when uploading an unknown number of files for each dataset. The DMP export files could live under a read-only path and the dataset match files under a write-only path, each configured via an IAM role. The trade-off is that it wouldn't offer as much fine-grained access control as presigned URLs. A rough sketch of what I mean is below.
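Something like this, for example (the bucket name and prefixes are placeholders):

import boto3

s3 = boto3.client("s3")
BUCKET = "dmptool-transfer-bucket"  # placeholder name

# Read the latest DMP export from the read-only prefix.
obj = s3.get_object(Bucket=BUCKET, Key="dmps/dmps-latest.jsonl")
dmp_records = obj["Body"].read().decode("utf-8").splitlines()

# Write a dataset match file to the write-only prefix.
with open("matches-datacite.jsonl", "rb") as f:
    s3.put_object(Bucket=BUCKET, Key="matches/matches-datacite.jsonl", Body=f)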

Let me know what you think.

briri commented 1 month ago

Hi @jdddog. From my conversations with my colleagues here, the presigned URL option can handle a file up to 5 GB in size before you need to break it apart, so we shouldn't have an issue there. I can add logic to process/combine the parts back together on my end, roughly as sketched below.
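Assuming the parts are JSON Lines files, recombining them is just concatenation; a rough sketch with hypothetical file names:

from pathlib import Path

# Hypothetical names: parts of one dataset that were split to stay under the size limit.
parts = sorted(Path("incoming").glob("matches-datacite-part*.jsonl"))

with open("matches-datacite.jsonl", "wb") as combined:
    for part in parts:
        combined.write(part.read_bytes())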

Direct read/write access to the S3 bucket is not a preferred approach for us here, so we will need to stick with the presigned URL route.