NASA-PDS / data-upload-manager

Data Upload Manager (DUM) component for managing the interface for data uploads to the Planetary Data Cloud from Data Providers and PDS Nodes.
https://nasa-pds.github.io/data-upload-manager
Apache License 2.0

As a user, I want to skip upload of files that are already in the Registry #99

Open jordanpadams opened 5 months ago

jordanpadams commented 5 months ago

Checked for duplicates

Yes - I've already checked

🧑‍🔬 User Persona(s)

Node Operator

💪 Motivation

...so that I do not try to reload the data

📖 Additional Details

No response

Acceptance Criteria

Given When I perform Then I expect

⚙️ Engineering Details

The easiest way to do this would be to search the registry by file path, by checksum, or both? We could do this with the LID/LIDVID, but I think that would add significant overhead.

Do we want to figure out some sort of auto-generated UUID for every file we upload to the cloud and add this as metadata? Maybe this is something we could then store in the Nucleus database and eventually in the registry. It could serve as a link throughout the whole system, independent of the LIDVIDs of the products themselves.

tloubrieu-jpl commented 4 months ago

From the breakout meeting today: the API needs to provide a simple end-point for DUM to retrieve files. The critical information to provide is:

file name/path, discipline node, md5sum

We are not sure yet which key the API should use for the lookup, either:

option 1: node + file path → returns md5sum, lidvid
option 2: md5sum → returns node, file path and lidvid

With option 1, a new end-point for the API could be:

/files/{node}/{file_path} which would return the {md5sum} 

The issue is that the file_path is not always the same between the staging bucket and the location where the file is eventually archived.

@ramesh-maddegoda, @viviant100 could you investigate how the path on the archive bucket (ODR) is created from the path in the staging bucket?

tloubrieu-jpl commented 4 months ago

As discussed today, I will create a ticket for an end-point in the API: /files/{md5sum}, which would return 200 or 404.

We'll make that part of the Registry API.
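To make the proposal concrete, here is a minimal client-side sketch in Python of how DUM might call such an end-point. Several things here are assumptions and not confirmed in this thread: the base URL `REGISTRY_API` is a placeholder, the request is a plain unauthenticated GET, and 200 is read as "already in the registry" while 404 is read as "not found". The helper names `md5_of_file`, `is_registered`, and `lookup_md5` are hypothetical.

```python
import hashlib
import urllib.error
import urllib.request

# Placeholder base URL (assumption); the real Registry API host is not given here.
REGISTRY_API = "https://registry.example.invalid/api"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_registered(status_code):
    """Interpret the proposed end-point's status code (assumed meaning)."""
    if status_code == 200:
        return True   # assumed: a file with this md5sum is already registered
    if status_code == 404:
        return False  # assumed: no file with this md5sum is known
    raise RuntimeError(f"unexpected status from registry: {status_code}")


def lookup_md5(md5sum, base_url=REGISTRY_API, timeout=10):
    """Query the proposed /files/{md5sum} end-point; True if the file is known."""
    request = urllib.request.Request(f"{base_url}/files/{md5sum}", method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return is_registered(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses, so 404 arrives here.
        return is_registered(err.code)
```

If the end-point ends up behind an authenticated gateway, the request would also need whatever credentials that gateway requires; the sketch leaves that out because it is still an open question.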

tloubrieu-jpl commented 4 months ago

@collinss-jpl can you confirm that the approach above works for you?

collinss-jpl commented 4 months ago

@tloubrieu-jpl Yes I think that would work. Does the Registry API use API Gateway though? Will the DUM client need to provide an authentication token with the request to the new endpoint?

jordanpadams commented 1 month ago

@collinss-jpl @tloubrieu-jpl just want to check on the status of this. Has this been implemented, and can it be tested at least locally?

For DUM, we should just make sure we throw a warning when the API is down, but then keep going through the processing so we aren't blocking the workflow when we have system downtime.
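A rough sketch of that warn-and-continue behavior in Python, under the assumption that the registry check is wrapped in some callable (here the hypothetical `lookup` parameter) that returns True when the file is already registered and raises when the API cannot be reached. The function name `should_upload` and the logger name are illustrative only, not existing DUM code.

```python
import logging

logger = logging.getLogger("dum.skip_check")  # hypothetical logger name


def should_upload(path, md5sum, lookup):
    """Decide whether a file should be uploaded.

    `lookup` is any callable that takes an md5sum and returns True when the
    registry already has the file. If it raises for any reason (API down,
    timeout, unexpected status), log a warning and upload anyway so that
    registry downtime never blocks the workflow.
    """
    try:
        already_registered = lookup(md5sum)
    except Exception as err:  # broad on purpose: any failure means "proceed"
        logger.warning(
            "Registry check failed for %s (%s); uploading anyway", path, err
        )
        return True
    if already_registered:
        logger.info("Skipping %s: already in the registry", path)
        return False
    return True
```

The trade-off is that files may be re-uploaded while the API is unavailable, which seems preferable to halting ingestion during system downtime.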