Include the last modified date in the bulk download files

mparsons-ce commented 1 year ago

@excelsior Would it be possible to include the last modified date from the envelope in the files included in the bulk download? re: GET {community}/envelopes/download.

An additional "feature" could be to append the last modified date (yyyy-mm-dd) to the file name (after the CTID).

excelsior commented 1 year ago

@mparsons-ce I've appended the last modified date to the filenames, but I'm not sure how to include it in the files themselves. Those files contain only the envelopes' raw JSON-LD payloads, so adding extraneous properties to them doesn't feel right.

siuc-nate commented 1 year ago

Would it work (for both of you) to include an additional file that contains meta information about the other files in the download? A simple structure like

[
  { "ctid": "ce-abcdef", "updated_at": "2023-01-01-etc" },
  { "ctid": "ce-ghijklm", "updated_at": "2023-02-02-etc" },
  { "...": "..." }
]

Though I could see that eventually evolving into just including the entire envelope (minus resource/decoded_resource) in such an array, which might be simpler(?) from a development standpoint.

Alternatively, would there be any reason not to just have the download be a download of envelopes (with decoded_resource included) instead of just a download of payloads? Would that break any consuming systems? I would think the bulk downloads tend to get handled manually.

Maybe a separate endpoint to download all the envelopes (with decoded_resource included)?

Just some ideas. I'm not familiar enough with @mparsons-ce's use case to say for sure what might work best.
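
For illustration only, a minimal Python sketch of how such a meta file could be generated. The input shape (a list of dicts with "ctid" and "updated_at" keys) is an assumption made for the example and may not match the real envelope schema.

```python
import json


def build_manifest(envelopes):
    """Return a JSON array listing the CTID and last-modified date of each envelope.

    The "ctid" and "updated_at" keys are assumed for this example; any other
    keys on the input records (such as the resource itself) are left out.
    """
    entries = [
        {"ctid": env["ctid"], "updated_at": env["updated_at"]}
        for env in envelopes
    ]
    # Sort by CTID so the manifest is stable between runs.
    entries.sort(key=lambda entry: entry["ctid"])
    return json.dumps(entries, indent=2)
```

The resulting string could be written to an extra file inside the ZIP, next to the payload files.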

excelsior commented 1 year ago

All, I implemented the latter suggestion made by Nate. Now the bulk download files contain full envelope representations (including the last modified date), not just payloads.

excelsior commented 1 year ago

@mparsons-ce @siuc-nate The asynchronous envelope download feature is available on sandbox.

The API has two new endpoints:

POST /{community}/envelopes/downloads
GET /{community}/envelopes/downloads/{id}

To initiate a new download of all the CE Registry envelopes, execute the following request:

POST https://sandbox.credentialengineregistry.org/ce_registry/envelopes/downloads

The response will look like this:

{
    "id": "d5c3ed64-e4c6-4cc0-bbe2-75ff26ca0b47",
    "status": "pending",
    "url": null
}

Then use the id property above to check the download's status:

GET https://sandbox.credentialengineregistry.org/ce_registry/envelopes/downloads/d5c3ed64-e4c6-4cc0-bbe2-75ff26ca0b47

The possible statuses are pending, in progress, finished, and failed.

Once the download finishes successfully, the URL of the ZIP archive uploaded to the S3 bucket will be returned as well:

{
    "id": "d5c3ed64-e4c6-4cc0-bbe2-75ff26ca0b47",
    "status": "finished",
    "url": "https://cer-envelope-downloads.s3.us-east-2.amazonaws.com/ce_registry_1688525448_aff9fc282d6b9cca4b844d4f75a95b1a.zip"
}

Use the URL to access the data directly.

There's additional documentation in Swagger.
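
As a rough client-side sketch of the flow above (start the download, poll its status, then read the URL), something like the following Python could work. The helper names, the polling interval, and the retry limit are assumptions made for the example. No authentication is sent, since none is mentioned above; add headers if the API requires them.

```python
import json
import time
import urllib.request

BASE_URL = "https://sandbox.credentialengineregistry.org"


def http_json(method, url):
    """Send a request with no body and decode the JSON response."""
    request = urllib.request.Request(url, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_download(community, fetch=http_json, sleep=time.sleep,
                      interval=30, max_checks=120):
    """Start a download, then poll until it is finished or failed.

    Returns the ZIP archive URL. `interval` and `max_checks` are arbitrary
    example values, not limits defined by the API.
    """
    downloads_url = f"{BASE_URL}/{community}/envelopes/downloads"
    download = fetch("POST", downloads_url)
    for _ in range(max_checks):
        if download["status"] == "finished":
            return download["url"]
        if download["status"] == "failed":
            raise RuntimeError(f"download {download['id']} failed")
        # Any other status is treated as "still running".
        sleep(interval)
        download = fetch("GET", f"{downloads_url}/{download['id']}")
    raise TimeoutError(f"download {download['id']} did not finish in time")
```

Calling `wait_for_download("ce_registry")` issues a single POST and then only GETs, so repeated status checks do not queue additional downloads.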

siuc-nate commented 1 year ago

I have updated our downloads page to make use of the above features. It appears to be working as intended, thanks.

siuc-nate commented 1 year ago

Closing this. I will reopen it if we have any issues on production. Thanks.

siuc-nate commented 1 year ago

I think something may have broken with this. I've been trying to download the data and the process has taken over an hour with no (apparent) result yet.

Do you have logging on your side that indicates an issue?

excelsior commented 1 year ago

@siuc-nate Yes, there was a problem. At the time you initiated those downloads, a bunch of envelopes had just been published, so the queue was full of indexing tasks. Also, you attempted to download the data ~20 times, so when those jobs finally kicked off, they all started at once and the server was overwhelmed by the load.

I re-ran the latest download; you can retrieve the data URL here:

GET https://credentialengineregistry.org/envelopes/downloads/ff747ad6-798a-4702-ae48-2216a6fa70e0

The plan for the future is the following:

Use dedicated queues with different priorities for various tasks, so more important ones don't get delayed by the others.
Throttle expensive tasks, so the servers don't get stuck.
Last but not least: Improve performance of indexing tasks.
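
To illustrate the first item, here is a minimal in-memory Python sketch of priority ordering. It is only a stand-in for whatever job system the registry actually uses; the class name, the priority numbers, and the task names are hypothetical, and throttling is not shown.

```python
import heapq
import itertools


class PriorityTaskQueue:
    """Tasks with a lower priority number run first; ties keep insertion order."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal-priority tasks stay first-in, first-out.
        self._counter = itertools.count()

    def push(self, priority, name):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def pop(self):
        _priority, _order, name = heapq.heappop(self._heap)
        return name

    def __len__(self):
        return len(self._heap)
```

With this ordering, a task pushed at priority 1 is popped before any number of priority 2 tasks that were queued earlier, so it is not delayed by them.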

siuc-nate commented 1 year ago

Thanks for the insight. I think I only initiated a few of those - it's likely that others were the result of testing by other team members before the question reached me and/or attempts by our partners to download the data. Those all should have been routed to the same handler on our end, but maybe they were timing out on our side or something.

siuc-nate commented 1 year ago

This still seems to be unresponsive - have you had any luck looking into the issue?

excelsior commented 1 year ago

@siuc-nate This functionality was disabled while I was working on the fix. It's available again now. The fix is to run one download at a time, so if multiple downloads are requested, they will take some time to complete.

I have yet another improvement in mind, namely to reuse previous exports if there were no changes to the data. I'll let you know once it's implemented.
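
A rough Python sketch of that reuse idea, under stated assumptions: the data counts as unchanged when both the envelope count and the latest "updated_at" value match the previous export. The class name, the fingerprint, and the `build_export` callable are hypothetical and not the planned implementation.

```python
class ExportCache:
    """Remember the last export and reuse it while the data is unchanged."""

    def __init__(self, build_export):
        # Hypothetical callable that creates an export and returns its URL.
        self._build_export = build_export
        self._fingerprint = None
        self._url = None

    def get(self, envelopes):
        # Fingerprint: envelope count plus the latest modification timestamp.
        # Assumes every publish or edit bumps "updated_at" and every deletion
        # changes the count.
        fingerprint = (
            len(envelopes),
            max((env["updated_at"] for env in envelopes), default=None),
        )
        if fingerprint != self._fingerprint:
            self._url = self._build_export(envelopes)
            self._fingerprint = fingerprint
        return self._url
```

A fingerprint this coarse can miss some changes, so a real implementation would need a check that matches how the registry actually tracks modifications.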

siuc-nate commented 1 year ago

Is this still down for maintenance by chance?

I was trying this again today, and it runs for a while, but eventually returns with a status of "failed":

{
    "id": "decb543d-c14e-4d7a-80cf-82e6bc3291c0",
    "status": "failed",
    "url": null
}

excelsior commented 1 year ago

@siuc-nate I finally found the bug. I was using tempfiles to store the exported data, and sometimes those files were garbage-collected before they could be uploaded to S3, hence the random failures. I've switched to regular files, so it should now work as expected. I'm still working on a couple of other improvements.
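
The same pattern can be sketched in Python, purely as an analogy and not the registry's code. Python's `tempfile.NamedTemporaryFile` has a similar hazard, because by default the file is deleted when the object is closed, including when it is garbage-collected. The function name and the `upload` callable below are hypothetical.

```python
import os
import uuid


def export_and_upload(payload, upload, directory):
    """Write `payload` (bytes) to a regular file, upload it, then delete it.

    `upload` is a hypothetical callable that takes a file path; in a real
    service it would push the file to S3.
    """
    path = os.path.join(directory, f"export_{uuid.uuid4().hex}.zip")
    with open(path, "wb") as handle:
        handle.write(payload)
    try:
        # The file stays on disk until it is removed below, so its lifetime
        # no longer depends on any object staying referenced.
        return upload(path)
    finally:
        os.remove(path)
```

The trade-off is that cleanup becomes the caller's job, which is why the removal sits in a `finally` block.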

CredentialEngine / CredentialRegistry

Include the last modified date in the bulk download files #638