edgi-govdata-archiving / archivers.space

🗄 Event data management app used at DataRescues
https://www.archivers.space/
GNU Affero General Public License v3.0

Compare UUIDs in Harvest phase in app w/ UUIDs of datasets already uploaded #36

Open suchthis opened 7 years ago

suchthis commented 7 years ago

It looks like at least some URLs that may have already been harvested are still in the Harvest phase in the app, because the user did not click the checkbox next to Harvest. Since this step was not included in previous event workflow documentation, the issue may be widespread. It may make sense to programmatically compare the UUIDs of datasets already uploaded with the UUIDs still in the Harvest phase in the app, and to change the status in the app for any UUID that has an uploaded dataset.

Example: http://www.archivers.space/urls/F68DCA69-4377-40DA-B576-7D3C88CC6C2A

Harvest notes: "Over 6,000 files totaling 82 GB. Largest file is 12 GB, which is a massive orthographic mosaic tif. Zip file of 62 GB was uploaded via AWS token, appears to have completed successfully at 5:26 PM, though this site does not seem to acknowledge it."

This may explain why there are relatively few URLs in post-harvest phases in the app, despite the many recent events.
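The comparison described above could be sketched roughly as follows. This is only an illustration, not code from the app: the function names are hypothetical, and it assumes both lists of UUIDs have already been exported as plain strings (one from the app's Harvest-phase URLs, one from the uploaded datasets). UUIDs are normalized first, so that differences in letter case do not hide a match.

```python
import uuid


def normalize_uuid(value):
    """Return a canonical lower-case UUID string, or None if value is not a UUID."""
    try:
        return str(uuid.UUID(value.strip()))
    except (ValueError, AttributeError):
        return None


def harvested_but_unmarked(harvest_phase_uuids, uploaded_uuids):
    """Return UUIDs still in the Harvest phase that already have an uploaded dataset.

    Both arguments are iterables of strings (hypothetical exports, see above).
    Entries that are not valid UUIDs are ignored.
    """
    uploaded = {u for u in map(normalize_uuid, uploaded_uuids) if u}
    in_harvest = {u for u in map(normalize_uuid, harvest_phase_uuids) if u}
    # The set intersection is the list of candidates for a status change.
    return sorted(in_harvest & uploaded)
```

The result is only a candidate list. Given the example above, where a large upload appeared to finish without the site acknowledging it, each match would probably still need a check that the upload is complete before its status is changed.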

kmcculloch commented 7 years ago

See https://github.com/edgi-govdata-archiving/archivers.space/issues/38 for UI recommendation on how to keep this from happening in the future.

I'm using the "data integrity" label for cleanup tasks like this. I'd love to hear from devs with more MongoDB experience how they would handle a task like this on the db side.

Other options:

  1. Rather than writing an update script, we could write a query that finds URLs that might be in this state and displays them in the app for an admin to review.
  2. Or we could skip the coding altogether and ask some volunteers to review the harvest list by hand.
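For option 1, a read-only query might look something like the sketch below. It only builds a MongoDB filter document using the standard `$in` operator and changes nothing in the database. The field names `phase` and `uuid` and the value `"harvest"` are assumptions for illustration, since the app's actual schema is not shown in this thread.

```python
def build_review_filter(uploaded_uuids, phase_field="phase",
                        uuid_field="uuid", harvest_value="harvest"):
    """Build a MongoDB filter for URLs in the Harvest phase whose UUID was uploaded.

    All field names and the phase value are hypothetical defaults; adjust them
    to match the real collection before using the filter.
    """
    return {
        phase_field: harvest_value,
        # $in matches documents whose UUID equals any value in the list.
        uuid_field: {"$in": sorted(set(uploaded_uuids))},
    }
```

The resulting dictionary could be passed to a find call (for example pymongo's `Collection.find`) to produce the admin review list. Note that `$in` compares strings exactly, so the UUIDs would need to use the same letter case as the stored values.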