Feature request: Automated management of file uploads

mjpost commented 2 years ago

When ingesting new volumes or making corrections to papers, our current process is to download the files to the local computer of whoever is doing the ingestion, and then manually upload them to the correct position via rsync or scp. There are a number of problems with this:

It is error-prone: we often upload files to the wrong place, forget to upload them entirely, or accidentally overwrite files with an incorrect version.
File uploads are out-of-sync with the website rebuild: for example, revisions get uploaded immediately, but are not reflected on the page until much later. The same occurs for ingestion, where this could be more problematic, since papers shouldn’t be made public until the official release date.

I’d like to replace this manual process with an automated one. I suggest a two-step process:

We move the upload functionality to a library function, which can be called from ingestion scripts, revision scripts, and so on. However, instead of uploading the file to its proper place, the function uploads it to a queue, where it sits until the proper time.
A process on the server runs periodically, moving files from the queue into place.

Library function

We write a library function, say upload_to_queue, which takes a file containing its proper Anthology ID, and uploads it to the queue. It can perform a number of checks:

Ensuring that the file upload is “licensed” by the code in the current branch. For example, a file revision should only be uploaded if the current XML contains an entry for it.
Ensuring that the checksum on the file matches that in the XML.
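
A minimal sketch of the "licensed" check, for illustration only. The XML layout used here is a simplified assumption (any element with a `hash` attribute names its file in `href` or in its text), and `is_licensed` is a hypothetical helper name, not existing Anthology code.

```python
import xml.etree.ElementTree as ET


def is_licensed(xml_text, file_stem, checksum):
    """Return True if the XML has an entry for file_stem with this checksum.

    Assumed (hypothetical) layout: an element carrying a `hash` attribute
    names its file either in an `href` attribute or in its text content.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        recorded = elem.get("hash")
        if recorded is None:
            continue
        name = elem.get("href") or (elem.text or "").strip()
        if name == file_stem:
            return recorded == checksum
    return False


# Made-up sample data; the hash values are placeholders, not real checksums.
SAMPLE = """<paper id="19">
  <url hash="0a1b2c3d">W19-6319</url>
  <revision id="2" href="W19-6319v2" hash="deadbeef">Fixed typos.</revision>
</paper>"""

print(is_licensed(SAMPLE, "W19-6319v2", "deadbeef"))  # entry present, hash matches
print(is_licensed(SAMPLE, "W19-6319v3", "deadbeef"))  # no entry for this revision
```

With a check like this, a revision whose entry is missing from the current branch is rejected before anything is uploaded.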

Queue management

On the server, files should not be put into place until the respective code is live. A cron script should regularly update the local code’s master branch. It can then check the queue, looking for files that can be moved into place.
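
As a rough sketch of the queue-processing step, assuming queued files are named {filename}.{checksum} (the naming proposed later in this thread). The function name, the flat destination directory, and the `licensed` mapping are all hypothetical; updating the master branch and extracting checksums from the XML are assumed to happen beforehand.

```python
import shutil
from pathlib import Path


def release_queued_files(queue_dir, live_dir, licensed):
    """Move queued files into place when the live code licenses them.

    `licensed` maps a destination file name to the checksum recorded in the
    up-to-date master branch. Queued files are named {filename}.{checksum}.
    Returns the destination names that were moved.
    """
    moved = []
    # sorted() materializes the listing before any file is moved
    for queued in sorted(Path(queue_dir).iterdir()):
        if not queued.is_file():
            continue
        filename, _, checksum = queued.name.rpartition(".")
        if licensed.get(filename) != checksum:
            continue  # not live yet, or a stale version: leave it queued
        Path(live_dir).mkdir(parents=True, exist_ok=True)
        shutil.move(str(queued), str(Path(live_dir) / filename))
        moved.append(filename)
    return moved
```

Files whose checksum does not match the live XML simply stay in the queue, so a cron job could call this repeatedly without harm.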

xinru1414 commented 2 years ago

We already have a ticket for this #1740 ?

mjpost commented 2 years ago

I guess we should really implement this feature!

mjpost commented 2 years ago

Also, it's worth pointing out that this issue elaborates a bit on #1740, since the idea for the queue feature is new.

davidstap commented 2 years ago

I can take a look after I get the IWSLT proceedings out of the way :) working on that now.

xinru1414 commented 2 years ago

@mjpost @davidstap Since #1713 is wrapped up, I can move on to this now.

xinru1414 commented 2 years ago

@mjpost I have several questions:

Regarding the queue feature and queue management
- Should the queue be on the server? Or person_in_charge's local machine?
"A process on the server runs periodically, moving files from the queue into place."

From this sentence I infer that the queue should be on the server. However, it is possible that this periodically run process on the server is going to transfer files from a queue on person_in_charge's local machine to the server.

"A cron script should regularly update the local code’s master branch. It can then check the queue, looking for files that can be moved into place."

I'm a bit confused here. Does this cron script refer to the same thing as the periodically run process on the server? Or are you suggesting two processes/scripts, i.e., one on the server moving files from the queue into place and one on the local machine that updates the local master branch?

My suggestion is one script (to be run locally) for the queue management. This script should 1) update the local master branch and 2) transfer files in the queue (on the server) into place (on the server).
- By queue, do you mean a directory in a structured way? For example, our current way of transferring from local to the server: person_in_charge_local_root/anthology_files/pdf/VENUE. If that's the case, I imagine this queue dir can be: anthology_files/queue/pdf/VENUE, anthology_files/queue/attachments/VENUE
Regarding the library upload function
- Considering what you put down here and for the upload.py in #1740, the script should 1) ensure the file is licensed, 2) ensure the checksum matches, and 3) infer the file location in the queue (assuming we use a structured directory).

mjpost commented 2 years ago

Let’s do this in two pieces: upload first.

The library function should upload to anthology:anthology-files/queue/{type}/{filename}.{checksum}, where {checksum} is the file checksum (CRC32), {filename} is the normal filename, and {type} is the top-level type (currently one of pdf, attachment, or video). So the queue is parallel to the actual directory structure. I think the function signature should be like this:

def upload_file(local_path, resource_type, dest_file_name, checksum)

e.g.,

upload_file("/Users/post/1.pdf", "pdf", "W19-6319v2.pdf", "XXX")

where XXX is some CRC32 checksum. The function will upload the file to anthology:anthology-files/queue/pdf/W19-6319v2.pdf.XXX, after confirming that the checksum is correct.
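
One way the function could look, as a sketch under stated assumptions: the checksum is taken to be the CRC32 written as 8 lowercase hex digits, the transfer uses rsync, and the `run` parameter and `crc32_hex` helper are hypothetical additions that make the function testable without a server.

```python
import subprocess
import zlib

QUEUE_ROOT = "anthology:anthology-files/queue"  # destination from the proposal


def crc32_hex(path):
    """Return a file's CRC32 as 8 lowercase hex digits (format is an assumption)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def upload_file(local_path, resource_type, dest_file_name, checksum,
                run=subprocess.run):
    """Verify the checksum, then copy the file into the queue.

    `run` is injectable (hypothetical parameter) so the transfer can be
    replaced by a stub in tests.
    """
    if resource_type not in ("pdf", "attachment", "video"):
        raise ValueError(f"unknown resource type: {resource_type}")
    actual = crc32_hex(local_path)
    if actual != checksum:
        raise ValueError(f"checksum mismatch: expected {checksum}, got {actual}")
    dest = f"{QUEUE_ROOT}/{resource_type}/{dest_file_name}.{checksum}"
    run(["rsync", "-az", str(local_path), dest], check=True)
    return dest
```

Refusing to upload on a mismatch means a stale local copy never reaches the queue in the first place.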

The CLI for the script will take a single file or a directory of files, figure out the correct name, and use this function to upload each of them. Since the destination file name is a function of the context, it is up to the caller to figure out the name. For example, when a file is ingested, its target name is something like 2022.acl-main.9382.pdf; when a revision occurs, its name is 2022.acl-main.9382v2.pdf, and so on.
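
To illustrate the naming step, here is a small sketch of how a caller might derive the destination name and gather input files. Both helper names and their parameters are hypothetical; only the two example file names come from the description above.

```python
from pathlib import Path


def destination_name(anthology_id, revision=None, extension="pdf"):
    """Build the target file name for an ingested or revised file."""
    suffix = "" if revision is None else f"v{revision}"
    return f"{anthology_id}{suffix}.{extension}"


def collect_files(path):
    """Return [path] for a single file, or the sorted files in a directory."""
    path = Path(path)
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.is_file())
    return [path]


print(destination_name("2022.acl-main.9382"))              # 2022.acl-main.9382.pdf
print(destination_name("2022.acl-main.9382", revision=2))  # 2022.acl-main.9382v2.pdf
```

A CLI wrapper would then loop over `collect_files(...)` and pass each computed name to the upload function.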

Is that clear enough? Have I covered everything? Once that’s in place, we can talk about the server part.

xinru1414 commented 2 years ago

@mjpost Thanks for the clarification. The plan sounds good. The upload piece is clear. I'll reach out if I have more Qs.

acl-org / acl-anthology

Feature request: Automated management of file uploads #1818
