acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Feature request: Automated management of file uploads #1818

Open mjpost opened 2 years ago

mjpost commented 2 years ago

When ingesting new volumes or making corrections to papers, our current process is to download the files to the local computer of whoever is doing the ingestion, and then manually upload them to the correct position via rsync or scp. There are a number of problems with this:

  1. It is error-prone: we often upload files to the wrong place, forget to upload them entirely, or accidentally overwrite files with an incorrect version.
  2. File uploads are out of sync with the website rebuild: for example, revisions are uploaded immediately but are not reflected on the page until much later. The same happens for ingestion, where it could be more problematic, since papers shouldn’t be made public until the official release date.

I’d like to replace this manual process with an automated one. I suggest a two-step process:

  1. We move the upload functionality into a library function that can be called from ingestion scripts, revision scripts, and so on. However, instead of uploading a file to its proper place, the function uploads it to a queue, where it waits until the proper time.
  2. A process on the server runs periodically, moving files from the queue into place.

Library function

We write a library function, say upload_to_queue, which takes a file containing its proper Anthology ID, and uploads it to the queue. It can perform a number of checks:

Queue management

On the server, files should not be put into place until the respective code is live. A cron script should regularly update the local code’s master branch. It can then check the queue, looking for files that can be moved into place.
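
A minimal sketch of what the periodic queue pass might look like, assuming the `{type}/{filename}.{checksum}` queue layout proposed later in this thread. The names `process_queue`, `crc32_hex`, and `is_ready` are hypothetical, the flat `{type}/{filename}` destination is a simplification of the real directory structure, and the 8-hex-digit checksum format is an assumption. The `is_ready` hook stands in for whatever "the respective code is live" check is eventually chosen.

```python
import shutil
import zlib
from pathlib import Path


def crc32_hex(path):
    """Return the CRC32 of a file as 8 lowercase hex digits (format is an assumption)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def process_queue(queue_root, live_root, is_ready=lambda filename: True):
    """Move verified, ready files from the queue into place; return the new paths.

    is_ready is a hypothetical hook: it should return True only once the
    code that references the file is live.
    """
    moved = []
    for queued in sorted(Path(queue_root).glob("*/*")):
        if not queued.is_file():
            continue
        # Queue names look like {filename}.{checksum}; split on the last dot.
        filename, _, checksum = queued.name.rpartition(".")
        if not filename or crc32_hex(queued) != checksum.lower():
            continue  # incomplete or corrupted upload: leave it in the queue
        if not is_ready(filename):
            continue  # not yet live: try again on the next run
        target = Path(live_root) / queued.parent.name / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(queued), str(target))
        moved.append(target)
    return moved
```

A cron job could call `process_queue` after updating the local checkout; files that fail the checksum or readiness check simply stay queued for the next pass.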

xinru1414 commented 2 years ago

We already have a ticket for this #1740 ?

mjpost commented 2 years ago

I guess we should really implement this feature!

mjpost commented 2 years ago

Also, it's worth pointing out that this issue elaborates on #1740 a bit, since the idea for the queue feature is new.

davidstap commented 2 years ago

I can take a look after I get the IWSLT proceedings out of the way :) working on that now.

xinru1414 commented 2 years ago

@mjpost @davidstap Since #1713 is wrapped up, I can move on to this now.

xinru1414 commented 2 years ago

@mjpost I have several questions:

mjpost commented 2 years ago

Let’s do this in two pieces: upload first.

The library function should upload to anthology:anthology-files/queue/{type}/{filename}.{checksum}, where {checksum} is the file checksum (CRC32), {filename} is the normal filename, and {type} is the top-level type (currently one of pdf, attachment, or video). So the queue is parallel to the actual directory structure. I think the function signature should be like this:

def upload_file(local_path, resource_type, dest_file_name, checksum)

e.g.,

upload_file("/Users/post/1.pdf", "pdf", "W19-6319v2.pdf", "XXX")

where XXX is some CRC32 checksum. The function will upload the file to anthology:anthology-files/queue/pdf/W19-6319v2.pdf.XXX, after confirming that the checksum is correct.
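
One way the function described above could be sketched is shown here. This is an illustration rather than the project's implementation: the helpers `crc32_hex` and `queue_destination`, the `dry_run` flag, the choice of `rsync` over `scp`, and the 8-hex-digit checksum format are all assumptions. Only the signature, the resource types, and the queue path come from the comment above.

```python
import subprocess
import zlib

QUEUE_ROOT = "anthology:anthology-files/queue"  # queue location from the comment above
RESOURCE_TYPES = {"pdf", "attachment", "video"}


def crc32_hex(local_path, chunk_size=1 << 20):
    """Return the CRC32 of a file as 8 lowercase hex digits (format is an assumption)."""
    crc = 0
    with open(local_path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def queue_destination(resource_type, dest_file_name, checksum):
    """Build the queue path {type}/{filename}.{checksum} (hypothetical helper)."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    return f"{QUEUE_ROOT}/{resource_type}/{dest_file_name}.{checksum}"


def upload_file(local_path, resource_type, dest_file_name, checksum, dry_run=False):
    """Verify the checksum, then copy the file into the queue.

    Returns the command that was (or, with dry_run=True, would be) run.
    dry_run is an assumption added here so the logic can be exercised offline.
    """
    actual = crc32_hex(local_path)
    if actual != checksum.lower():
        raise ValueError(f"checksum mismatch: expected {checksum}, got {actual}")
    dest = queue_destination(resource_type, dest_file_name, actual)
    command = ["rsync", local_path, dest]
    if not dry_run:
        subprocess.run(command, check=True)
    return command
```

Verifying the checksum before the transfer means a stale or wrong local file is rejected on the uploader's machine instead of landing in the queue.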

The CLI for the script will take a single file or a directory of files, figure out the correct name, and use this function to upload each of them. Since the destination file name is a function of the context, it is up to the caller to figure out the name. For example, when a file is ingested, its target name is something like 2022.acl-main.9382.pdf; when a revision occurs, its name is 2022.acl-main.9382v2.pdf, and so on.
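
To illustrate how a caller might derive the destination name from context, here is a small sketch. The helper `target_name` and its parameters are hypothetical. It only reproduces the two naming patterns given above (plain ID for ingestion, ID plus `vN` for a revision).

```python
def target_name(anthology_id, revision=None, extension="pdf"):
    """Build a destination file name from an Anthology ID.

    revision=None gives the ingestion name; an integer N gives the
    revision name with a vN suffix. Hypothetical helper for illustration.
    """
    if revision is not None and revision < 1:
        raise ValueError("revision must be a positive integer")
    suffix = "" if revision is None else f"v{revision}"
    return f"{anthology_id}{suffix}.{extension}"


# Ingestion and revision names from the examples in this thread.
print(target_name("2022.acl-main.9382"))              # 2022.acl-main.9382.pdf
print(target_name("2022.acl-main.9382", revision=2))  # 2022.acl-main.9382v2.pdf
```

The CLI would compute a name like this for each input file and pass it to the upload function as the destination file name.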

Is that clear enough? Have I covered everything? Once that’s in place, we can talk about the server part.

xinru1414 commented 2 years ago

@mjpost Thanks for the clarification. The plan sounds good. The upload piece is clear. I'll reach out if I have more Qs.