Open mjpost opened 2 years ago
Don't we already have a ticket for this (#1740)?
I guess we should really implement this feature!
Also, it's worth pointing out that this elaborates a bit, since the idea for the queue feature is new.
I can take a look after I get the IWSLT proceedings out of the way :) working on that now.
@mjpost @davidstap since #1713 is wrapped up, I can move on to this now.
@mjpost I have several questions:
Regarding the queue feature and queue management
A process on the server runs periodically, moving files from the queue into place.
From this sentence I infer that the queue should be on the server. However, it is also possible that this periodic server process is meant to transfer files from a queue on person_in_charge's local machine to the server.
A cron script should regularly update the local code’s master branch. It can then check the queue, looking for files that can be moved into place.
I'm a bit confused here. Does this cron script refer to the same thing as the periodically run process on the server? Or are you suggesting two processes/scripts, i.e. one on the server that moves files from the queue into place and one on the local machine that updates the local master branch?
My suggestion is one script (to be run locally) for the queue management. This script should 1) update the local master branch and 2) transfer files in the queue (on the server) into place (on the server).
person_in_charge_local_root/anthology_files/pdf/VENUE. If that's the case, I imagine this queue dir can be: anthology_files/queue/pdf/VENUE, anthology_files/queue/attachments/VENUE
Regarding the library upload function
upload.py in #1740, the script should 1) ensure file is licensed 2) ensure checksum matches and 3) infer file location in queue (assuming we use a structured dir).
Let’s do this in two pieces: upload first.
The library function should upload to anthology:anthology-files/queue/{type}/{filename}.{checksum}, where {checksum} is the file checksum (CRC32), {filename} is the normal filename, and {type} is the top-level type (currently one of pdf, attachment, or video). So the queue is parallel to the actual directory structure. I think the function signature should be like this:
def upload_file(local_path, resource_type, dest_file_name, checksum)
e.g.,
upload_file("/Users/post/1.pdf", "pdf", "W19-6319v2.pdf", "XXX")
where XXX is some CRC32 checksum. The function will upload the file to anthology:anthology-files/queue/pdf/W19-6319v2.pdf.XXX, after confirming that the checksum is correct.
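A minimal sketch of how such a function might look, under stated assumptions: the checksum is written as 8 lowercase hex digits, the transfer is done with rsync (the thread does not fix a transfer mechanism), and the `dry_run` parameter and the helpers `crc32_hex` and `queue_destination` are hypothetical names added for illustration, not part of the proposed signature.

```python
import subprocess
import zlib

# Remote queue prefix as given in the thread.
QUEUE_ROOT = "anthology:anthology-files/queue"
RESOURCE_TYPES = {"pdf", "attachment", "video"}


def crc32_hex(path):
    """Return the CRC32 of a file as 8 lowercase hex digits (assumed format)."""
    crc = 0
    with open(path, "rb") as f:
        # Read in chunks so large PDFs or videos are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def queue_destination(resource_type, dest_file_name, checksum):
    """Build queue/{type}/{filename}.{checksum} under the remote prefix."""
    if resource_type not in RESOURCE_TYPES:
        raise ValueError(f"unknown resource type: {resource_type!r}")
    return f"{QUEUE_ROOT}/{resource_type}/{dest_file_name}.{checksum}"


def upload_file(local_path, resource_type, dest_file_name, checksum, dry_run=False):
    """Verify the checksum, then copy the file into the remote queue."""
    actual = crc32_hex(local_path)
    if actual != checksum.lower():
        raise ValueError(f"checksum mismatch: expected {checksum}, got {actual}")
    dest = queue_destination(resource_type, dest_file_name, actual)
    if not dry_run:
        # Assumption: rsync is installed and the 'anthology' host is configured.
        subprocess.run(["rsync", "-a", str(local_path), dest], check=True)
    return dest
```

Calling `upload_file(path, "pdf", "W19-6319v2.pdf", checksum, dry_run=True)` returns the destination string without touching the network, which is convenient for checking names before a real upload.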
The CLI for the script will take a single file or a directory of files, figure out the correct name, and use this function to upload each of them. Since the destination file name is a function of the context, it is up to the caller to figure out the name. For example, when a file is ingested, its target name is something like 2022.acl-main.9382.pdf; when a revision occurs, its name is 2022.acl-main.9382v2.pdf, and so on.
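The naming rule in those two examples could be captured by a small caller-side helper. This is only a sketch: `target_file_name` and its parameters are hypothetical, and it covers just the ingestion and revision cases mentioned here.

```python
def target_file_name(anthology_id, revision=None, extension="pdf"):
    """Build a destination name such as 2022.acl-main.9382v2.pdf.

    revision=None means a freshly ingested file (no version suffix);
    an integer adds a 'v{revision}' suffix before the extension.
    """
    if revision is not None and revision < 1:
        raise ValueError("revision must be a positive integer")
    suffix = "" if revision is None else f"v{revision}"
    return f"{anthology_id}{suffix}.{extension}"
```

For instance, `target_file_name("2022.acl-main.9382")` gives the ingestion name and `target_file_name("2022.acl-main.9382", revision=2)` gives the name for the second version.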
Is that clear enough? Have I covered everything? Once that’s in place, we can talk about the server part.
@mjpost Thanks for the clarification. The plan sounds good. The upload piece is clear. I'll reach out if I have more Qs.
When ingesting new volumes or making corrections to papers, our current process is to download the files to the local computer of whoever is doing the ingestion, and then manually upload them to the correct position via rsync or scp. There are a number of problems with this:
I’d like to replace this manual process with an automated one. I suggest a two-step process:
Library function
We write a library function, say upload_to_queue, which takes a file containing its proper Anthology ID, and uploads it to the queue. It can perform a number of checks:
Queue management
On the server, files should not be put into place until the respective code is live. A cron script should regularly update the local code’s master branch. It can then check the queue, looking for files that can be moved into place.
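A rough sketch of what the cron-driven script could do, with several assumptions stated up front: queued files are named {filename}.{checksum} with an 8-hex-digit CRC32, the final location is {root}/{type}/{filename} (the real layout may add venue subdirectories), and "the respective code is live" is decided by a caller-supplied predicate `is_live`, since the thread does not say how that is determined. All function names here are hypothetical.

```python
import shutil
import subprocess
import zlib
from pathlib import Path


def update_master(repo_dir):
    """Fast-forward the local checkout's master branch (assumes an 'origin' remote)."""
    subprocess.run(
        ["git", "-C", str(repo_dir), "pull", "--ff-only", "origin", "master"],
        check=True,
    )


def crc32_hex(path):
    """Return the CRC32 of a file as 8 lowercase hex digits (assumed format)."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def process_queue(root, is_live):
    """Move verified, releasable files from root/queue/{type}/ to root/{type}/."""
    root = Path(root)
    moved = []
    for queued in sorted((root / "queue").glob("*/*")):
        if not queued.is_file():
            continue
        # Split 'W19-6319v2.pdf.<crc>' into the file name and its checksum.
        filename, _, checksum = queued.name.rpartition(".")
        if not filename or crc32_hex(queued) != checksum.lower():
            continue  # malformed name or corrupted upload: leave it queued
        if not is_live(filename):
            continue  # corresponding code not live yet: try again next run
        dest = root / queued.parent.name / filename
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(queued), str(dest))
        moved.append(dest)
    return moved
```

A cron entry would then call `update_master(...)` followed by `process_queue(...)`. Files that fail either check simply stay in the queue, so each run can safely be repeated.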