Closed by ntai-arxiv 2 months ago
The 3 sets of scripts under sync_prod_to_gcp are a little confusing. There is a systemd config only for submissions_to_gcp, so it seems that this is the only one in use now. Could the sync_published ones be moved to their own dir? Could all the other scripts be moved to their own dirs?
submissions_to_gcp.py uses sync_published_to_gcp.py's ensure_FOO. And, for now, the cron job is still using sync_published_to_gcp.py/sync_published.sh. If anything, webnode_pdf_request.py is definitely obsolete.
I will think about how to consolidate the 3 services. However, I'd like to keep the files as they are for now, since the deployment relies on these Python and shell scripts.
Once the submissions_to_gcp service works without a hitch, we can start the consolidation, beginning with stopping the 6 cron jobs. Until then, I would rather not stir the pot too much.
This is intended to replace 2 services: one syncs the files, the other asks the web node for the PDF.
Reads the paper IDs, etc. from pubsub (let's call each queue element / paper ID a "job")
Figures out the files that should be on GCP
Copies the files that are not there
For the /ftp directory, the extra files are moved to "/trash"
Accepts or rejects the job on completion, which determines whether it is retried.
Relies on the queue back off for retries
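The steps above can be sketched as a small reconciliation function. This is only an illustration, not the code in submissions_to_gcp.py: the names `plan_sync` and `handle_job`, the `copy_file` / `trash_file` callbacks, and the choice of `OSError` as the failure signal are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    """What one job needs to do (hypothetical structure)."""
    to_copy: list = field(default_factory=list)
    to_trash: list = field(default_factory=list)


def plan_sync(expected, existing, trash_scope="/ftp/"):
    """Compare the files that should be on GCP with the ones that are there."""
    expected_set = set(expected)
    existing_set = set(existing)
    # Files that should exist but do not yet.
    to_copy = sorted(expected_set - existing_set)
    # Extra files are only trashed under the /ftp directory.
    to_trash = sorted(
        p for p in existing_set - expected_set if p.startswith(trash_scope)
    )
    return SyncPlan(to_copy, to_trash)


def handle_job(expected, existing, copy_file, trash_file):
    """Return True to accept the job, False to reject it so the queue retries."""
    plan = plan_sync(expected, existing)
    try:
        for path in plan.to_copy:
            copy_file(path)
        for path in plan.to_trash:
            trash_file(path)  # assumed to move the object under "/trash"
    except OSError:
        return False
    return True
```

Keeping the planning step free of I/O is what makes it easy to unit-test that the correct files are listed for a job.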
When a job is stuck for a long time (30 minutes), it sends an email to the admin team and a Slack message to us. This means that, rather than the queue element staying stuck in the queue, the email / admin ticket becomes the queue.
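A rough sketch of that escalation rule, under my reading of the description: a job that keeps failing is rejected so the queue backs off and retries, and once it has been failing for 30 minutes it is handed to a human and removed from the queue. The function names, the `notify` callback, and the "ack"/"nack" return values are hypothetical.

```python
import time

STUCK_AFTER_SECONDS = 30 * 60  # the 30-minute threshold from the description


def is_stuck(first_seen, now=None, threshold=STUCK_AFTER_SECONDS):
    """True once a job has been pending for at least `threshold` seconds."""
    now = time.time() if now is None else now
    return now - first_seen >= threshold


def process(job_id, first_seen, try_sync, notify, now=None):
    """Return "ack" to drop the job from the queue, "nack" to let it retry."""
    if try_sync(job_id):
        return "ack"
    if is_stuck(first_seen, now):
        # Hand the job over: the email / admin ticket now tracks it.
        notify(f"sync job {job_id} has been stuck for 30+ minutes")
        return "ack"
    # Not stuck yet: reject and rely on the queue back-off for the retry.
    return "nack"
```

Passing `now` explicitly keeps the rule testable without waiting on a real clock.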
It starts a local HTTP server to mimic the web node, and it copies the PDF to the correct location. The server can also return a 404 or time out.
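A stand-in web node of this kind can be built with the standard library alone. The sketch below is an assumption about the general shape, not the actual test fixture: the `/pdf/2401.00001v1` path and its payload are made up, and only the 200 and 404 cases are shown (a timeout could be simulated by sleeping in the handler).

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical paper path and payload, used only for this example.
FAKE_PDFS = {"/pdf/2401.00001v1": b"%PDF-1.4 fake"}


class FakeWebNode(BaseHTTPRequestHandler):
    def do_GET(self):
        body = FAKE_PDFS.get(self.path)
        if body is None:
            self.send_error(404)  # unknown paper: behave like a missing PDF
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def start_fake_web_node():
    """Start the server on an ephemeral local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), FakeWebNode)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_status(url, timeout=5):
    """Return the HTTP status code for `url`, bypassing any configured proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
```

Binding to port 0 lets the OS pick a free port, so several test runs can share a machine without colliding; the chosen port is available as `server.server_port`.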
Unit tests make sure that the correct files are listed from the job request.
It can trash the obsolete (GCP) files - which is the point of this ticket.
I need to deploy this and then monitor the logs to confirm that the existing sync cron jobs no longer do anything.