Unpredictable problem w/ ALDocument.as_pdf() method - displays wrong file from server

SuffolkLITLab / docassemble-AssemblyLine

Quickly go from a paper court form to a runnable, guided, step-by-step web application powered by Docassemble. Swap out branding and pre-built questions to meet your needs.

https://suffolklitlab.org/docassemble-AssemblyLine-documentation/

MIT License

42 stars 5 forks source link

Unpredictable problem w/ ALDocument.as_pdf() method - displays wrong file from server #263

Closed nonprofittechy closed 3 years ago

nonprofittechy commented 3 years ago

I have had a few reports and now witnessed myself a situation where the as_pdf() method displays a wrong, cached file from the server. Most recently, I ran the CDC Moratorium interview and on the preview screen it linked the PDF for the civil action cover sheet. On the final download screen, it linked the PDF for the Civil Docketing Statement. Both are interviews I ran yesterday evening.

I assume our aggressive caching of the ALDocument is to blame, although it could also be a Docassemble bug. We should try to reproduce and track this problem down. Also need to figure out if this could possibly leak private data.

For now, restarting nginx seems to fix it (by saving the config file, e.g.).

Edit: note that the download_list_html() results appeared to be correct--only the file displayed with as_pdf() was wrong.

nonprofittechy commented 3 years ago

Plocket tried a test reproduction, using the test_alexhibit.yml interview. She uploaded this file: file and I confirmed that the contents match inside the docker container: it was /tmp/files/101283/file.jpg and the same path on S3 also matched.

reproduction.json.txt

However, when the file was added to an ALDocumentBundle, this uploaded file doesn't appear in any of the bundles.

Restarting the server by editing and saving the config does not change the output at all.

exhibits_doc_custom.pdf exhibits_doc_defaults.pdf

nonprofittechy commented 3 years ago

Potentially relevant: https://blog.miguelgrinberg.com/post/how-secure-is-the-flask-user-session

nonprofittechy commented 3 years ago

Notes from Coding meeting:

Potential causes?

Class-scoped attributes?
Race condition w/ S3?
Some other problem with the DALazyAttribute class?
Shared memory issue?
- This would be in the flask session, we presume.
Should we investigate file “number” collisions? We could log filename/number each time

nonprofittechy commented 3 years ago

Looked at a couple places in the DA source code.

File number is created in a postgres sequence that is supposed to guarantee uniqueness.

There are no duplicate file numbers in the database:


docassemble=# select * from uploads ou
docassemble-# where (select count(*) from uploads inr
docassemble(# where inr.indexno = ou.indexno) > 1
docassemble-# ;
indexno | key | filename | yamlfile | private | persistent
---------+-----+----------+----------+---------+------------
(0 rows)

nonprofittechy commented 3 years ago

We tried to force this to recur by running 400 interviews in a row and did not get any form leakage. Also investigated the server uptime on affected servers and all the servers restarted about 4 days ago. It's possible it was due to an unsafe shutdown caused by an automated server security patch. Going to keep this open a couple more days and then close. Hopefully we can get some fixes upstream, if it really was a database issue, to ensure that unsafe shutdowns don't leak data.

BryceStevenWilley commented 3 years ago

So to check, the leak would be because we've created files up to some number, say 1000. Then we save, and create some more, up to 1500. Then the server shutsdown unexpectedly, and the postgres db isn't full saved. So, S3 has existing files up to 1500, but postgres only thinks that there's 1000, so the next 500 files will conflict with existing files.

Is there any process we (or DA core) can run after a potential shutdown / upgrade to catch that discrepancy? Should be able to get the current val of the index with currval

nonprofittechy commented 3 years ago

That is my guess about how this explanation would work, yes. I think there are a number of mitigations that DA could take to avoid it presenting exactly this way--but to be fair, a database rollback is always going to have negative consequences.

Truthfully Postgres can handle an unsafe shutdown 99% of the time with no issue. So I wonder if it's really necessary to load the backup from 6 AM every time you restart the server. Maybe it should try just starting up and checking for corruption, at least optionally.

nonprofittechy commented 3 years ago

We haven't seen this problem come back--I think we can safely assume that the database changes that Jonathan made will resolve this in the future. We still need to schedule a new docker run, and should do our best to set up an external database for our production server when it's possible.