ITISFoundation / osparc-simcore


Dynamic service (jupyter smash) keeps uploading large files to S3 and bricks dalco #2815

Closed: mrnicegyu11 closed this issue 1 year ago

mrnicegyu11 commented 2 years ago

On 09 Feb 2022, there was data loss on dalco-production. Investigation showed that the minio cluster of dalco was in a very bad state (continuous inter-node communication, file healing, some storage nodes left the swarm, high network activity, long disk I/O times).

This was triggered by an upload of almost 1 TB of files in total, starting at 4 pm. The user stopped working on the study at approx. 6 pm, when she left the office with the oSparc tabs still open. The upload originated from a sim4life jupyter smash dynamic service in Melanie's study, which continuously uploaded the same file (output_1.zip) over and over. Each version of the file is 14 GB. Minio/S3 recorded each of these uploads in its versioning system. I checked versions v35, v93 and v97 individually.

To find them: with the minio `mc` tool and the alias `dalco` properly set, run `mc ls dalco/production-simcore/38a2e328-4c5c-11ec-854c-02420a0b01d2 --versions -r | sort`.
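
For reference, the same version listing can also be obtained directly through the S3 API. Below is a minimal sketch using boto3; the endpoint URL and credentials are placeholders, not the real dalco values. Note that for multipart uploads the ETag is not a plain MD5 of the content, which is presumably why checking md5sum on downloaded copies was needed.

```python
# Sketch: list all versions of the repeatedly uploaded objects via the S3 API.
# Endpoint and credentials are placeholders; bucket/prefix come from the mc
# command above.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.dalco.example",  # placeholder endpoint
    aws_access_key_id="REPLACE_ME",              # placeholder credentials
    aws_secret_access_key="REPLACE_ME",
)

bucket = "production-simcore"
prefix = "38a2e328-4c5c-11ec-854c-02420a0b01d2"  # study id from the mc command

paginator = s3.get_paginator("list_object_versions")
versions = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    versions.extend(page.get("Versions", []))

for v in sorted(versions, key=lambda v: (v["Key"], v["LastModified"])):
    # For multipart uploads the ETag is NOT a plain MD5 of the file content.
    print(v["Key"], v["VersionId"], v["LastModified"], v["Size"], v["ETag"])
print("total versions:", len(versions))
```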

Their content and md5sums are identical. The uploads likely stopped when minio was in such a bad state that two out of four nodes went down simultaneously, quorum was lost, and the storage cluster was unreachable for some time. I/O activity on minio continued until midnight (see Grafana); the last version of the file was recorded in minio as modified at [2022-02-08 22:24:46 UTC].

Possible follow-ups:

Caution:

(Four screenshots attached.)

mrnicegyu11 commented 2 years ago

The continuous data uploads can be seen in the logs of the dynamic-service container. To find them, search Graylog for `container_name: /.*a0482e91-d252-4c1a-ad47-457e213592e7.*/` on Feb 08 from 2 pm to midnight.

GitHK commented 2 years ago

Thanks for the details and accurate description. I can already point out the issue.

The whole outputs directory of the study is monitored by a file system event watcher. Whenever something changes a file inside this directory, an event is triggered (for example: a file is created, modified, moved or deleted).
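
For illustration only (this is not the osparc-simcore implementation; the watched path and handler name below are made up), here is a minimal sketch with the `watchdog` library showing the kind of events such a watcher receives:

```python
# Illustration only: watch an outputs directory and print every event.
# The real dynamic-sidecar code differs; path and class name are placeholders.
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class OutputsEventHandler(FileSystemEventHandler):
    def on_any_event(self, event):
        # Fired for created / modified / moved / deleted entries in the tree;
        # each one is a potential "outputs changed -> upload" trigger.
        print(f"{event.event_type}: {event.src_path} (is_dir={event.is_directory})")


if __name__ == "__main__":
    observer = Observer()
    observer.schedule(OutputsEventHandler(), path="outputs", recursive=True)
    observer.start()
    try:
        time.sleep(30)  # touch files under ./outputs to see the events
    finally:
        observer.stop()
        observer.join()
```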

What was done to prevent this?

  1. To avoid triggering too many upload requests, a feature (let's call it the "stale detector") was implemented: if all write activity has ceased for 1 second, an upload request is started.
  2. Another situation that could occur was multiple parallel uploads of the same file. To avoid this, a different feature (let's call it "wait in queue") was implemented: if multiple upload requests for the same file are detected, they get queued up.

So if you combine 1 with 2, you end up in the described situation (see the sketch below). Most likely the issue was caused by writing a very big file directly to the outputs directory (or, with slower disks, by copying it into the outputs directory). The best solution would be to write outside the outputs folder and then move the content to the output port, because a move avoids copying the data and only changes the inode pointer.
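
As a sketch only (this is not the actual osparc-simcore code; timings, names and the simulated upload are invented for the example), the following shows how the "stale detector" (1 s debounce) combined with "wait in queue" (serialized uploads) keeps re-uploading identical content when a solver touches a big output file every now and then:

```python
# Sketch only (not the actual osparc-simcore code; timings and names are
# invented): the "stale detector" waits for 1 s of write silence before
# requesting an upload, and "wait in queue" serializes uploads of the same
# file -- so every touch of a big output file eventually re-uploads it.
import asyncio

STALE_SECONDS = 1.0   # debounce window of the "stale detector"
UPLOAD_SECONDS = 3.0  # stand-in for a long (e.g. 14 GB) S3 upload


class OutputsManager:
    def __init__(self):
        self.queue = asyncio.Queue()  # "wait in queue": pending upload requests
        self._debounce = {}           # path -> pending stale-detector task
        self.uploads_done = 0

    def file_changed(self, path):
        # Called by the file-system watcher for every event on `path`.
        # Each event restarts the 1 s stale-detector timer.
        pending = self._debounce.pop(path, None)
        if pending is not None:
            pending.cancel()
        self._debounce[path] = asyncio.create_task(self._request_upload(path))

    async def _request_upload(self, path):
        await asyncio.sleep(STALE_SECONDS)  # no new events for 1 s => "stale"
        await self.queue.put(path)          # queued even if content is unchanged

    async def upload_worker(self):
        while True:
            path = await self.queue.get()
            await asyncio.sleep(UPLOAD_SECONDS)  # pretend S3 upload of the file
            self.uploads_done += 1
            print(f"uploaded {path} again (total: {self.uploads_done})")


async def solver(manager):
    # iSolve-like behaviour: open/append/close the result file "every now and
    # then"; each touch generates watcher events on the outputs directory.
    for _ in range(5):
        manager.file_changed("outputs/output_1.zip")
        await asyncio.sleep(2.0)  # > 1 s of silence, so every touch enqueues an upload


async def main():
    manager = OutputsManager()
    worker = asyncio.create_task(manager.upload_worker())
    await solver(manager)      # the "user activity" stops here...
    await asyncio.sleep(10.0)  # ...but queued uploads keep draining afterwards
    worker.cancel()
    print("identical uploads of the same file:", manager.uploads_done)


if __name__ == "__main__":
    asyncio.run(main())
```

Each touch arrives more than one second apart, so every one of them passes the stale detector and enqueues a full upload; the queue then keeps draining, re-uploading the identical file, long after the user has stopped working.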

@sanderegg @pcrespov we need a different strategy on how to handle this detection. Users can break it way too easily.

mrnicegyu11 commented 2 years ago

Just to follow up: the file output_1.zip was identical in all uploaded versions (as checked by diff on the unzipped files and md5sum on the zipped files), so this file was not modified/appended/... . Also, the user had stopped working on the study, so there should have been no activity in this particular case.

GitHK commented 2 years ago

What I was trying to say is that the multiple uploads were triggered by writing some data only once to the outputs. Multiple upload requests were queued up and the same content was uploaded over and over again. That's why you have the same checksum.

mguidon commented 2 years ago

The user was running a sim4life simulation with the results folder in the output directory. The solver opens/appends/closes the corresponding output files every now and then, which triggers the above chain of actions. I have already mentioned several times in the support channel that it is a "Best Practice" to keep the results folder in the work directory while iSolve runs. But yes, we need to redesign this behavior.
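
As a minimal sketch of that best practice (paths and names are illustrative, and it assumes the work and outputs directories live on the same filesystem), the solver writes its results under the work directory and the finished archive is moved into the watched outputs directory in a single step:

```python
# Sketch (illustrative paths): write results in a work directory while the
# solver runs, then publish the finished archive with a single rename so the
# watcher sees one event and no data is copied.
import os
from pathlib import Path

WORK_DIR = Path("work/results")         # solver writes/appends here while running
OUTPUTS_DIR = Path("outputs/output_1")  # watched output port directory


def publish_result(archive_name="output_1.zip"):
    src = WORK_DIR / archive_name
    dst = OUTPUTS_DIR / archive_name
    OUTPUTS_DIR.mkdir(parents=True, exist_ok=True)
    # os.replace is an atomic rename on the same filesystem: only the directory
    # entry (inode pointer) changes, no data is copied, and the watcher sees a
    # single event. Across filesystems it raises OSError, in which case a
    # copy-based move (e.g. shutil.move) would be needed.
    os.replace(src, dst)
    return dst


if __name__ == "__main__":
    WORK_DIR.mkdir(parents=True, exist_ok=True)
    (WORK_DIR / "output_1.zip").write_bytes(b"fake results")  # stand-in for solver output
    print(publish_result())
```

A rename within the same filesystem matches the "only change the inode pointer" point above; if the work and outputs directories are on different filesystems, the move degrades to a copy plus delete.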