CDLUC3 / mrt-doc

Documentation and Information regarding the Merritt repository

358G archive file triggered Nagios alerts on 7/6 in the evening. #394

Closed by terrywbrady 4 years ago

terrywbrady commented 4 years ago
dloy commented 4 years ago

The storage problems are caused by presign object requests. 90% of the temp space used by the storage servers was taken up by archive directories that were either in progress or failed. In one case a single object used more than 358G out of 1T. One problem with this process is that any archive directory will use about 2x the size of the resulting zip file, because all components need to be staged before they are incorporated into the resulting container file (e.g. zip).
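To make the 2x staging overhead concrete, here is a rough back-of-the-envelope sketch. It assumes the 358G figure is the size of the resulting zip (as in the issue title) and that 1T means 1024G; the function names and the `staging_factor` parameter are made up for illustration and are not part of the Merritt storage code.

```python
def staging_footprint_gb(archive_gb: float, staging_factor: float = 2.0) -> float:
    """Estimate peak temp space for one archive build.

    staging_factor=2.0 reflects the rough "2x the resulting zip" rule of
    thumb: staged components plus the container file being written.
    """
    return archive_gb * staging_factor


def fits_in_temp(archive_gb: float, temp_capacity_gb: float,
                 in_use_gb: float = 0.0, staging_factor: float = 2.0) -> bool:
    """Return True if a new archive build would fit in the remaining temp space."""
    needed = staging_footprint_gb(archive_gb, staging_factor)
    return in_use_gb + needed <= temp_capacity_gb


# Assumed numbers: a 358G zip on a 1T (1024G) temp volume.
print(staging_footprint_gb(358))                 # 716.0
print(fits_in_temp(358, 1024))                   # True: one build fits
print(fits_in_temp(358, 1024, in_use_gb=716))    # False: a second one does not
```

Under these assumptions a single large request fits, but a second one started while the first is still staged (or left behind as a failed directory) would overrun the volume.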

Eventually, a queuing system for presign object requests would be useful for throttling the process and limiting the number of simultaneous zip operations, which would help prevent temp locking problems.
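As a rough illustration of what such throttling could look like, the sketch below queues archive builds on a fixed-size worker pool so that only a bounded number run at once. This is not the Merritt implementation: `ArchiveQueue`, `max_simultaneous`, and the fake build function are all hypothetical, and a production queue would presumably need to persist requests across restarts.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ArchiveQueue:
    """Queue archive (zip) builds and run at most `max_simultaneous` at once."""

    def __init__(self, max_simultaneous: int = 2):
        # The pool size is the throttle: extra requests wait in the pool's queue.
        self._pool = ThreadPoolExecutor(max_workers=max_simultaneous)

    def submit(self, build_archive, object_id):
        """Enqueue one build; returns a Future for its result."""
        return self._pool.submit(build_archive, object_id)

    def shutdown(self):
        self._pool.shutdown(wait=True)


def demo(requests: int = 6, max_simultaneous: int = 2):
    """Submit several fake builds and record the peak number running together."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def fake_build(object_id):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)  # stand-in for staging and zipping
        with lock:
            running -= 1
        return object_id

    queue = ArchiveQueue(max_simultaneous)
    futures = [queue.submit(fake_build, f"obj-{i}") for i in range(requests)]
    results = [f.result() for f in futures]
    queue.shutdown()
    return results, peak


if __name__ == "__main__":
    results, peak = demo()
    print(results, "peak simultaneous builds:", peak)
```

The peak never exceeds `max_simultaneous`, so temp usage is bounded by that number times the largest staging footprint.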

As a short-term solution, I'm proposing that there be 4 storage servers in production: 2 would be dedicated specifically to ingest, and 2 would be dedicated to archive creation by the UI (either presign or direct).

I view ingest storage handling as having a higher priority than archive creation. This split would guarantee that archive handling does not interfere with ingest processing.
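The proposed split could be sketched as a simple role-based router. The host names and request kinds below are placeholders invented for the example, not actual Merritt server names or API values; the point is only that ingest traffic and archive traffic never share a server.

```python
from itertools import cycle

# Hypothetical host names: 2 servers per role, as proposed above.
POOLS = {
    "ingest": ["store-ingest-1", "store-ingest-2"],
    "archive": ["store-archive-1", "store-archive-2"],
}

# Hypothetical request kinds mapped to the pool that should serve them.
ROLE_FOR_KIND = {
    "ingest": "ingest",
    "presign": "archive",  # presigned archive creation from the UI
    "direct": "archive",   # direct archive creation from the UI
}


class StorageRouter:
    """Round-robin within a pool, never across pools."""

    def __init__(self, pools):
        self._cycles = {role: cycle(hosts) for role, hosts in pools.items()}

    def route(self, request_kind: str) -> str:
        try:
            role = ROLE_FOR_KIND[request_kind]
        except KeyError:
            raise ValueError(f"unknown request kind: {request_kind!r}")
        return next(self._cycles[role])


router = StorageRouter(POOLS)
print(router.route("ingest"))   # store-ingest-1
print(router.route("presign"))  # store-archive-1
print(router.route("direct"))   # store-archive-2
print(router.route("ingest"))   # store-ingest-2
```

Because the pools are disjoint, a runaway archive request can fill temp space only on the archive servers.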

terrywbrady commented 4 years ago

@elopatin-uc3, I think we can close this.

elopatin-uc3 commented 4 years ago

Yes, agreed, @terrywbrady; closing.