daitss / core

DAITSS: Dark Archive In The Sunshine State
GNU General Public License v3.0

Storage error for large package #782

Closed szanati closed 7 years ago

szanati commented 8 years ago

I received the following error for a large package, AA00040592_00001 (EK17K3IRR_9EYDQW) that is 98GB and only has 437 files:

2016-06-08 14:31:53 -0400

bad status http://storage-master.fda.fcla.edu:70/packages/EK17K3IRR_9EYDQW.002: 500 500 Internal Service Error - Store of package EK17K3IRR_9EYDQW.002 to http://silos.darchive.fcla.edu:70/create/EK17K3IRR_9EYDQW.002 failed with status 500 - Internal Server Error; response from server was:

Internal Server Error

The server encountered an internal error or misconfiguration and was unable to complete your request.

Please contact the server administrator, root@localhost and inform them of the time the error occurred, and anything you might have done that may have caused the error.

More information about this error may be available in the server error log.

Apache Server at silos.darchive.fcla.edu Port 70

Before it errored out, Xymon raised an alert for /var/daitss/tmp: /var/daitss/tmp (90% used) has reached the PANIC level (90%)

The storage log from yesterday:

2016 Jun 8 11:23:44 fclnx30 StorageMaster[32034]: INFO storage-master.fda.fcla.edu: Request Received: 104964638720 192.168.36.60 - - "PUT /packages/EK17K3IRR_9EYDQW.002 HTTP/1.1"

2016 Jun 8 14:10:34 fclnx30 SiloPool[32110]: INFO silos.darchive.fcla.edu: Request Received: 104964638720 192.168.36.60 - - "POST /create/EK17K3IRR_9EYDQW.002 HTTP/1.1"

2016 Jun 8 14:29:47 fclnx30 StorageMaster[32034]: ERROR storage-master.fda.fcla.edu: 500 Internal Service Error - Store of package EK17K3IRR_9EYDQW.002 to http://silos.darchive.fcla.edu:70/create/EK17K3IRR_9EYDQW.002 failed with status 500 - Internal Server Error; response from server was: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">#012#012500 Internal Server Error#012#012Internal Server Error#012The server encountered an internal error or#012misconfiguration and was unable to complete#012your request.#012Please contact the server administrator,#012 root@localhost and inform them of the time the error occurred,#012and anything you might have done that may have#012caused the error.#012More information about this error may be available#012in the server error log.#012#012Apache Server at silos.darchive.fcla.edu Port 70#012. 192.168.36.60 - - "PUT /packages/EK17K3IRR_9EYDQW.002 HTTP/1.1"

2016 Jun 8 14:29:47 fclnx30 StorageMaster[32034]: INFO storage-master.fda.fcla.edu: Rack: 192.168.36.60 - - [08/Jun/2016 14:29:47] "PUT /packages/EK17K3IRR_9EYDQW.002 " 500 814 11163.7796

I reset it and it ran again, but it errored out with the same message as above. It looks like the same problem as GitHub issue https://github.com/daitss/core/issues/639 from 2012. In that case the package was 200GB in size.

I wonder if even a 98GB package is too big. We limit packages to 100GB, and we usually do not get many near that size. I have kept the original package in case I have to resubmit it for some reason.

cchou commented 8 years ago

How much space did you have on /var/daitss/tmp before you started this package? DAITSS storage tends to use a lot of tmp space because of a design flaw. Even though the designer indicated that storage needs 2 times the package size plus headroom for temp space during package processing, in our experience we were not able to archive a 200GB package with over 600GB of temp space available. So storage probably uses more than 3 times the package size for temp space.
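As a rough illustration of the arithmetic above, here is a minimal Python sketch that estimates the temp space a package would need under an assumed multiplier and compares it with the free space available. The function names are hypothetical, the default multiplier of 3 only reflects the experience described in this comment (it is not a documented DAITSS figure), and the byte count 104964638720 is taken from the request lines in the storage log on the assumption that it is the package size.

```python
import shutil


def required_tmp_bytes(package_bytes: int, multiplier: float = 3.0) -> int:
    """Estimate the temp space needed to store a package.

    The multiplier is an assumption: the designer's figure was 2x plus
    headroom, while experience in this thread suggests more than 3x.
    """
    return int(package_bytes * multiplier)


def has_enough_tmp(package_bytes: int, free_bytes: int,
                   multiplier: float = 3.0) -> bool:
    """Return True if free_bytes covers the estimated temp requirement."""
    return free_bytes >= required_tmp_bytes(package_bytes, multiplier)


def free_bytes_on(path: str) -> int:
    """Free bytes on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # Byte count from the storage log request lines (assumed package size).
    package = 104964638720
    # Use "." so the sketch runs anywhere; on the archive host this would
    # be /var/daitss/tmp.
    free = free_bytes_on(".")
    needed = required_tmp_bytes(package)
    print(f"needed: {needed} bytes, free: {free} bytes, "
          f"enough: {has_enough_tmp(package, free)}")
```

A check like this before starting a large package would only be advisory, since other packages being processed at the same time also consume temp space.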

szanati commented 8 years ago

According to Xymon, around the time the above package errored, /var/daitss/tmp was at 90%. Used: 727634640 Available: 636951656 Capacity: 75905040.

cchou commented 8 years ago

I am not sure I understand those numbers. It can't be 727GB used and 636GB available with 75GB capacity.

I just looked at darchive: /var/daitss/tmp has a total of 694GB. With 60% used, there are 273GB available.

Size Used Avail Use% Mounted on
694G 408G 273G 60% /var/daitss/tmp

We need to know how much available space /var/daitss/tmp had before the problem package started. I would suggest at least 400GB available; more is better.

szanati commented 8 years ago

Maybe the numbers in Xymon are off, but here is what Xymon was reporting for that date:

Wed Jun 8 14:12:39 EDT 2016 - Filesystems NOT ok

red /var/daitss/tmp (90% used) has reached the PANIC level (90%)

Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/mapper/VolGroup00-LogVol01 2031440 873272 1053312 46% /
/dev/mapper/VolGroup00-LogVol06 8125880 4664200 3042304 61% /var
/dev/mapper/VolGroup00-LogVol07 10157368 7439396 2193788 78% /home
/dev/mapper/VolGroup00-LogVol05 2031440 312828 1613756 17% /tmp
/dev/mapper/VolGroup00-LogVol02 4062912 2073016 1780184 54% /usr
/dev/mapper/VolGroup00-LogVol03 4062912 277236 3575988 8% /usr/local
/dev/mapper/VolGroup00-LogVol04 9141624 5601256 3071852 65% /opt
/dev/sda1 101086 28980 66887 31% /boot
/dev/mapper/daitss--vg16-var--log--daitss--lv1 33023856 7690996 24661856 24% /var/log/daitss
/dev/mapper/var--lib--mysql--vg0-var--lib--mysql--lv0 206420664 158480556 43745888 79% /var/lib/mysql
/dev/mapper/daitss--vg14-mysql--backup--lv1 206420664 157955460 44270984 79% /var/lib/mysql_backup
/dev/mapper/var--daitss--vg2-var--daitss--lv2 8256719876 4619509564 3553371028 57% /var/daitss
/dev/mapper/vg_daitss_tmp-lv_daitss_tmp 727634640 636951656 75905040 90% /var/daitss/tmp
/dev/mapper/var_lib_pgsql_data_vg-var_lib_pgsql_data_lv 355038180 312621756 24384144 93% /var/lib/pgsql/data
/dev/mapper/var_lib_pgsql_data_vg-var_lib_pgsql_backups_lv 344716968 77392896 249815680 24% /var/lib/pgsql/backups
/dev/mapper/silo1_vg-silo1_lv 2064180384 2043209260 260 100% /daitssfs/153
/dev/mapper/silo2_vg-silo2_lv 2064180384 973171232 1070038288 48% /daitssfs/154
/dev/mapper/fixity0_vg-fixity0_lv 2113722288 2043207492 49040616 98% /daitssfs/fixity/0
/dev/mapper/silo0_vg-silo0_lv 2064180384 203152 2043006368 1% /daitssfs/155

cchou commented 8 years ago

That's OK, the Xymon display can be confusing to read. FYI, the previous Xymon message indicates that /var/daitss/tmp is 90% full with around 75GB available (727634640 636951656 75905040 90% /var/daitss/tmp). The Storage Service saves at least one copy of the package into /var/daitss/tmp before storing it to disk/tape, so if there is only 75GB available in /var/daitss/tmp, DAITSS won't be able to store a 100GB package.
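To make the column reading explicit, here is a small Python sketch that parses the /var/daitss/tmp row quoted above. It assumes the row follows the POSIX `df -P` column order shown in the Xymon report (1024-blocks, Used, Available, Capacity, Mounted on) and that the mount point contains no spaces; `DfRow`, `parse_df_row`, and `kib_to_gib` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class DfRow(NamedTuple):
    total_kib: int      # "1024-blocks" column
    used_kib: int       # "Used" column
    available_kib: int  # "Available" column
    use_percent: int    # "Capacity" column, without the % sign
    mount: str          # "Mounted on" column


def parse_df_row(row: str) -> DfRow:
    """Parse the last five whitespace-separated fields of a df -P style row.

    Taking the fields from the end lets the row include or omit the
    leading device name. Mount points containing spaces are not handled.
    """
    total, used, avail, pct, mount = row.split()[-5:]
    return DfRow(int(total), int(used), int(avail),
                 int(pct.rstrip("%")), mount)


def kib_to_gib(kib: int) -> float:
    """Convert a count of 1024-byte blocks to GiB."""
    return kib / 1024 ** 2


if __name__ == "__main__":
    row = parse_df_row(
        "/dev/mapper/vg_daitss_tmp-lv_daitss_tmp "
        "727634640 636951656 75905040 90% /var/daitss/tmp"
    )
    print(f"{row.mount}: total {kib_to_gib(row.total_kib):.0f} GiB, "
          f"available {kib_to_gib(row.available_kib):.1f} GiB, "
          f"{row.use_percent}% used")
```

Read this way, the first number is the filesystem size (about 694 GiB, matching the df output earlier in the thread), the second is the space used, and the third is the space available, which is less than the size of the package itself.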

I would suggest stopping pulse, cleaning up /var/daitss/tmp, and making sure that no ghost files are taking up space and that at least 400GB is available in /var/daitss/tmp. Then start that package again and see if we can get it ingested.

szanati commented 7 years ago

I was finally able to archive this package. It helped that Carol cleaned up /var/daitss/tmp. I will now close this case.