daitss / core

DAITSS: Dark Archive In The Sunshine State
GNU General Public License v3.0
9 stars 2 forks source link

submit command results in memory errors and rejects with <EOFError: end of file reached> invalid SIP descriptor errors and defunct processes #777

Open lydiam opened 8 years ago

lydiam commented 8 years ago

I submitted the following command on fclnx30:

   sudo -b -u daitss submit --username lydiam --password fclad00d2 --source FTPDL --path /var/daitss/ops/incoming/ftpdl/FTPDL_20151212T090803 --batch FTPDL_20151212T090803

The result was that 6-8 of about 144 SIPs submitted successfully and the remainder rejected with the following error:

   sip path: /var/daitss/ops/incoming/ftpdl/FTPDL_20151212T090803/AA00036241_00001; 
   Invalid SIP descriptor. XML validation errors:
   line:unknown: msg:#<EOFError: end of file reached>

The submit log contained the "standard" log messages, with the majority of packages being rejected and some being submitted.:

   2015-12-14 12:57:47 INFO 2015-12-14 12:57:47 -0500 -- AA00036194_00001 -- rejected: E14HY8Q8F_RM36BJ
   2015-12-14 12:57:50 INFO 2015-12-14 12:57:50 -0500 -- AA00036192_00001 -- submitted successfully: EIQJ0H9KI_IKIHUJ
   2015-12-14 12:57:50 INFO 2015-12-14 12:57:50 -0500 -- AA00037821_00001 -- rejected: E30TCY4OV_ONW141

At the same time I go memory errors in STDOUT and logs named hs_err_pid*.log, with the following contents: [lydiam@fclnx30 ftpdl]$ more hs_err_pid15104.log #

There is insufficient memory for the Java Runtime Environment to continue.

   # Native memory allocation (malloc) failed to allocate 32744 bytes for ChunkPool::allocate
   # Possible reasons:
   #   The system is out of physical RAM or swap space
   #   In 32 bit mode, the process size limit was hit
   # Possible solutions:
   #   Reduce memory load on the system
   #   Increase physical memory or swap space
   #   Check if swap backing store is full
   #   Use 64 bit Java on a 64 bit OS
   #   Decrease Java heap size (-Xmx/-Xms)
   #   Decrease number of Java threads
   #   Decrease Java thread stack sizes (-Xss)
   #   Set larger code cache with -XX:ReservedCodeCacheSize=
   # This output file may be truncated or incomplete.
   #
   #  Out of Memory Error (allocation.cpp:214), pid=15104, tid=47784611424576
   #
   # JRE version: OpenJDK Runtime Environment (7.0_85-b01) (build 1.7.0_85-mockbuild_2015_07_13_18_00-b00)
  # Java VM: OpenJDK 64-Bit Server VM (24.85-b03 mixed mode linux-amd64 compressed oops)
  # Derivative: IcedTea 2.6.1
 # Distribution: Built on Red Hat Enterprise Linux Server release 5.11 (Tikanga) (Mon Jul 13 18:00:16 EDT 2015)
  # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

--------------- T H R E A D ---------------

Current thread (0x000000001a667000): JavaThread "C2 CompilerThread0" daemon [_thread_in_native, id=15130, stack(0x00002b75b8ba2000,0x00002b75b8ca3000)]

Stack: [0x00002b75b8ba2000,0x00002b75b8ca3000], sp=0x00002b75b8c9d440, free space=1005k Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code) V [libjvm.so+0x9e3ece] V [libjvm.so+0x9e45fb] V [libjvm.so+0x4dcf85] V [libjvm.so+0x2fdea0] ....

These logs can be found in: /var/daitss/ops/incoming/ftpdl

There are also lots of defunct processes:

daitss 27706 27690 0 12:53 pts/0 00:00:00 [ruby] daitss 27727 27690 0 12:53 pts/0 00:00:00 [ruby] daitss 27745 27690 0 12:53 pts/0 00:00:00 [ruby] daitss 27766 27690 0 12:53 pts/0 00:00:00 [ruby] daitss 27787 27690 0 12:53 pts/0 00:00:00 [ruby] daitss 27809 27690 0 12:53 pts/0 00:00:00 [ruby]

And other daitss processes from Dec. 13:

daitss 23705 1 0 Dec13 ? 00:03:44 Rack: /opt/web-services/sites/core/current
daitss 23713 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 23719 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 23725 1 0 Dec13 ? 00:00:21 Rack: /opt/web-services/sites/core/current
daitss 23736 1 0 Dec13 ? 00:00:49 Rack: /opt/web-services/sites/actionplan/current
daitss 23743 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 23749 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 23755 1 0 Dec13 ? 00:00:17 Rack: /opt/web-services/sites/actionplan/current
daitss 23763 1 0 Dec13 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 23773 1 0 Dec13 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 23787 1 0 Dec13 ? 00:00:00 Rack: /opt/web-services/sites/transform/current

I will kill the defunct processes and attempt to restart DAITSS.

Gerald suggested attempting submission of one package at a time.

lydiam commented 8 years ago

After stopping and restarting DAITSS, ruby is still spawning all sorts of processes:

[lydiam@fclnx30 submit-direct]$ ps -ef | grep daitss postgres 5364 9871 96 04:22 ? 10:23:00 postgres: daitss2 daitss_db 127.0.0.1(60080) SELECT
root 6159 24970 0 14:29 pts/3 00:00:00 sudo su - daitss root 6163 6159 0 14:30 pts/3 00:00:00 su - daitss daitss 6164 6163 0 14:30 pts/3 00:00:00 -bash daitss 14663 1 0 15:00 ? 00:00:01 thin server (0.0.0.0:7002) [describe]
daitss 14675 1 0 15:00 ? 00:00:01 thin server (0.0.0.0:7003) [describe]
daitss 14681 1 4 15:00 ? 00:00:15 ruby /opt/web-services/sites/core/current/bin/pulse daitss 14686 1 0 15:00 ? 00:00:03 ruby ./bin/mailer-daemon postgres 14752 9871 0 15:00 ? 00:00:00 postgres: daitss2 daitss_db 127.0.0.1(56442) idle
daitss 14753 14681 7 15:00 ? 00:00:27 E87C3IH70_T30OFO.ingest
postgres 14762 9871 2 15:01 ? 00:00:08 postgres: daitss2 daitss_db 127.0.0.1(56444) idle
daitss 14770 14640 0 15:01 ? 00:00:00 Passenger ApplicationSpawner: /opt/web-services/sites/viruscheck/current
daitss 14779 1 18 15:01 ? 00:01:01 Rack: /opt/web-services/sites/viruscheck/current
daitss 14786 1 0 15:01 ? 00:00:00 Rack: /opt/web-services/sites/viruscheck/current
daitss 14792 1 0 15:01 ? 00:00:00 Rack: /opt/web-services/sites/viruscheck/current
daitss 14798 1 0 15:01 ? 00:00:00 Rack: /opt/web-services/sites/viruscheck/current
daitss 14895 14640 1 15:01 ? 00:00:04 Passenger ApplicationSpawner: /opt/web-services/sites/core/current
daitss 14908 1 2 15:01 ? 00:00:07 Rack: /opt/web-services/sites/core/current
daitss 14916 1 0 15:01 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 14925 1 0 15:01 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 14931 1 0 15:01 ? 00:00:00 Rack: /opt/web-services/sites/core/current
daitss 15871 14640 0 15:02 ? 00:00:01 Passenger ApplicationSpawner: /opt/web-services/sites/actionplan/current
daitss 16243 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 16250 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 16258 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 16265 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/actionplan/current
daitss 16273 1 1 15:02 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 16441 1 0 15:02 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 16460 14640 0 15:02 ? 00:00:00 Passenger ApplicationSpawner: /opt/web-services/sites/transform/current
daitss 16466 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/transform/current
daitss 16473 1 0 15:02 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 16485 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/transform/current
daitss 16491 1 0 15:02 ? 00:00:02 Rack: /opt/web-services/sites/silos/current
daitss 16501 1 0 15:02 ? 00:00:02 Rack: /opt/web-services/sites/silos/current
daitss 16520 1 0 15:02 ? 00:00:02 Rack: /opt/web-services/sites/storage-master/current
daitss 16528 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/transform/current
daitss 16531 14640 0 15:02 ? 00:00:00 Passenger ApplicationSpawner: /opt/web-services/sites/xmlresolution/current
daitss 16541 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/xmlresolution/current
daitss 16549 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/xmlresolution/current
daitss 16558 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/transform/current
daitss 16564 1 0 15:02 ? 00:00:02 Rack: /opt/web-services/sites/silos/current
daitss 16592 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/xmlresolution/current
daitss 16598 1 1 15:02 ? 00:00:02 Rack: /opt/web-services/sites/silos/current
daitss 16629 1 0 15:02 ? 00:00:00 Rack: /opt/web-services/sites/xmlresolution/current
lydiam 18520 21654 0 15:06 pts/0 00:00:00 grep daitss daitss 32406 10421 0 04:04 ? 00:00:00 crond daitss 32407 32406 0 04:04 ? 00:00:00 /bin/bash /usr/bin/run-parts /opt/fda/etc/daitss.daily daitss 32411 32406 0 04:04 ? 00:00:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t daitss 32412 32411 0 04:04 ? 00:00:00 /usr/sbin/postdrop -r daitss 32416 32407 0 04:04 ? 00:00:00 /bin/sh /opt/fda/etc/daitss.daily/run-collect-fixities daitss 32417 32407 0 04:04 ? 00:00:00 awk -v progname=/opt/fda/etc/daitss.daily/run-collect-fixities progname {????? print progname ":\n"????? progname="";???? }???? { print; } daitss 32418 32416 4 04:04 ? 00:31:51 collect-fixities
[lydiam@fclnx30 submit-direct]$

Ingest seems to be processing. Once all submitted packages are ingested I will attempt to submit one package at a time.

lydiam commented 8 years ago

Now the AIP descriptor validation failed with the error we were getting on Friday:

descriptor fails daitss aip xml validation (1 errors) unknown,: #

I reset the package and started manually. That worked on Friday.

Failed again with the same error. I will restart daitss again.

I restarted daitss and stopped pulse.

Restarted another package manually and it appears to be succeeding. Archived E30MZ6Y4F_63VOVM . Another package archived.

Other packages failed repeatedly at aip descriptor validation.

lydiam commented 8 years ago

Gerald suggested that collect-fixities be killed. I did that and issued a stop-cron command: sudo -u daitss touch /opt/fda/etc/stop-cron.

I stopped and started daitss again, and the rack processes still show up.

lydiam commented 8 years ago

Packages are failing repeatedly with multiple errors such as:

2015-12-14 15:55:14 -0500 message

descriptor fails daitss aip xml validation (1 errors) unknown,: #

trace

/opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/template/descriptor.rb:49:in validate_aip_descriptor' /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/wip/ingest.rb:41:inblock in ingest' /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/wip/journal.rb:16:in step' /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/wip/ingest.rb:41:iningest' /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/wip/process.rb:82:in block in spawn' /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/wip/process.rb:66:infork' /opt/web-services/sites/core/releases/20141117161651/lib/daitss/proc/wip/process.rb:66:in spawn' /opt/web-services/sites/core/releases/20141117161651/app.rb:949:inblock in ' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1602:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1602:inblock in compile!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:966:in []' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:966:inblock (3 levels) in route!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:985:in route_eval' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:966:inblock (2 levels) in route!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1006:in block in process_route' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1004:incatch' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1004:in process_route' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:964:inblock in route!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:963:in each' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:963:inroute!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1076:in block in dispatch!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:inblock in invoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:in catch' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:ininvoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1073:in dispatch!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:898:inblock in call!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:in block in invoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:incatch' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:in invoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:898:incall!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:886:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/commonlogger.rb:33:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:217:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:945:inforward' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1019:in route_missing' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:980:inroute!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:976:in route!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1076:inblock in dispatch!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:in block in invoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:incatch' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:in invoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1073:indispatch!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:898:in block in call!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:inblock in invoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:in catch' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1058:ininvoke' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:898:in call!' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:886:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/xss_header.rb:18:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/path_traversal.rb:16:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/json_csrf.rb:18:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/base.rb:49:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/base.rb:49:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/frame_options.rb:31:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/nulllogger.rb:9:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/head.rb:11:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:180:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:2014:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/xss_header.rb:18:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/base.rb:49:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/base.rb:49:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/path_traversal.rb:16:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/json_csrf.rb:18:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/base.rb:49:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/base.rb:49:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-protection-1.5.3/lib/rack/protection/frame_options.rb:31:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/session/abstract/id.rb:225:in context' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/session/abstract/id.rb:220:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/nulllogger.rb:9:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/head.rb:11:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/rack-1.5.2/lib/rack/methodoverride.rb:21:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:180:incall' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:2014:in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1478:inblock in call' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1788:in synchronize' /opt/web-services/sites/core/shared/bundle/ruby/1.9.1/gems/sinatra-1.4.5/lib/sinatra/base.rb:1478:incall' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/rack/request_handler.rb:97:in process_request' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_request_handler.rb:520:inaccept_and_process_next_request' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_request_handler.rb:274:in main_loop' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/rack/application_spawner.rb:206:instart_request_handler' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/rack/application_spawner.rb:171:in block in handle_spawn_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/utils.rb:470:insafe_fork' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/rack/application_spawner.rb:166:in handle_spawn_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server.rb:357:inserver_main_loop' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server.rb:206:in start_synchronously' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server.rb:180:instart' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/rack/application_spawner.rb:129:in start' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/spawn_manager.rb:253:inblock (2 levels) in spawn_rack_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server_collection.rb:132:in lookup_or_add' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/spawn_manager.rb:246:inblock in spawn_rack_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server_collection.rb:82:in block in synchronize' :10:insynchronize' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server_collection.rb:79:in synchronize' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/spawn_manager.rb:244:inspawn_rack_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/spawn_manager.rb:137:in spawn_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/spawn_manager.rb:275:inhandle_spawn_application' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server.rb:357:in server_main_loop' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/lib/phusion_passenger/abstract_server.rb:206:instart_synchronously' /opt/ruby-1.9.3-p545/lib/ruby/gems/1.9.1/gems/passenger-3.0.21/helper-scripts/passenger-spawn-server:99:in ` ' Reset and restarted another package and it seems to be working past the aip descriptor validation point.

lydiam commented 8 years ago

Since packages aren't ingesting I tried to submit SIPs one at a time and got the java runtime error and a reject message: [lydiam@fclnx30 FTPDL_20151212T090803]$ sudo -b -u daitss submit --username lydiam --password fclad00d2 --source FTPDL --package /var/daitss/ops/incoming/ftpdl/FTPDL_20151212T090803/AA00025584_00046 --batch FTPDL_20151212T090803 [lydiam@fclnx30 FTPDL_20151212T090803]$ #

There is insufficient memory for the Java Runtime Environment to continue.

   # pthread_getattr_np
   # An error report file with more information is saved as:
   # /var/daitss/ops/incoming/ftpdl/FTPDL_20151212T090803/hs_err_pid6107.log
jonpitts commented 8 years ago

It has been a while but I seem to remember we setup daitss to spawn multiple occurrences of specific web services to help with load. Such as more than one description service.

Some of the other processes listed above are spawned by daitss such as a mailer daemon and connection to the database. Also, anytime a user logs into daitss Apache will spawn a core process for that user. Not saying that there isn't something wrong with the number of daitss processes but its too hard for me to tell.

I would focus on the JVM error. One of two things must have occured for that error to generate. The system either ran out of Swap memory having gone through the other forms of memory first. Or the JVM settings need to be tweaked in the daitss config file.

Daitss relies on the JRE(Java Runtime Environment) for ruby gem RJB https://rubygems.org/gems/rjb/versions/1.5.4 to do xml_validation.

We use it here: https://github.com/daitss/core/blob/master/lib/daitss/proc/xmlvalidation.rb#L54

In the daitss config you will need to look at this line: https://github.com/daitss/core/blob/master/daitss-config.example.yml#L78

Xms2G - the starting heap size is 2GB Xmx2G - the max heap size is 2GB

Basically, every JVM spawned will take up 2GB of memory regardless of what is needed. You may want to decrease the starting heap size to Xms256m and increase the max heap size to Xmx4G. This could help with memory allocation. It will allow a single JVM process to grow larger than 2GB if needed but keep other JVM processes smaller. Keep in mind this is specific to the use of the JRE in daitss and not other processes on the daitss server using a JVM.

jonpitts commented 8 years ago

Something to think about - if there are other memory errors not relating to the JRE that can indicate a host of other potential problems. You will need a baseline. Stop daitss and look at the memory usage. Does it look normal?

Check swap space also: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/5/html/Tuning_and_Optimizing_Red_Hat_Enterprise_Linux_for_Oracle_9i_and_10g_Databases/sect-Oracle_9i_and_10g_Tuning_Guide-Swap_Space-Checking_Swap_Space_Size_and_Usage.html

Fire things up and see what memory looks like. It may be a good idea to disable GUI submission or better yet disable the UI completely. Submit packages from the command line and see what's going on with memory.

lydiam commented 8 years ago

Current settings in /opt/web-services/conf.d/daitss-config.ymljjj

jvm_options:

describe.fda.fcla.edu: log_syslog_facility: LOG_LOCAL0 max_pdf_bitstreams: 100 pdfa_validator: /opt/pdfapilot/pdfaPilot -a $INPUT_FILE$ --noprogress --nohits --level=$LEVEL$ --report=XML,ALWAYS,PATH=$REPORT_FILE$ jvm_options:

(no Xms setting at all)

lydiam commented 8 years ago

NWRDC has just rebooted fclxn30, and daitss processes and pulse started up. Pulse started ingest on

 daitss   10884 10398 33 09:04 ?        00:05:15 E38C3ZIKV_YDQ7FX.ingest

Thats a 7GB+ package, so it will take some time to process. I've stopped pulse. I will not attempt submission on any new packages right now. There are no new packages ftp-ed to eclipsep yesterday.

Actual memory use is 6% and 0% swap, with physical 44.9%

I'm guessing that it's just trying to store the package, because all other steps were completed yesterday:

make aip descriptor 2015-12-14T14:03:49-05:00 112.63
validate aip descriptor 2015-12-14T14:05:41-05:00 007.29
make tarball 2015-12-14T14:05:49-05:00 070.93

I'm seeing some Passenger processes that are unfamiliar to me:

 root     10322 10307  0 09:04 ?        00:00:00 PassengerWatchdog
 root     10325 10322  1 09:04 ?        00:00:20 PassengerHelperAgent
 root     10327 10325  0 09:04 ?        00:00:00 Passenger spawn server

This package errored out when DAITSS was shut down while it was storing.

jonpitts commented 8 years ago

Xms setting is optional. Leaving it off will start the JVM with a heap size of 0. This is okay as long as there is memory to be allocated when the JVM needs it. It will only allocate up to the Xmx setting.

I would run some of the packages that failed before and monitor the memory. If they succeed then its not really a problem with the JVM. It propbably means the JVM cannot allocate memory because there is none to be had. I'd be willing to bet fclxn30 needs more memory. If that is not an option I would look at increasing the swap size to compensate. It still may be a good idea to disable the GUI for a time.

lydiam commented 8 years ago

There doesn't appear to be any current option to stop only the daitss GUI, so we'd require a developer to provide that option.

jonpitts commented 8 years ago

There is an option for the operator to disable GUI submission from the UI. Affiliates may still try to download/ disseminate packages.

A sysadmin should be able to setup a type of redirect for anyone outside of your organization. That would keep people from being able to connect to the GUI.

lydiam commented 8 years ago

All packages currently in the workspace were manually restarted and successfully processed.

lydiam commented 8 years ago

Retried submission of a single package that had failed yesterday.

Failed again, with the same symptoms:

   [lydiam@fclnx30 UF00028293_00564]$ sudo -b -u daitss submit --username lydiam --password fclad00d2 --source FTPDL --package /var/daitss/ops/incoming/ftpdl/FTPDL_20151212T090803/AA00025584_00046
   [sudo] password for lydiam: 
   [lydiam@fclnx30 UF00028293_00564]$ #
   # There is insufficient memory for the Java Runtime Environment to continue.
   # Native memory allocation (malloc) failed to allocate 16 bytes for AllocateHeap
   # An error report file with more information is saved as:
   # /var/daitss/ops/incoming/ftpdl/FTPDL_20151212T090803/UF00028293_00564/hs_err_pid6121.log

Xymon reports memory usage at 6% actual.

Attempted to submit one of the FDA_benchmarks packages with the same results:

   [lydiam@fclnx30 ETDs]$ sudo -b -u daitss submit --username lydiam --password fclad00d2 --source benchmarks --package /var/daitss/ops/FDA_benchmarks/ETDs/FDAE001612

[lydiam@fclnx30 ETDs]$ #

There is insufficient memory for the Java Runtime Environment to continue.

   # pthread_getattr_np
   # An error report file with more information is saved as:
   # /var/daitss/ops/FDA_benchmarks/ETDs/hs_err_pid7425.log
lydiam commented 8 years ago

GUI submission for the above benchmark package succeeded: E4P2B1PWE_5W0LH1

lydiam commented 8 years ago

Downloading SIPs and submitting them via the GUI, where possible, and starting them manually.
Package AA00025584_00046, rejected 3 times via command-line submit, was successfully manually submitted via the GUI and successfully archived after manually starting the package.

There appears to be some problem with the command-line submit and pulse processes. What's the difference between a package started manually and one started by pulse?

lydiam commented 8 years ago

Bill K. experimented with submit, submitting as Stephen's user and starting pulse as Stephen's user and also removed the memory limit from my user. I resubmitted the batch of records and they appear to have submitted successfully and are now being successfully processed.

So my biggest problem appears to have been the memory limit on my user. Bill thinks that there are multiple problems, including a possibly memory leak. More details to follow.

lydiam commented 8 years ago

Bill's assessment of the problem:

We actually have 3 different issues. One is the memory problems upon submission. I think we have resolved this memory issue. There was a parameter in Lydia’s .bash_profile that was limiting her memory usage. This seems to have resolved this issue, but want to wait a little while longer before declaring victory.

The second issue, the EOF error, has been an intermittent problem for a while now according to Lydia. There are no filesystem, CPU or memory constraints on the server during the time the error occurs. It does not seem to depend on who submitted the package to the app, who started the app, or how long the server has been up.

The third is another memory issue, in which all of the server memory is consumed by the daitss processes. This looks to be a memory leak in the application.

I do not think the second and third problems are hardware or OS related. I have exhausted my limited Ruby skills looking at the problem.

lydiam commented 8 years ago

I’ve mostly cleaned up the mess created from multiple submissions/rejects of UF records and submitted a batch of FIU records that were FTP-ed on Friday and am getting the same submit memory error again. Only 2 packages submitted successfully, and 18 got the same error:

sip path: /var/daitss/ops/incoming/nonUF_UFAD/FIU/FI15042500; Invalid SIP descriptor. XML validation errors: line:unknown: msg:#

Actual memory is at 21% and physical at 79%

[lydiam@fclnx30 ops]$ sudo -b -u daitss submit --username lydiam --password fclad00d2 --source FTPDL --path /var/daitss/ops/incoming/nonUF_UFAD/FIU --batch FTPDL_20151212T090803_FIU [sudo] password for lydiam: [lydiam@fclnx30 ops]$ #

There is insufficient memory for the Java Runtime Environment to continue.

 # Native memory allocation (malloc) failed to allocate 16 bytes for AllocateHeap
 # An error report file with more information is saved as:
 # /var/daitss/ops/hs_err_pid25083.log
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # pthread_getattr_np
 # An error report file with more information is saved as:
 # /var/daitss/ops/hs_err_pid25106.log
 #
 # There is insufficient memory for the Java Runtime Environment to continue.
 # Native memory allocation (malloc) failed to allocate 41 bytes for AllocateHeap
 # An error report file with more information is saved as:
 # /var/daitss/ops/hs_err_pid25127.log
jonpitts commented 8 years ago

Assuming the same packages fail again and again with the same Java error and it doesn't appear the physical memory capped out maybe its time to try increasing the JVM memory setting.

Instead of -Xmx4096m try -Xmx8G. That will allow the JVM to grab more memory if needed. If it has no impact then switch it back.

Also, is there something about these packages that are in common? Stephen seemed to indicate one of the affiliates is sending large/ unusual packages. Are there a lot of .mov files in these packages?

lydiam commented 8 years ago

Retried the submit command with a new session. I likely used an old session to do the submission, one that didn't register the memory changes that Bill K. made for my user.

jonpitts commented 8 years ago

It may be a worthwhile attempt to find out what user Apache runs as and determine if it has any of those same memory restrictions. The user will typically be "apache" or "nobody" and is usually configured in an httpd.conf file. It could even be configured to run as the "daitss" user.

cchou commented 8 years ago

I remember the EOF error is what the xml parser returns whenever the parser has a problem parsing an XML file. We used get this error with those invalid characters in the descriptors. The work done in submit has fixed most of the EOF, however, there could be some additional scenarios which can cause XML parsing to fail.