aodn / data-services

Scripts which are used to process incoming data in the data ingestion pipeline
GNU General Public License v3.0

Talend processing hangs #450

Closed: julian1 closed this issue 4 years ago

julian1 commented 8 years ago

10-aws-syd

root@10-aws-syd:~$ ps aux  | grep talend | grep ANMN
root     31039  0.0  0.0  50428  1848 pts/1    T    10:55   0:00 sudo -u talend /mnt/ebs/talend/bin/talend-trigger -c /mnt/ebs/talend/etc/trigger.conf --delete -f IMOS/ANMN/NRS/REAL_TIME/NRSYON/humidity/percentage_relative_humidity_channel_4100/2011/QAQC/IMOS_ANMN_M_20110101T000000Z_NRSYON_FV01_END-20110124T123928Z_C-20150325T105818Z.nc,IMOS/ANMN/NRS/REAL_TIME/NRSYON/humidity/percentage_relative_humidity_channel_4100/2011/QAQC/IMOS_ANMN_M_20110101T000000Z_NRSYON_FV01_END-20110124T123928Z_C-20150325T105818Z.nc
talend   31040  0.0  0.0  52432  9332 pts/1    Tl   10:55   0:00 ruby /mnt/ebs/talend/bin/talend-trigger -c /mnt/ebs/talend/etc/trigger.conf --delete -f IMOS/ANMN/NRS/REAL_TIME/NRSYON/humidity/percentage_relative_humidity_channel_4100/2011/QAQC/IMOS_ANMN_M_20110101T000000Z_NRSYON_FV01_END-20110124T123928Z_C-20150325T105818Z.nc,IMOS/ANMN/NRS/REAL_TIME/NRSYON/humidity/percentage_relative_humidity_channel_4100/2011/QAQC/IMOS_ANMN_M_20110101T000000Z_NRSYON_FV01_END-20110124T123928Z_C-20150325T105818Z.nc
talend   31042  0.0  0.0   4396   612 pts/1    T    10:55   0:00 sh -c /mnt/ebs/talend/jobs/anmn_nrs_dar_yon-anmn_nrs_dar_yon/java/ANMN_NRS_DAR_YON_harvester/ANMN_NRS_DAR_YON_harvester_run.sh --context_param paramFile="/mnt/ebs/talend/jobs/anmn_nrs_dar_yon-anmn_nrs_dar_yon/etc/anmn_nrs_dar_yon-anmn_nrs_dar_yon.conf" --context_param base=/tmp/d20160506-31040-1ozfy8x --context_param fileList=/tmp/harvester_file_list20160506-31040-16c82ee --context_param logDir=/tmp/d20160506-31040-1tsjkno
talend   31044  0.0  0.0   4396   608 pts/1    T    10:55   0:00 /bin/sh /mnt/ebs/talend/jobs/anmn_nrs_dar_yon-anmn_nrs_dar_yon/java/ANMN_NRS_DAR_YON_harvester/ANMN_NRS_DAR_YON_harvester_run.sh --context_param paramFile=/mnt/ebs/talend/jobs/anmn_nrs_dar_yon-anmn_nrs_dar_yon/etc/anmn_nrs_dar_yon-anmn_nrs_dar_yon.conf --context_param base=/tmp/d20160506-31040-1ozfy8x --context_param fileList=/tmp/harvester_file_list20160506-31040-16c82ee --context_param logDir=/tmp/d20160506-31040-1tsjkno
talend   31047  0.0  0.6 2648556 96964 pts/1   Tl   10:55   0:01 java -Xms256M -Xmx1024M -cp classpath.jar: anmn_nrs_dar_yon_ts.anmn_nrs_dar_yon_harvester_0_1.ANMN_NRS_DAR_YON_harvester --context_param paramFile=/mnt/ebs/talend/jobs/anmn_nrs_dar_yon-anmn_nrs_dar_yon/etc/anmn_nrs_dar_yon-anmn_nrs_dar_yon.conf --context_param base=/tmp/d20160506-31040-1ozfy8x --context_param fileList=/tmp/harvester_file_list20160506-31040-16c82ee --context_param logDir=/tmp/d20160506-31040-1tsjkno

Attempted to kill the client connection on db-prod, without any result.

Had to kill the processes manually on 10-aws-syd.
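
Something like this clears the stuck tree (a sketch only; the pids are the ones from the ps listing above, and since the processes are stopped (state T/Tl) a plain SIGTERM just stays pending until they are continued, so SIGKILL is the blunt option):

# kill the stopped talend-trigger tree on 10-aws-syd (pids from the ps output above)
sudo kill -KILL 31039 31040 31042 31044 31047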

julian1 commented 8 years ago

Might be an issue on the db side. The tx is in rollback and stalled, although killing it doesn't fix it.


jfca@2-aws-syd:~$ sudo -u postgres psql -d harvest -c  "select now()-query_start as time,* from pg_stat_activity where usename = 'anmn_nrs_dar_yon'" | cat
      time       | datid | datname | pid  | usesysid |     usename      | application_name |  client_addr  | client_hostname | client_port |         backend_start         | xact_start |         query_start          |         state_change          | waiting | state | backend_xid | backend_xmin |  query   
-----------------+-------+---------+------+----------+------------------+------------------+---------------+-----------------+-------------+-------------------------------+------------+------------------------------+-------------------------------+---------+-------+-------------+--------------+----------
 00:00:04.942303 | 16415 | harvest | 1395 |    16472 | anmn_nrs_dar_yon |                  | 52.64.146.138 |                 |       57181 | 2016-05-06 01:59:12.236771+00 |            | 2016-05-06 02:03:53.04598+00 | 2016-05-06 02:03:53.046038+00 | f       | idle  |             |              | ROLLBACK
(1 row)

jfca@2-aws-syd:~$ 
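
For reference, the server-side equivalent of killing the client is pg_terminate_backend; something along these lines (a sketch, not necessarily what was tried, and in this case it didn't help either):

sudo -u postgres psql -d harvest -c "select pg_terminate_backend(pid) from pg_stat_activity where usename = 'anmn_nrs_dar_yon' and state = 'idle';"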
julian1 commented 8 years ago

It appears that the Talend Liquibase update did not complete, and there was still a lock on the table:

harvest=# select * from anmn_nrs_dar_yon.databasechangeloglock ; 
 id | locked |        lockgranted         |                             lockedby                              
----+--------+----------------------------+-------------------------------------------------------------------
  1 | t      | 2016-05-05 05:32:38.322+00 | fe80:0:0:0:4b5:54ff:fe7c:a779%2 (fe80:0:0:0:4b5:54ff:fe7c:a779%2)
(1 row)

Fixed with

harvest=# update  anmn_nrs_dar_yon.databasechangeloglock set locked = false where id = 1 ; 
UPDATE 1
julian1 commented 8 years ago

It's a worry that pg_locks didn't show the lock on rollback.
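
For what it's worth, the Liquibase changelog lock is just a boolean flag in the databasechangeloglock row, not a Postgres lock, so once the transaction that set it has committed or rolled back there is nothing left for pg_locks to report. A query along these lines (a sketch) only shows locks held by live sessions:

sudo -u postgres psql -d harvest -c "
  select l.locktype, l.mode, l.granted, c.relname, a.usename
  from pg_locks l
  left join pg_class c on c.oid = l.relation
  left join pg_stat_activity a on a.pid = l.pid
  where a.usename = 'anmn_nrs_dar_yon';"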

lbesnard commented 8 years ago

@julian1 The same thing happened again today, and it's my fault. I did a po_s3_del to delete a file and typed Ctrl-C by mistake. This kills the process, but the database hangs with the databasechangeloglock still held.

Do you think you could run the same query again please?

lbesnard commented 8 years ago

@julian1 The same thing happened by itself with the FAIMMS harvester (probably due to the db update?). None of the files get processed. Could you run the same query on the faimms schema? Cheers.
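
For reference, the check on the faimms schema would be something like this (a sketch, assuming it has the same databasechangeloglock layout as anmn_nrs_dar_yon above):

sudo -u postgres psql -d harvest -c "select id, locked, lockgranted, lockedby from faimms.databasechangeloglock;"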

@smancini FYI, this is why there are a few errors on Nagios.

@lwgordonimos FYI, so you can be aware of issues related to Talend and Liquibase locks on the db.

julian1 commented 8 years ago

@lbesnard done.

harvest=# update faimms.databasechangeloglock set locked = false where locked = true; 
UPDATE 1
harvest=# 
lbesnard commented 7 years ago

@ggalibert @bhasin85 FYI

This happened again with ACORN

julian1 commented 7 years ago

What a bunch of crap. I wonder what happens to files when a failed liquibase update kills the pipeline?

ggalibert commented 7 years ago

They go in the error directory.

lbesnard commented 7 years ago

@julian1 We're having a few pipeline issues at the moment. It looks like my FAIMMS harvester is blocked again:

Caused by: liquibase.exception.LockException: Could not acquire change log lock.  Currently locked by fe80:0:0:0:4b5:54ff:fe7c:a779%2 (fe80:0:0:0:4b5:54ff:fe7c:a779%2) since 11/22/16 2:15 AM

Do you think you could run this again?

harvest=# update faimms.databasechangeloglock set locked = false where locked = true; 
UPDATE 1
harvest=# 

@bpasquer are you also affected by this?

lbesnard commented 7 years ago

@aodn/emii-ops could anyone do this please?

jonescc commented 7 years ago

I'll have a look at this

jonescc commented 7 years ago

Locked since 21/11/16.

harvest=> select * from databasechangeloglock;
 id | locked |        lockgranted         |                             lockedby                              
----+--------+----------------------------+-------------------------------------------------------------------
  1 | t      | 2016-11-21 15:15:53.576+00 | fe80:0:0:0:4b5:54ff:fe7c:a779%2 (fe80:0:0:0:4b5:54ff:fe7c:a779%2)

jonescc commented 7 years ago

harvest=> update faimms.databasechangeloglock set locked = false where locked = true;
UPDATE 1