Closed lbesnard closed 5 years ago
select distinct(url) from soop_xbt_nrt.soop_xbt_nrt_profiles_map where "TIME" < '1900-01-01T00:00:00Z' order by url
"IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000107T230600Z_9V3581_002415_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T215200Z_9V3581_002423_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T215200Z_9V3581_002430_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T220200Z_9V3581_002424_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T220200Z_9V3581_002431_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T230300Z_9V3581_002425_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T230300Z_9V3581_002432_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T232000Z_9V3581_002426_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T232000Z_9V3581_002433_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T235400Z_9V3581_002427_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000210T235400Z_9V3581_002434_FV00.csv" "IMOS/SOOP/SOOP-XBT/REALTIME/9V3581_Maersk-Jalan/200/IMOS_SOOP-XBT_T_2000211T004300Z_9V3581_002435_FV00.csv" ...
the reason seems to be in the filename not containing enough characters in the time value around the year/month part: 2000210T230300Z
This query shows it affected files uploaded in 2016-03
select max(last_indexed) from soop_xbt_nrt.indexed_file where url IN (select distinct(url) from soop_xbt_nrt.soop_xbt_nrt_profiles_map where "TIME" < '1900-01-01T00:00:00Z' order by url)
@smancini do you know you dealt with this collection back then ? maybe we should contact the facility leader about this bug.
Also, the physical files aren't on S3. However I can always un-indexed them
FYI @ggalibert
I tried to delete all those files with the po_s3_del command. However the data is still in the database.
For example:
select * from soop_xbt_nrt.indexed_file where url like 'IMOS/SOOP/SOOP-XBT/REALTIME/FHZI_Astrolabe/200/IMOS_SOOP-XBT_T_2000221T041300Z_FHZI_003570_FV00.csv';
correctly shows it was deleted and last_indexed today "2019-05-02 02:42:23.828+00".
however the data is still there
select * from soop_xbt_nrt.soop_xbt_nrt_profiles_map where url like 'IMOS/SOOP/SOOP-XBT/REALTIME/FHZI_Astrolabe/200/IMOS_SOOP-XBT_T_2000221T041300Z_FHZI_003570_FV00.csv';
The SOOP_XBT_NRT harvester seems to be set up correctly to delete files:
"DELETE FROM attribute WHERE file_id = " + context.file_id + " ;
DELETE FROM measurements WHERE profile_id = " + context.file_id + " ;
DELETE FROM soop_xbt_nrt_profiles_map WHERE profile_id = " + context.file_id + " ;
DELETE FROM indexed_file where id = " + context.file_id + ";"
with a proper commit and tpostjob component. however, files stay in the _map table.
Would you have an idea @jonescc
I'll have a look.
The file was deleted a long time ago
select * from soop_xbt_nrt.file_harvest where file_id = 5478;
harvest=> select * from soop_xbt_nrt.file_harvest where file_id = 5478;
id | harvest_type | file_id | last_moddate_harvested | last_change_type | last_change_run
------+--------------+---------+------------------------+------------------+-----------------
5478 | harvest | 5478 | 2016-03-01 04:00:54+00 | deleted | 453
(1 row)
We are currently up to run 1771
harvest=> select * from soop_xbt_nrt.index_job;
id | name | last_run_no | last_run_date
----+-------+-------------+----------------------------
1 | index | 1771 | 2019-05-02 02:44:45.811+00
(1 row)
Not sure what happened back then and why the records weren't cleaned up at the time. Regardless, you would still expect that the del command should trigger the delete action and remove the records.
Unfortunately, up until recently, the delete action would only be triggered if a delete hadn't already been performed (checked the deleted flag on indexed_file). This was fixed in https://github.com/aodn/talend-components/pull/60, but only harvesters with the problem being fixed by this PR were rebuilt with the updated components (the GENERIC_TIMESTEP harvester).
If you force a rebuild of the harvester to pick up these changes, then the s3 del command should work for this file.
I kicked off a build for the harvester and have promoted it to RC for testing. Should be deployed shortly.
@jonescc
The new version of the harvester is on RC
talend_version soop_xbt_nrt-soop_xbt_nrt
GIT_COMMIT=a54b492319e94dec3504d2919fcf5f0b3c90c276
BUILD_URL=https://jenkins.aodn.org.au/job/talend_build/1372/
BUILD_TAG=jenkins-talend_build-1372
ROOT_BUILD_CAUSE=MANUALTRIGGER
BUILD_JOB_NAME=talend_build
BUILD_CAUSE_MANUALTRIGGER=true
BUILD_DISPLAY_NAME=#1372
BUILD_ID=1372
BUILD_CAUSE=MANUALTRIGGER
BUILD_NUMBER=1372
ROOT_BUILD_CAUSE_MANUALTRIGGER=true
I performed a po_s3_del
po_s3_del IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv
Deleting 'IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv' with index deletion
delete: 's3://imos-test-data/IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv'
however the file is still in the database (and set a deleted TRUE)
select * from soop_xbt_nrt.soop_xbt_nrt_profiles_map where url like 'IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv';
┌─[ RECORD 1 ]────┬───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ profile_id │ 211 │
│ url │ IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv │
│ Callsign │ FASB │
│ vessel_name │ Astrolabe │
│ TIME │ 2017-12-27 09:03:00+00 │
│ nb_measurements │ 30 │
│ max_depth │ 48 │
│ colour │ #155B90 │
│ geom │ 0101000020E61000002FBBED63D1536240EA2532AAA6C847C0 │
│ size │ 1154 │
└─────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────┘
select * from soop_xbt_nrt.indexed_file where url like 'IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv';
┌─[ RECORD 1 ]─────┬───────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ id │ 211 │
│ job_id │ 1 │
│ url │ IMOS/SOOP/SOOP-XBT/REALTIME/FASB_Astrolabe/2017/IMOS_SOOP-XBT_T_20171227T200300Z_FASB_000450_FV00.csv │
│ created │ ∅ │
│ modified │ 2019-01-31 04:32:02+00 │
│ first_indexed │ 2019-01-31 04:33:56.91022+00 │
│ last_indexed │ 2019-05-06 02:22:19.441+00 │
│ last_indexed_run │ 6 │
│ size │ 1154 │
│ deleted │ t │
└──────────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────┘
So it doesn't look like it worked
Hmm. Talend craziness. In the source code the connection to use for the commit is not set (see https://github.com/aodn/harvesters/blob/master/workspace/SOOP_XBT_NRT/process/SOOP_XBT_NRT_subjobs/deleteData_0.1.item#L90). But when you go into the harvester and look at it it shows tPostgresqlConnection_1. I think its just showing the options available but it it hasn't actually been set. I selected it and needed to turn close connection off and on to get the save button to come on so I could save the changes and then it worked.
Also note that the test scripts we are currently writing for the talend upgrade would have picked up this problem as we try to exercise each harvester path (delete/update/add)
cleaned
I discovered by chance 3+ xbt nrt voyages with times around the year 200 A.D. https://portal.aodn.org.au/search?uuid=35234913-aa3c-48ec-b9a4-77f822f66ef8
currently investigating the reasons why