MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0

v2 sarra downloads corrupted files only when debug is set to off #1274

Open andreleblanc11 opened 1 month ago

andreleblanc11 commented 1 month ago

Problem

There was an incident on Friday where one of our clients changed their data source from SFTP to HTTPS.

To accommodate, we changed the destination from SFTP to HTTPS in our v2 poll. However, the v2 poll didn't post messages, so we converted it to sr3 (this is a whole separate issue, but worth noting for context).

So we have a v3 poll now posting to a v2 sarra.

For some reason still unknown, the v2 sarra would download corrupted files from the v3 poll. The fix to prevent this data corruption was to specify debug on inside of the v2 sarra configuration.

Additional info

The v3 poll doesn't post the file size because it isn't being listed remotely.

2024-10-24 00:00:40,542 [DEBUG] sarracenia.flowcb.poll poll_file_post desc: type: <class 'paramiko.sftp_attr.SFTPAttributes'>, value: ?rwxr-xr-x   1 0        0               0 12 Oct 22:18 ?
2024-10-24 00:00:40,542 [DEBUG] sarracenia.flowcb.poll poll_file_post desc: type: <class 'paramiko.sftp_attr.SFTPAttributes'>, value: ?rwxr-xr-x   1 0        0               0 12 Oct 22:18 ?
2024-10-24 01:20:59,157 [INFO] sarracenia.flowcb.log after_post posted {  '_deleteOnPost':'{'exchange', 'post_topic', '_format', 'subtopic', 'local_offset', 'noDupe', 'old_format', 'mode', 'new_dir', 'new_relPath', 'post_exchange', 'new_file', 'new_baseUrl', 'new_subtopic', 'post_format'}', '_format':'v02', 'baseUrl':'https://repo.gportal.jaxa.jp', 'exchange':'['xs_JAPAN-JAXA']', 'from_cluster':'DDSR.CMC-DEV', 'identity':'{  'method':'cod', 'value':'sha512' }', 'local_offset':'0', 'mode':'755', 'mtime':'20241024T011900', 'new_baseUrl':'https://repo.gportal.jaxa.jp', 'new_dir':'/standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10', 'new_file':'GW1AM2_202410232320_175D_L1SGBTBR_2220220.h5', 'new_relPath':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410232320_175D_L1SGBTBR_2220220.h5', 'new_subtopic':'['standard', 'GCOM-W', 'GCOM-W.AMSR2', 'L1B', '2', '2024', '10']', 'noDupe':'{  'key':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410232320_175D_L1SGBTBR_2220220.h5,20241024T011900', 'path':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410232320_175D_L1SGBTBR_2220220.h5' }', 'old_format':'v02', 'post_exchange':'xs_JAPAN-JAXA', 'post_format':'v02', 'post_topic':'v02.post.standard.GCOM-W.GCOM-W.AMSR2.L1B.2.2024.10', 'pubTime':'20241024T012059.135126591', 'relPath':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410232320_175D_L1SGBTBR_2220220.h5', 'source':'JAPAN-JAXA', 'subtopic':'['standard', 'GCOM-W', 'GCOM-W.AMSR2', 'L1B', '2', '2024', '10']', 'sundew_extension':'pull-japan-amsr2:JAPAN:AMSR2:BIN:', 'to_clusters':'ALL' }
2024-10-24 01:20:59,157 [INFO] sarracenia.flowcb.log after_post posted {  '_deleteOnPost':'{'exchange', 'post_topic', '_format', 'subtopic', 'local_offset', 'noDupe', 'old_format', 'mode', 'new_dir', 'new_relPath', 'post_exchange', 'new_file', 'new_baseUrl', 'new_subtopic', 'post_format'}', '_format':'v02', 'baseUrl':'https://repo.gportal.jaxa.jp', 'exchange':'['xs_JAPAN-JAXA']', 'from_cluster':'DDSR.CMC-DEV', 'identity':'{  'method':'cod', 'value':'sha512' }', 'local_offset':'0', 'mode':'755', 'mtime':'20241024T011900', 'new_baseUrl':'https://repo.gportal.jaxa.jp', 'new_dir':'/standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10', 'new_file':'GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5', 'new_relPath':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5', 'new_subtopic':'['standard', 'GCOM-W', 'GCOM-W.AMSR2', 'L1B', '2', '2024', '10']', 'noDupe':'{  'key':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5,20241024T011900', 'path':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5' }', 'old_format':'v02', 'post_exchange':'xs_JAPAN-JAXA', 'post_format':'v02', 'post_topic':'v02.post.standard.GCOM-W.GCOM-W.AMSR2.L1B.2.2024.10', 'pubTime':'20241024T012059.135178089', 'relPath':'standard/GCOM-W/GCOM-W.AMSR2/L1B/2/2024/10/GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5', 'source':'JAPAN-JAXA', 'subtopic':'['standard', 'GCOM-W', 'GCOM-W.AMSR2', 'L1B', '2', '2024', '10']', 'sundew_extension':'pull-japan-amsr2:JAPAN:AMSR2:BIN:', 'to_clusters':'ALL' }

The sarra (with debug on) then gets the file size (somehow), notices it's different from what is available remotely, and then downloads it.

2024-10-24 01:23:59,387 [ERROR] util/writelocal mismatched file length writing GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5. Message said to expect 31843505 bytes.  Got 31558972 bytes.
2024-10-24 01:23:59,388 [INFO] file_log downloaded to: /apps/sarra/public_data/20241024/JAPAN-JAXA/AMSR2/01/GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5
2024-10-24 01:23:59,388 [INFO] post_log notice=20241024012339.402512312 http://ddsr-cmc-ops06.cmc.ec.gc.ca /20241024/JAPAN-JAXA/AMSR2/01/GW1AM2_202410240009_175A_L1SGBTBR_2220220.h5 headers={'sundew_extension': 'pull-japan-amsr2:JAPAN:AMSR2:BIN:', 'source': 'JAPAN-JAXA', 'mtime': '20241024011900', 'sum': 's,5fb423f87f8f6436767b6fb33a106ef024d387a23edf2df631d6da83c43a66a9592f4f80564b80ed0080dcb551d9e22b40f34a920d368efdc07530a04bfd1374', 'parts': '1,31843505,1,0,0', 'from_cluster': 'DDSR.CMC', 'to_clusters': 'localhost'}
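For anyone wanting to re-check downloaded files by hand, here is a rough sketch of the same length comparison the error above reports. It assumes the second field of the 'parts' header is the expected byte count when the first field is '1' (whole file in one part), which is what the log above suggests. The function names are made up for illustration and are not part of sarracenia.

```python
import os


def expected_size_from_parts(parts):
    """Return the expected byte count from a 'parts' header such as '1,31843505,1,0,0'.

    Assumption: with method '1' the second field is the full file size.
    Other methods are deliberately not handled in this sketch.
    """
    fields = parts.split(",")
    if len(fields) != 5:
        raise ValueError(f"unexpected parts header: {parts!r}")
    method, size = fields[0], int(fields[1])
    if method != "1":
        raise NotImplementedError(f"parts method {method!r} not handled here")
    return size


def check_download(path, parts):
    """Return (expected, actual, sizes_match) for a downloaded file."""
    expected = expected_size_from_parts(parts)
    actual = os.path.getsize(path)
    return expected, actual, expected == actual
```

Calling check_download() with the local path and the 'parts' value copied from the post_log line would show whether the file on disk has the length the message announced.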

When the sarra didn't have debug on, a lot of the downloaded files had the same reported file size, which was strange.

-rw-rw-r-- 1 sarra sarra 31709548 Oct 19 19:01 GW1AM2_202410191122_051A_L1SGBTBR_2220220.h5
-rw-rw-r-- 1 sarra sarra 31709548 Oct 19 19:02 GW1AM2_202410191212_067D_L1SGBTBR_2220220.h5
-rw-rw-r-- 1 sarra sarra 31709548 Oct 19 19:05 GW1AM2_202410191709_115D_L1SGBTBR_2220220.h5
-rw-rw-r-- 1 sarra sarra 32769288 Oct 19 19:34 GW1AM2_202410181715_108A_L1SGBTBR_2220220.h5
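A quick way to spot this pattern across a whole directory is to group files by size and list any size shared by more than one file. This is only a sketch: the helper name and the "*.h5" default are my own choices, and identical sizes are a hint of trouble here, not proof of corruption.

```python
from collections import defaultdict
from pathlib import Path


def files_sharing_a_size(directory, pattern="*.h5"):
    """Map each file size that occurs more than once to the sorted file names having it."""
    by_size = defaultdict(list)
    for path in Path(directory).glob(pattern):
        if path.is_file():
            by_size[path.stat().st_size].append(path.name)
    # Keep only sizes seen on two or more files.
    return {size: sorted(names) for size, names in by_size.items() if len(names) > 1}
```

Run against the download directory, the listing above would come back as one entry for 31709548 with three file names, while the 32769288 file would be left out.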

The next step in this saga is to try to port the v2 sarra to sr3. However, I've been testing an sr3 sarra on dev and the checksums we receive are different from what is on ops with the v2 sarra.
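To compare checksums independently of either sarra, one option is to hash the downloaded file locally and compare it with the 'sum' header from the v2 log. The sketch below assumes the 's,' prefix means a SHA-512 of the file content given in hex. It also returns a base64 form, in case the sr3 side encodes the digest differently (an assumption worth verifying before drawing conclusions). The function names are hypothetical.

```python
import base64
import hashlib


def sha512_digests(path, chunk_size=1024 * 1024):
    """Return (hex_digest, base64_digest) of the file's SHA-512, reading in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest(), base64.b64encode(h.digest()).decode("ascii")


def matches_v2_sum(path, sum_header):
    """Compare a local file against a v2-style 'sum' header such as 's,<hex digest>'.

    Assumption: method 's' is SHA-512 over the file content, hex encoded.
    """
    method, _, value = sum_header.partition(",")
    if method != "s":
        raise NotImplementedError(f"sum method {method!r} not handled here")
    return sha512_digests(path)[0] == value.strip().lower()
```

If the hex digest of the same file differs between dev and ops, the file contents really differ. If only the encoded strings differ, the difference may just be in how each version reports the checksum.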

Unfortunately the stat stuff in the transfer class isn't available on OPS yet, and would likely help for this kind of problem.

andreleblanc11 commented 1 month ago

The workaround was to port the sarra configuration to sr3. Even without the stat call, it's able to download the files without corruption.