MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0
44 stars 22 forks source link

v2 sender unable to decode corrupted filename on ftp connection #1198

Closed andreleblanc11 closed 3 days ago

andreleblanc11 commented 2 weeks ago

What happened

(Only confirmed to be affecting V2.24.08. We also know it doesn't affect V2.21.04)*

*UPDATE: it actually DOES affect v2.21.04... but only when running on python >= 3.9 so ubuntu 22.04. the old version is safe only because it is running on python 3.6 on ubuntu 18.04.

A delivery failed when a sender tried to transfer a file.

2024-08-26 22:46:32,496 [ERROR] Delivery failed ./20240826T2236Z_MSC_Radar-Composite_MMHR_1km.tif
2024-08-26 22:46:32,497 [WARNING] sending again, attempt 2
2024-08-26 22:46:35,105 [ERROR] Unable to connect to the-destination (user:ec.gc.ca)
2024-08-26 22:46:35,105 [WARNING] sending again, attempt 3
2024-08-26 22:46:37,693 [ERROR] Unable to connect tothe-destination (user:ec.gc.ca)
2024-08-26 22:46:37,694 [INFO] appended to retry list file 20240826224343.792709827 http://ddsr-cmc-ops01.cmc.ec.gc.ca/ /20240826/MSC-RADAR/unique/GEOTIFF_6MIN/COMPOSITE/MMHR/20240826T2236Z_MSC_Radar-Composite_MMHR_1km.tif

On servers hosting V2.21.04, file transferring continued without any problem. However, on servers hosting V2.24.08 of v2, all following file transfers got appended to the retry queue.

for S in $(ls -d ddsr-cmc-ops0{1..8}*); do echo $S: && grep 'Unable to connect' $S/local/home/sarra/.cache/sarra/log/sr_sender_my-config*.log.2024-08-27 | wc -l ; done
ddsr-cmc-ops01: <-  V2.24.08
5080
ddsr-cmc-ops02: <-  V2.24.08
4173
ddsr-cmc-ops03: <-  V2.24.08
5581
ddsr-cmc-ops04: <-  V2.24.08
4864
ddsr-cmc-ops05: <-  V2.24.08
5299
ddsr-cmc-ops06: <-  V2.24.08
5187
ddsr-cmc-ops07: <-  V2.21.04
0
ddsr-cmc-ops08: <-  V2.21.04
0

The error message when debug is on

[DEBUG] sr_ftp nlst: ['20240826T2236Z_MSC_Radar-Composite_MMHR_1km.tif']
[DEBUG] sr_ftp ls_file_index
[ERROR] Unable to connect to the-destination (user: username)
[DEBUG] Exception details:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/sarra/sr_ftp.py", line 220, in connect
else: self.init_file_index()
File "/usr/lib/python3/dist-packages/sarra/sr_ftp.py", line 312, in init_file_index
self.ftp.retrlines('LIST', self.ls_file_index )
File "/usr/lib/python3.10/ftplib.py", line 465, in retrlines
line = fp.readline(self.maxline + 1)
File "/usr/lib/python3.10/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb7 in position 44: invalid start byte

Position 44 refers to the . character in 20240826T2236Z_MSC_Radar-Composite_MMHR_1km.tif

Corrective action

To prevent file transfers from further going to the retry queue with V2.24.08, you can remove the corrupted filename on the remote server and the file transferring should resume.

UPDATE : Removing the file did make the sender stop failing on every file transfer. However, new transfer attempts will fail whenever nlst returns a file with . in its filename.

I'm still not sure why this is a problem on v2.24.08

Permanent solution

Move the configuration to sr3

andreleblanc11 commented 2 weeks ago

This is the commit that introducts utf-8 encoding to FTP file transfers : https://github.com/MetPX/sarracenia/commit/42d9f09edad230839cefe9186ef7cfb9d73d9ccb

andreleblanc11 commented 2 weeks ago

This is definitely a V2.24.08 bug. I've checked our archived logs and saw that these errors have started spawning on our DDSR cluster every time a new node got patched 😭, even before we got a failed delivery on August 26th

petersilva commented 2 weeks ago

The commit above is to use utf-8 for filenames. This was probably done to allevate issues when file names have accented characters in them, and so eight bit chars show up... older systems would use iso-8859-1, but the charset would vary by regions, so it would have to be set per server. In the past ten years of so, utf-8 has taken over and become universal.

In this case, the complaint is about encoding a period character. Utf-8 for . is the same as iso8859 for . is the same as us-ascii for . It's unclear why using utf8 would break the presence of . in a file name...

petersilva commented 2 weeks ago

OK, I understand... utf8 is involved... it is actually character B7... "middle dot" which is not a normal period. Those have the same character ordinal in the charset, but the encoding would be different. This is also readin from an ls_index file. I suspect the ls_index file is being written in iso8859, and then read in utf8 causing utf8 to barf on the iso8859 encoding for middle dot.

petersilva commented 2 weeks ago

the actual fix should be something like writing all the ls_index files using utf-8 encoding.

petersilva commented 2 weeks ago

This probably affects any file transfer with an eight-bit name... (accented chars, etc...)

petersilva commented 2 weeks ago

weirdnesses... so it turns out that

This change is documented as being the result of https://datatracker.ietf.org/doc/html/rfc2640.html which changed the expected default.

latin-1 usually means 7bit ascii, so 8 bit characters should have been illegal before, and it should have complained...but just ignored the problem.

It looks like the remote server had a file with an invalid character (iso8859-1 encoded middle dot, rather than a normal period.) ...

checking that server, we see:

image

So I guess supplying an encoding option might help.

from https://docs.python.org/3/library/ftplib.html:

Changed in version 3.9: .. The encoding parameter was added, and the default was changed from Latin-1 to UTF-8 to follow [RFC 2640] (https://datatracker.ietf.org/doc/html/rfc2640.html).

petersilva commented 2 weeks ago

Reading over the SFTP RFC. it looks like they really, really, really want you to use UTF-8. I have never encountered a non utf-8 sftp server, so for now will continue to just assume all SFTP's are utf-8.

petersilva commented 2 weeks ago

ok, so:

so once PR is accepted, can close this.

petersilva commented 1 week ago

Actually, once fixed in sr3, we can label it v2only wontfix... and leave it open, I guess.