Erlang/OTP
http://erlang.org
Apache License 2.0
Long running SFTP transfers with :ssh_sftp.read/4 fail #8724

Open nsweeting opened 1 month ago

nsweeting commented 1 month ago

Describe the bug: Long-running SFTP transfers using :ssh_sftp.read/4 seem to consistently fail at some point. The failure takes the form of :ssh_sftp.read/4 getting stuck indefinitely because of its :infinity timeout. The overall task wrapping the transfer eventually times out after x minutes of no data movement.

To Reproduce: Unfortunately this is a bit difficult. We were more or less able to reproduce it - it just takes a long time. Essentially, execute a long-running SFTP transfer using :ssh_sftp.read/4 with a throttled download speed (500-600 kb/s range). After about 6-7GB of transfer, the read function seems to get "stuck" with no data movement.
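For illustration only, here is a minimal sketch of the kind of chunked read loop described above. It is not the reporter's actual code. The module name `SftpReadSketch`, the chunk size, and the 60-second timeout are assumptions made for this example. It passes a finite timeout to :ssh_sftp.read/4 instead of :infinity, so a stalled read should come back as `{:error, :timeout}` along with the number of bytes received so far, rather than hanging.

```elixir
defmodule SftpReadSketch do
  # Hypothetical helper, not taken from the report.
  # Assumed values: 64 KiB chunks and a 60 s per-read timeout.
  @chunk_size 64 * 1024
  @read_timeout 60_000

  # `channel` is the pid returned by :ssh_sftp.start_channel/3.
  def download(channel, remote_path, local_path) do
    {:ok, handle} =
      :ssh_sftp.open(channel, String.to_charlist(remote_path), [:read, :binary])

    {:ok, file} = File.open(local_path, [:write, :binary])

    try do
      loop(channel, handle, file, 0)
    after
      # Always release the local file and the remote handle.
      File.close(file)
      :ssh_sftp.close(channel, handle)
    end
  end

  defp loop(channel, handle, file, total) do
    case :ssh_sftp.read(channel, handle, @chunk_size, @read_timeout) do
      {:ok, data} ->
        :ok = IO.binwrite(file, data)
        loop(channel, handle, file, total + byte_size(data))

      :eof ->
        {:ok, total}

      {:error, reason} ->
        # With a finite timeout, a stall is reported here as :timeout,
        # together with the byte offset at which the transfer stopped.
        {:error, reason, total}
    end
  end
end
```

Logging the returned byte offset on failure would show whether the stall really lands in the 6-7GB range each time.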

Expected behavior: Long-running SFTP transfers using :ssh_sftp.read/4 should complete.

Affected versions: erlang-26.2.5.1

Additional context: We run a service that is responsible for moving data from some SFTP location to our internal network. We move thousands of files a day. In this specific context - these servers are hosted by Salesforce. Download speeds are typically throttled to be in the 500-600 kb/s range. We can have many of these transfers running at the same time for the same server. We normally have no issue.

We recently upgraded the base Docker image we use from hexpm/elixir:1.16.0-erlang-26.2.1-alpine-3.18.4 to hexpm/elixir:1.17.1-erlang-26.2.5.1-alpine-3.20.1. After this upgrade we had consistent failures for long-running transfers. These were files in the 15GB range, and they seemed to consistently fail in the 6-7GB range. We had days of these kinds of failures accumulate - so it actually seemed to be fairly reproducible - although it takes a long time! As soon as we switched back to hexpm/elixir:1.16.0-erlang-26.2.1-alpine-3.18.4 - all transfer jobs succeeded. Shorter transfers seem to have no issue.

It's difficult to know specifically whether this is an issue introduced by the OTP upgrade - but at this point - it seems related. There were a couple of updates to the :ssh module within this upgrade range.

IngelaAndin commented 1 month ago

Spontaneously, this sounds like an ssh window_adjustment problem, such as the one fixed for a different scenario described in #7483. Our ssh expert is on vacation right now, but he will be back soon and look into this.