citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Investigate sockets being in TIME_WAIT with several parallel COPYs #617

Open samay-sharma opened 8 years ago

samay-sharma commented 8 years ago

When we ran several short COPYs in parallel on hash-distributed tables, we saw errors saying that no ports were available to establish connections for COPY. Note that the number of connections was still lower than the max_connections parameter on the worker nodes.

This is likely because many sockets are in TIME_WAIT. We enabled tcp_tw_reuse and tcp_tw_recycle, and set tcp_fin_timeout to 30, but that still didn't resolve the issue.
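
One way to check the TIME_WAIT hypothesis is to count sockets in that state on the node that opens the connections. The following is a minimal sketch, assuming a Linux host where `/proc/net/tcp` is readable (IPv4 only) and where state code `06` denotes TIME_WAIT; the function names are hypothetical and not part of Citus.

```python
from collections import Counter

# Hex state code that /proc/net/tcp uses for TIME_WAIT on Linux.
TIME_WAIT_STATE = "06"


def count_time_wait(proc_net_tcp_text):
    """Count TIME_WAIT sockets per remote endpoint.

    Takes the text content of /proc/net/tcp and returns a Counter keyed by
    the raw remote "ADDR:PORT" field (hex-encoded, exactly as the kernel
    prints it).
    """
    counts = Counter()
    # The first line is a column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue  # ignore blank or truncated lines
        remote_address, state = fields[2], fields[3]
        if state == TIME_WAIT_STATE:
            counts[remote_address] += 1
    return counts


def read_proc_net_tcp(path="/proc/net/tcp"):
    """Read the kernel's IPv4 TCP socket table (Linux only)."""
    with open(path) as f:
        return f.read()
```

Calling `count_time_wait(read_proc_net_tcp())` while the parallel COPYs are running would show whether TIME_WAIT sockets pile up against the workers' PostgreSQL port.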

We should investigate further to understand the cause of this.

@anarazel : Please feel free to add anything I may have missed.

ozgune commented 8 years ago

@samay-sharma / @anarazel -- I had two quick questions.

Do we know how long ports stay in the TIME_WAIT state (by default)? Also, how many ports do we have available to allocate from? From that, we can roughly calculate the number of new connections Citus can open per second in a sustained manner.
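
The rough calculation described above can be sketched as follows. The numbers are assumptions rather than measurements from this issue: a Linux ephemeral port range of 32768-60999 (the default `net.ipv4.ip_local_port_range` on recent kernels) and a TIME_WAIT duration of 60 seconds. The limit applies per combination of source IP, destination IP, and destination port, so it is per worker endpoint.

```python
def sustained_connection_rate(port_low, port_high, time_wait_seconds):
    """Upper bound on new connections per second to one remote endpoint.

    Each closed connection keeps its local port in TIME_WAIT for
    time_wait_seconds, so at steady state at most
    (number of ephemeral ports) / time_wait_seconds
    new connections can be opened per second.
    """
    available_ports = port_high - port_low + 1  # range is inclusive
    return available_ports / time_wait_seconds


# Assumed Linux defaults; check the actual values on the machine in question.
rate = sustained_connection_rate(32768, 60999, 60)
print(f"about {rate:.0f} new connections per second per worker endpoint")
```

Under these assumptions the bound is roughly 470 new connections per second to a single worker, which a burst of many short parallel COPYs could plausibly exceed.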

samay-sharma commented 7 years ago

Another user reported that they could also easily reproduce running out of ports with COPY.