actix / actix-web

Actix Web is a powerful, pragmatic, and extremely fast web framework for Rust.
https://actix.rs
Apache License 2.0

StaticFile & memory leak #426

Closed · neowutran closed this 6 years ago

neowutran commented 6 years ago

Hi, I am playing a bit with Rust/actix-web on my personal website. The code is very basic: https://git.neowutran.ovh:8080/neowutran/Website/src/master/src/main.rs

server::new(|| {
    App::new()
        // .middleware(middleware::Logger::default())
        .resource("/store_stats", |r| {
            r.method(Method::POST).with_config(store_statistics, |config| {
                config.limit(MAX_SIZE_STATS);
            })
        })
        .resource("/store_packets", |r| {
            r.method(Method::POST).with_config(store_packets, |config| {
                config.0.limit(MAX_SIZE_PACKET);
            })
        })
        .handler("/updates", StaticFiles::new("/home/gl0/updates/").unwrap().show_files_listing())
        .handler("/others", StaticFiles::new("/home/gl0/others/").unwrap().show_files_listing())
        .handler("/teraDB", StaticFiles::new("/home/gl0/teraDB/").unwrap().show_files_listing())
        .handler("/", StaticFiles::new("/home/http/neowutran/").unwrap().show_files_listing().index_file("index.html"))
})

I run 2 instances of this exact same code: one on port 443, one on port 8083.

On port 8083, only /store_packets and /store_stats are used. 0.2% of server memory, stable. Actively used (~thousands of requests per hour, gigabytes uploaded). OK.

On port 443, only the StaticFiles handlers are used: /updates, /others, /teraDB, / (TLS + HTTP2). Very actively used (~15k users hit /updates to download things). When I launched it yesterday, it used 0.2% of server memory. Memory usage increased over time: after ~15h, it now uses 10% of server memory. With the command netstat -ano, I see that a lot of connections (hundreds) are "ESTABLISHED" on the service. Maybe in the case of HTTP2/TLS there is an issue with connections not being released or something like that? (For the same workload, apache httpd stays at ~0.2% of server memory, stable.)

Thanks

DoumanAsh commented 6 years ago

So the issue happens with TLS? Out of curiosity, did you try disabling HTTP2? Also, could you try playing with the keep-alive settings to see if changing the defaults affects it?

neowutran commented 6 years ago

I just disabled HTTP2 with .no_http2() and relaunched. Will send you the result in 1 or 2 days.

neowutran commented 6 years ago

Memory usage keeps increasing quickly and constantly. netstat -ano | grep "ESTABLISHED off" | grep :443 shows more and more entries in the ESTABLISHED off (0.00/0/0) state.

Trying with a socket keep-alive setting.

fafhrd91 commented 6 years ago

"ESTABLISHED" means tcp connection is alive on kernel level, in that case actix keeps it open. that might be a reason for memory consumption. Try to change keep-alive settings https://actix.rs/actix-web/actix_web/server/struct.HttpServer.html#method.keep_alive change it to KeepAlive::Timeout()

I'll try to reproduce

neowutran commented 6 years ago

Tried .keep_alive(None).

For the StaticFiles on port 443: there are still connections that are not closed, but fewer, and memory usage grows slower (after 2-3 hours: ~16 connections not correctly closed; after ~13 hours: ~40 connections not correctly closed. Memory usage started at 0.2% and increased over time; after ~13 hours it is at 2%).

For /store_stats and /store_packets on port 8083, memory usage is stable (0.2%, unchanged since the beginning 13 hours ago). But ~20 connections didn't get closed correctly.

(The code on port 443 & port 8083 is exactly the same. Access to ports 443 & 8083 is mostly automated by software. Port 8083 is only used for /store_stats and /store_packets; port 443 is only used for the StaticFiles handlers.)

So I think there are 2 different things:

- connections that are not closed correctly (this happens on both ports)
- memory usage that keeps growing, which only shows up where the StaticFiles handlers are used (port 443)

Trying with the timeout.

neowutran commented 6 years ago

Tried with .keep_alive(server::KeepAlive::Timeout(5)) for ~13 hours. Connections are closed correctly. However, on port 443 (StaticFiles handlers), memory usage keeps increasing constantly: 0.2% of server memory at the beginning, 6% after ~13 hours.

So I think there is also an issue with the StaticFiles handler.

DoumanAsh commented 6 years ago

@neowutran Hmmm, so KeepAlive::Timeout ensures that no opened connections remain after the timeout, but memory usage still increases on the TLS port?

This is strange, as StaticFiles would then increase memory usage on the non-TLS port too.

neowutran commented 6 years ago

"Hmmm so does KeepAlive ensures that no opened connections remain after timeout? but memory usage still increases on TLS port?" Yes, with KeepAlive::Timeout(5) no opened connections remain after timeout. With the default settings, connections are not closed correctly. With keep_alive(None), connection are not closed correctly. With KeepAlive::Timeout(5) it is OK, no opened connection remains after timeout, everything is closed correctly.

Both port 443 & 8083 use TLS, it is the exact same code I launch it like ./neowutran 443 ./neowutran 8083

The only thing that is different is how they are used. Port 443 is used by web browsers & a software auto-updater; only the StaticFiles handlers are used. Port 8083 is used by a software to send statistics/logs; only .resource("/store_packets") and .resource("/store_stats") are used.

neowutran commented 6 years ago

Switched from the default allocator to the system allocator, recompiled, and launched on port 443 under valgrind: valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes ./target/release/neowutran 443. I'll let it run for one day and see.
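
For reference, the allocator switch is done with the standard library mechanism (Rust 1.28+); a minimal sketch:

use std::alloc::System;

// Route every heap allocation through the system malloc/free so that
// valgrind can intercept and attribute them.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    let v = vec![0u8; 1024]; // now allocated via malloc
    println!("{}", v.len());
}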

fafhrd91 commented 6 years ago

Could you also check whether memory leaks with the rust-tls feature?

neowutran commented 6 years ago

Ok, I will try that tomorrow

neowutran commented 6 years ago

Stopped it at ~30% of server memory. Full output: https://neowutran.ovh:8083/valgrind.err

==7927== 98,020,578 bytes in 9 blocks are still reachable in loss record 420 of 421
==7927==    at 0x4C2F0FF: realloc (vg_replace_malloc.c:785)
==7927==    by 0x3EC27D: <alloc::raw_vec::RawVec<T, A>>::reserve (in /root/Website/target/release/neowutran)
==7927==    by 0x3EBE68: bytes::bytes::Inner::reserve (in /root/Website/target/release/neowutran)
==7927==    by 0x3EB5B5: bytes::bytes::BytesMut::extend_from_slice (in /root/Website/target/release/neowutran)
==7927==    by 0x262D72: <actix_web::server::output::TransferEncoding as std::io::Write>::write (in /root/Website/target/release/neowutran)
==7927==    by 0x261F57: <brotli2::write::BrotliEncoder<W>>::dump (in /root/Website/target/release/neowutran)
==7927==    by 0x26155A: std::io::Write::write_all (in /root/Website/target/release/neowutran)
==7927==    by 0x2742C8: actix_web::server::output::Output::write (in /root/Website/target/release/neowutran)
==7927==    by 0x191D85: <actix_web::server::h2writer::H2Writer<H> as actix_web::server::Writer>::write (in /root/Website/target/release/neowutran)
==7927==    by 0x1F5264: <actix_web::pipeline::Pipeline<S, H> as actix_web::server::HttpHandlerTask>::poll_io (in /root/Website/target/release/neowutran)
==7927==    by 0x1D9FD4: <actix_web::server::h2::Http2<T, H>>::poll (in /root/Website/target/release/neowutran)
==7927==    by 0x225AAD: <actix_web::server::channel::HttpChannel<T, H> as futures::future::Future>::poll (in /root/Website/target/release/neowutran)
==7927== 
==7927== 589,287,424 bytes in 9 blocks are still reachable in loss record 421 of 421
==7927==    at 0x4C2F0FF: realloc (vg_replace_malloc.c:785)
==7927==    by 0x3EC27D: <alloc::raw_vec::RawVec<T, A>>::reserve (in /root/Website/target/release/neowutran)
==7927==    by 0x3EBE68: bytes::bytes::Inner::reserve (in /root/Website/target/release/neowutran)
==7927==    by 0x3EB5B5: bytes::bytes::BytesMut::extend_from_slice (in /root/Website/target/release/neowutran)
==7927==    by 0x24EEAB: _ZN46_$LT$flate2..zio..Writer$LT$W$C$$u20$D$GT$$GT$4dump17h6d8703cbfffc310eE.llvm.7060922362214069525 (in /root/Website/target/release/neowutran)
==7927==    by 0x25068C: <flate2::zio::Writer<W, D> as std::io::Write>::write (in /root/Website/target/release/neowutran)
==7927==    by 0x263160: std::io::Write::write_all (in /root/Website/target/release/neowutran)
==7927==    by 0x27422E: actix_web::server::output::Output::write (in /root/Website/target/release/neowutran)
==7927==    by 0x191D85: <actix_web::server::h2writer::H2Writer<H> as actix_web::server::Writer>::write (in /root/Website/target/release/neowutran)
==7927==    by 0x1F5264: <actix_web::pipeline::Pipeline<S, H> as actix_web::server::HttpHandlerTask>::poll_io (in /root/Website/target/release/neowutran)
==7927==    by 0x1D9FD4: <actix_web::server::h2::Http2<T, H>>::poll (in /root/Website/target/release/neowutran)
==7927==    by 0x225AAD: <actix_web::server::channel::HttpChannel<T, H> as futures::future::Future>::poll (in /root/Website/target/release/neowutran)

Relaunching valgrind, but this time using rust-tls. I don't expect a different result, though.

fafhrd91 commented 6 years ago

Could you disable brotli and flate2, or set the default content encoding to identity?
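
A sketch of the second option, assuming the 0.7-era response-builder API (the handler name and body are placeholders); the other route would be building actix-web without its compression features:

use actix_web::{http::ContentEncoding, HttpRequest, HttpResponse};

fn plain(_req: &HttpRequest) -> HttpResponse {
    // Identity encoding bypasses the brotli/flate2 paths in the
    // output encoder that the valgrind traces point at.
    HttpResponse::Ok()
        .content_encoding(ContentEncoding::Identity)
        .body("uncompressed")
}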

neowutran commented 6 years ago

I tried to use rust-tls; it compiles but crashes at runtime.

Port: 443
thread 'main' panicked at 'assertion failed: index < len', liballoc/vec.rs:882:9
stack backtrace:
   0:      0x81b5259138e - std::sys::unix::backtrace::tracing::imp::unwind_backtrace::h5553b940c3cc7b09
                               at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1:      0x81b52582c56 - std::sys_common::backtrace::print::hb8b973401c3c6a0a
                               at libstd/sys_common/backtrace.rs:71
                               at libstd/sys_common/backtrace.rs:59
   2:      0x81b5257f00d - std::panicking::default_hook::{{closure}}::h47bd662fc1b4a279
                               at libstd/panicking.rs:211
   3:      0x81b5257ed80 - std::panicking::default_hook::h015c8c74cbefa9d4
                               at libstd/panicking.rs:227
   4:      0x81b5257f67c - std::panicking::rust_panic_with_hook::haf5c3c0a8fd23b25
                               at libstd/panicking.rs:475
   5:      0x81b5257f279 - std::panicking::continue_panic_fmt::h73742afd3ffacebd
                               at libstd/panicking.rs:390
   6:      0x81b5257f175 - rust_begin_unwind
                               at libstd/panicking.rs:325
   7:      0x81b525d4bfb - core::panicking::panic_fmt::h64d514aa7d957863
                               at libcore/panicking.rs:77
   8:      0x81b525d4b2a - core::panicking::panic::hba5f8182eb1c4ea2
                               at libcore/panicking.rs:52
   9:      0x81b51d2496b - <alloc::vec::Vec<T>>::remove::h062a683694d95094
                               at /root/Website/<panic macros>:3
  10:      0x81b51e51911 - neowutran::main::haed1fead754a9bd6
                               at src/main.rs:195
  11:      0x81b51dc2a8f - std::rt::lang_start::{{closure}}::hef9d0eec35f06fe0
                               at /checkout/src/libstd/rt.rs:74
  12:      0x81b5257f112 - std::panicking::try::do_call::h8ed5eafe3e94e158
                               at libstd/rt.rs:59
                               at libstd/panicking.rs:310
  13:      0x81b5259fe99 - __rust_maybe_catch_panic
                               at libpanic_unwind/lib.rs:105
  14:      0x81b52581d25 - std::rt::lang_start_internal::ha4a5edbf3a495792
                               at libstd/panicking.rs:289
                               at libstd/panic.rs:392
                               at libstd/rt.rs:58
  15:      0x81b51dc2a67 - std::rt::lang_start::hf9307b0d4fca7d8b
                               at /checkout/src/libstd/rt.rs:74
  16:      0x81b51e51fd9 - main
  17:     0x68ec2c1d206a - __libc_start_main
  18:      0x81b51cf5b29 - _start
  19:                0x0 - <unknown>

   let mut config = ServerConfig::new(NoClientAuth::new());
   let cert_file = &mut BufReader::new(File::open("/etc/letsencrypt/live/neowutran.ovh/fullchain.pem").unwrap());
   let key_file = &mut BufReader::new(File::open("/etc/letsencrypt/live/neowutran.ovh/privkey.pem").unwrap());
   let cert_chain = certs(cert_file).unwrap();
   let mut keys = rsa_private_keys(key_file).unwrap();
   config.set_single_cert(cert_chain, keys.remove(0)).unwrap();

It crashes on this line:

config.set_single_cert(cert_chain, keys.remove(0)).unwrap();

If you have some code example to share ^^


Disabled brotli and flate2 and relaunched with valgrind. Sending the results later today / tomorrow.

fafhrd91 commented 6 years ago

Seems keys.remove crashes; are you sure keys has values?

neowutran commented 6 years ago

My bad, I used rsa_private_keys instead of pkcs8_private_keys.
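
The working variant, for reference (a sketch; the cert paths are placeholders): the Let's Encrypt key is PKCS#8, so rsa_private_keys returns an empty Vec and keys.remove(0) panics, while pkcs8_private_keys parses it correctly.

use std::fs::File;
use std::io::BufReader;
use rustls::internal::pemfile::{certs, pkcs8_private_keys};
use rustls::{NoClientAuth, ServerConfig};

fn tls_config() -> ServerConfig {
    let mut config = ServerConfig::new(NoClientAuth::new());
    let cert_file = &mut BufReader::new(File::open("fullchain.pem").unwrap());
    let key_file = &mut BufReader::new(File::open("privkey.pem").unwrap());
    let cert_chain = certs(cert_file).unwrap();
    // pkcs8_private_keys instead of rsa_private_keys: the key file is PKCS#8.
    let mut keys = pkcs8_private_keys(key_file).unwrap();
    config.set_single_cert(cert_chain, keys.remove(0)).unwrap();
    config
}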

neowutran commented 6 years ago

No brotli, no flate2, with rustls

==12935== 215,040 bytes in 60 blocks are indirectly lost in loss record 85 of 86
==12935==    at 0x4C2CEDF: malloc (vg_replace_malloc.c:299)
==12935==    by 0x231557: <http::header::map::HeaderMap<T>>::with_capacity (in /root/Website/target/release/neowutran)
==12935==    by 0x1B1A3D: <std::thread::local::LocalKey<T>>::with (in /root/Website/target/release/neowutran)
==12935==    by 0x244C2D: <actix_web::fs::StaticFiles<S, C>>::try_handle (in /root/Website/target/release/neowutran)
==12935==    by 0x23E5B6: <actix_web::fs::StaticFiles<S, C> as actix_web::handler::Handler<S>>::handle (in /root/Website/target/release/neowutran)
==12935==    by 0x1AE670: <actix_web::handler::WrapHandler<S, H, R> as actix_web::handler::RouteHandler<S>>::handle (in /root/Website/target/release/neowutran)
==12935==    by 0x16E31F: <actix_web::router::Router<S>>::handle (in /root/Website/target/release/neowutran)
==12935==    by 0x207D61: <actix_web::pipeline::Pipeline<S, H>>::new (in /root/Website/target/release/neowutran)
==12935==    by 0x19518C: <actix_web::application::HttpApplication<S> as actix_web::server::HttpHandler>::handle (in /root/Website/target/release/neowutran)
==12935==    by 0x1D5A4A: <actix_web::server::h2::Entry<H>>::new (in /root/Website/target/release/neowutran)
==12935==    by 0x1DFBBA: <actix_web::server::h2::Http2<T, H>>::poll (in /root/Website/target/release/neowutran)
==12935==    by 0x22C626: <actix_web::server::channel::HttpChannel<T, H> as futures::future::Future>::poll (in /root/Website/target/release/neowutran)
==12935== 
==12935== 481,521 (512 direct, 481,009 indirect) bytes in 1 blocks are definitely lost in loss record 86 of 86
==12935==    at 0x4C2CEDF: malloc (vg_replace_malloc.c:299)
==12935==    by 0x251690: actix_web::server::message::Request::new (in /root/Website/target/release/neowutran)
==12935==    by 0x23A812: <actix_web::server::settings::WorkerSettings<H>>::get_request (in /root/Website/target/release/neowutran)
==12935==    by 0x1D5731: <actix_web::server::h2::Entry<H>>::new (in /root/Website/target/release/neowutran)
==12935==    by 0x1DFBBA: <actix_web::server::h2::Http2<T, H>>::poll (in /root/Website/target/release/neowutran)
==12935==    by 0x22C626: <actix_web::server::channel::HttpChannel<T, H> as futures::future::Future>::poll (in /root/Website/target/release/neowutran)
==12935==    by 0x2BD83F: futures::task_impl::std::set (in /root/Website/target/release/neowutran)
==12935==    by 0x2B9FF5: <futures::task_impl::Spawn<T>>::poll_future_notify (in /root/Website/target/release/neowutran)
==12935==    by 0x2D2492: tokio::executor::current_thread::CurrentRunner::set_spawn (in /root/Website/target/release/neowutran)
==12935==    by 0x2CB1FC: <tokio::executor::current_thread::scheduler::Scheduler<U>>::tick (in /root/Website/target/release/neowutran)
==12935==    by 0x2D264F: <tokio::executor::current_thread::Entered<'a, P>>::block_on (in /root/Website/target/release/neowutran)
==12935==    by 0x2D5289: <std::thread::local::LocalKey<T>>::with (in /root/Website/target/release/neowutran)
==12935== 
==12935== LEAK SUMMARY:
==12935==    definitely lost: 512 bytes in 1 blocks
==12935==    indirectly lost: 481,009 bytes in 583 blocks
==12935==      possibly lost: 6,048 bytes in 21 blocks
==12935==    still reachable: 175,156 bytes in 427 blocks
==12935==         suppressed: 0 bytes in 0 blocks

Full file at https://neowutran.ovh/valgrind2.err

(Reached ~35% of server memory: around 2 GB.)

neowutran commented 6 years ago

Relaunched the service without valgrind; memory consumption is rising very quickly (0.2% to 3% in ~20 minutes; for comparison, apache httpd is stable at around 0.1%-0.2%).

screenshot

If it can help, the full code is here: https://git.neowutran.ovh:8080/neowutran/Website/src/master/src/main.rs

neowutran commented 6 years ago

Re-ran the same thing:

==18254== LEAK SUMMARY:
==18254==    definitely lost: 0 bytes in 0 blocks
==18254==    indirectly lost: 0 bytes in 0 blocks
==18254==      possibly lost: 107,232 bytes in 536 blocks
==18254==    still reachable: 6,705,296,665 bytes in 48,670 blocks
==18254==                       of which reachable via heuristic:
==18254==                         multipleinheritance: 2,144 bytes in 4 blocks
==18254==         suppressed: 0 bytes in 0 blocks

-> Reached 6 GB of RAM, then crashed out of memory. Valgrind doesn't show anything really interesting, except the memory usage at crash time.

Maybe some arrays are never cleaned. The crash generated a core dump; I will try to check it. Coredump: https://neowutran.ovh/vgcore.18254 Binary: https://neowutran.ovh/neowutran

DoumanAsh commented 6 years ago

Thanks for testing. I guess we do leak memory somewhere continuously, not just once or twice.

I need to check out your valgrind logs

neowutran commented 6 years ago

The definitive report is here: https://neowutran.ovh:8080/valgrind4.err

==11337== 65,702,912 bytes in 422 blocks are still reachable in loss record 359 of 360
==11337==    at 0x4C2F0FF: realloc (vg_replace_malloc.c:785)
==11337==    by 0x161835: <alloc::raw_vec::RawVec<T, A>>::reserve (in /root/Website/target/release/neowutran)
==11337==    by 0x22CE0F: _ZN28_$LT$slab..Slab$LT$T$GT$$GT$6insert17h70243fcb4d25fc8bE.llvm.9927898001551859895 (in /root/Website/target/release/neowutran)
==11337==    by 0x1C2423: h2::proto::streams::prioritize::Prioritize::send_data (in /root/Website/target/release/neowutran)
==11337==    by 0x178FF7: h2::proto::streams::counts::Counts::transition (in /root/Website/target/release/neowutran)
==11337==    by 0x1E1306: <h2::proto::streams::streams::StreamRef<B>>::send_data (in /root/Website/target/release/neowutran)
==11337==    by 0x188511: <actix_web::server::h2writer::H2Writer<H> as actix_web::server::Writer>::poll_completed (in /root/Website/target/release/neowutran)
==11337==    by 0x1F7F2F: <actix_web::pipeline::ProcessResponse<S, H>>::poll_io (in /root/Website/target/release/neowutran)
==11337==    by 0x1EE9EC: <actix_web::pipeline::Pipeline<S, H> as actix_web::server::HttpHandlerTask>::poll_io (in /root/Website/target/release/neowutran)
==11337==    by 0x1DC476: <actix_web::server::h2::Http2<T, H>>::poll (in /root/Website/target/release/neowutran)
==11337==    by 0x229496: <actix_web::server::channel::HttpChannel<T, H> as futures::future::Future>::poll (in /root/Website/target/release/neowutran)
==11337==    by 0x2BE96F: futures::task_impl::std::set (in /root/Website/target/release/neowutran)
==11337== 
==11337== 4,562,506,484 bytes in 18,203 blocks are still reachable in loss record 360 of 360
==11337==    at 0x4C2F0FF: realloc (vg_replace_malloc.c:785)
==11337==    by 0x408CED: <alloc::raw_vec::RawVec<T, A>>::reserve (in /root/Website/target/release/neowutran)
==11337==    by 0x4088D8: bytes::bytes::Inner::reserve (in /root/Website/target/release/neowutran)
==11337==    by 0x408175: bytes::bytes::BytesMut::extend_from_slice (in /root/Website/target/release/neowutran)
==11337==    by 0x2584EC: _ZN9actix_web6server6output16TransferEncoding6encode17hb9e925f590804650E.llvm.13258577972807808252 (in /root/Website/target/release/neowutran)
==11337==    by 0x257893: actix_web::server::output::Output::write (in /root/Website/target/release/neowutran)
==11337==    by 0x18A045: <actix_web::server::h2writer::H2Writer<H> as actix_web::server::Writer>::write (in /root/Website/target/release/neowutran)
==11337==    by 0x1F6F84: <actix_web::pipeline::ProcessResponse<S, H>>::poll_io (in /root/Website/target/release/neowutran)
==11337==    by 0x1EE9EC: <actix_web::pipeline::Pipeline<S, H> as actix_web::server::HttpHandlerTask>::poll_io (in /root/Website/target/release/neowutran)
==11337==    by 0x1DC476: <actix_web::server::h2::Http2<T, H>>::poll (in /root/Website/target/release/neowutran)
==11337==    by 0x229496: <actix_web::server::channel::HttpChannel<T, H> as futures::future::Future>::poll (in /root/Website/target/release/neowutran)
==11337==    by 0x2BE96F: futures::task_impl::std::set (in /root/Website/target/release/neowutran)
==11337== 
==11337== LEAK SUMMARY:
==11337==    definitely lost: 0 bytes in 0 blocks
==11337==    indirectly lost: 0 bytes in 0 blocks
==11337==      possibly lost: 6,336 bytes in 22 blocks
==11337==    still reachable: 4,664,103,242 bytes in 59,899 blocks
==11337==                       of which reachable via heuristic:
==11337==                         multipleinheritance: 536 bytes in 1 blocks
==11337==         suppressed: 0 bytes in 0 blocks
==11337== 

DoumanAsh commented 6 years ago

@neowutran There are surprisingly many different errors detected by valgrind... Thank you for taking the time to collect it.

fafhrd91 commented 6 years ago

@DoumanAsh could you try to debug? I won't have time until next week. Also, on OS X memory consumption returns to normal after a test run.

neowutran commented 6 years ago

If needed, I can reproduce the memory leak easily & quickly on my server (server OS: Arch Linux, quite a lot of traffic).

I haven't really dug into the actix code yet, and I'm still learning Rust. But from valgrind, it seems that this buffer is extended but never cleaned at the end: https://github.com/actix/actix-web/blob/master/src/server/output.rs#L128

DoumanAsh commented 6 years ago

@neowutran I think it might be related to how BytesMut works, as it operates on raw pointers and uses an offset to actually re-create the Vec. But it should be cleaned up when the BytesMut gets destructed...

@fafhrd91 I tried today to run static files without TLS, and there are definitely no problems according to the leak checker. Memory increases a bit over time, but at some point it just stops. I'll try the same with TLS later on.

DoumanAsh commented 6 years ago

@neowutran On a side note, how did you run your tests? Did you just kill your server when you stopped valgrind? It is possible that your server never ran its destructors, and that's why valgrind reports memory as still reachable.

Can you try to re-run valgrind while gracefully closing your application using System::current().stop(); from some special endpoint?
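
A minimal sketch of such an endpoint, assuming the 0.7-era API (the route name and address are placeholders):

use actix_web::{server, App, HttpRequest, HttpResponse};

fn shutdown(_req: &HttpRequest) -> HttpResponse {
    // Stopping the System shuts the server down gracefully, so all
    // destructors run and valgrind's "still reachable" report is meaningful.
    actix::System::current().stop();
    HttpResponse::Ok().body("shutting down")
}

fn main() {
    server::new(|| App::new().resource("/shutdown", |r| r.f(shutdown)))
        .bind("127.0.0.1:8080")
        .unwrap()
        .run();
}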

neowutran commented 6 years ago

@DoumanAsh Before, I just killed it with a SIGINT (kill -2 $PID).


Added a route to gracefully close it: https://git.neowutran.ovh:8080/neowutran/Website (please check that it is correctly done).

Waited until memory consumption reached ~1.3 GB (it only needed 15 minutes to get there), then called https://neowutran.ovh/shutdown. The service stopped as expected.

Valgrind result is here: https://neowutran.ovh/valgrind5.err / https://neowutran.ovh:8080/valgrind5.err

Valgrind command used: valgrind --leak-check=full --show-reachable=yes --show-leak-kinds=all ./target/release/neowutran 443 > /root/valgrind.log 2> /root/valgrind.err

DoumanAsh commented 6 years ago

@neowutran Thank you, I wanted to be sure that all destructors would run before looking into reachable memory. I'll try to reproduce it on my Windows machine and will dig into the valgrind output too.

neowutran commented 6 years ago

Did another test: updated the dependencies to the latest on master (didn't change anything except cargo update and switching from bind_ssl to bind_with). It didn't leak memory anymore (didn't look at valgrind this time, but no huge memory consumption). But for some reason, all requests switched from HTTP2 (before the cargo update) to HTTP1 (after the cargo update).

However, after the cargo update (and the automatic switch from HTTP2 to HTTP1), it created a new issue: #454

fafhrd91 commented 6 years ago

I think I found the bug. I'll try to fix it today; it is a bit tricky, though.

fafhrd91 commented 6 years ago

@neowutran could you try master?

neowutran commented 6 years ago

Memory usage keeps increasing constantly for the moment, but much, much slower than before.

Will generate a new valgrind output

neowutran commented 6 years ago

https://neowutran.ovh:8080/valgrind6.err

Memory consumption still increases constantly, but much slower than before (ran ~10h before stopping it). Same code as before, just updated to the latest changes on master. Stopped the service by calling /shutdown.

DoumanAsh commented 6 years ago

Thanks!

DoumanAsh commented 6 years ago

@neowutran Just letting you know that so far I haven't found any lead on the memory leak (I don't have enough time these days), but I haven't forgotten about it 😄

Neopallium commented 6 years ago

FYI, I have found memory leaking with the simple "hello-world" from the examples. It seems to happen when the BytesMut buffer is resized (bytes::bytes::Inner::reserve). I have only been testing with HTTP/1.1. @neowutran's valgrind output also shows leaks in bytes::bytes::Inner::reserve.

When many keep-alive requests are received on the same connection, the buffer keeps growing.

An easy way to trigger the memory leak is to send requests with many headers. I ran an HTTP stress-test utility with 20k of header data and 15,000 concurrent connections. That caused the simple hello-world app to use more than 1.2 GB of memory.

I have been trying to track down how the BytesMut memory is leaking. I am new to Rust, so it hasn't been easy. It might be a problem in the "bytes" crate, but I haven't been able to switch to a local copy of that crate (too many cross-dependencies between tokio, actix, h2, etc.).

I have checked the HttpChannel, Http1, and RequestPool objects to make sure they are being freed. So far I haven't found a live reference to the BytesMut buffer inside actix-web, which is why I think it might be a problem with how BytesMut tracks its internal byte-array references.

Also, it would be nice if each connection didn't hold on to an empty 8k buffer after the request has been parsed; it would be better to return it to a pool, like what is done with the Request objects (a sketch of the idea follows).
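
A hypothetical sketch of that pooling idea (the names and pool cap are made up, not actix-web API):

use bytes::BytesMut;
use std::cell::RefCell;

thread_local! {
    // Per-thread pool of read buffers, mirroring how Request objects are pooled.
    static BUF_POOL: RefCell<Vec<BytesMut>> = RefCell::new(Vec::new());
}

fn take_buf() -> BytesMut {
    BUF_POOL
        .with(|p| p.borrow_mut().pop())
        .unwrap_or_else(|| BytesMut::with_capacity(8 * 1024))
}

fn release_buf(mut buf: BytesMut) {
    buf.clear();
    BUF_POOL.with(|p| {
        let mut pool = p.borrow_mut();
        // Cap the pool so it cannot itself grow without bound.
        if pool.len() < 128 {
            pool.push(buf);
        }
    });
}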

neowutran commented 6 years ago

What tool did you use for the HTTP stress test?

Neopallium commented 6 years ago

I used https://github.com/wg/wrk. It has support for custom scripts written in Lua.

Here is the script I created (scripts/large_request.lua):

-- Generate about 20k of junk headers.
local h = wrk.headers
local data = ('a'):rep(1024) -- 1k of 'a'
for i = 1, 20 do
  h['Data-' .. i] = data
end

Run with:

./wrk -t 2 -c 15000 -d 60 -s ./scripts/large_request.lua http://localhost:8080/

neowutran commented 6 years ago

Thanks :)

And another way, without large headers: https://github.com/tsenart/vegeta

echo "GET https://{remote server}/{bigfile}" | vegeta attack 

In ~10 seconds, server memory consumption reaches a few gigabytes and doesn't get cleaned up after the end of the stress test.

It also works on localhost, but it is less impressive:

wrk -c 10000 -t 4 -d 200 http://localhost:8080/{bigfile}

Neopallium commented 6 years ago

FYI, I was wrong earlier about BytesMut causing the leak. I just tracked the usage of the HttpChannel (actix-web/src/server/channel.rs) & Http1 (actix-web/src/server/h1.rs) objects by adding print statements and a Drop impl. It seems that some client disconnects are not causing the HttpChannel and Http1 objects to be freed; see the sketch below.
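
The tracking approach, roughly (a hypothetical probe, not actix-web code): bump a counter on creation, decrement it in Drop, and watch whether the count returns to zero after disconnects.

use std::sync::atomic::{AtomicUsize, Ordering};

static LIVE: AtomicUsize = AtomicUsize::new(0);

struct LeakProbe; // embed one of these in the object being tracked

impl LeakProbe {
    fn new() -> LeakProbe {
        let n = LIVE.fetch_add(1, Ordering::SeqCst) + 1;
        println!("created, live = {}", n);
        LeakProbe
    }
}

impl Drop for LeakProbe {
    fn drop(&mut self) {
        let n = LIVE.fetch_sub(1, Ordering::SeqCst) - 1;
        println!("dropped, live = {}", n);
    }
}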

Neopallium commented 6 years ago

Found the problem: https://github.com/actix/actix-web/blob/d39dcc58cde683efbe0482ebf406fa227d472c24/src/server/mod.rs#L403

If the client closes the connection right after sending some data (another request), then the read_available() method will not notify the calling code of the disconnect (stream.read returning 0).

This patch might not be the correct fix, since it ignores the data left in the buffer. But it does fix the memory leak.

diff --git a/src/server/mod.rs b/src/server/mod.rs
index 36d85a7..f9194de 100644
--- a/src/server/mod.rs
+++ b/src/server/mod.rs
@@ -400,7 +400,7 @@ pub trait IoStream: AsyncRead + AsyncWrite + 'static {
                 match self.read(buf.bytes_mut()) {
                     Ok(n) => {
                         if n == 0 {
-                            return Ok(Async::Ready(!read_some));
+                            return Ok(Async::Ready(true));
                         } else {
                             read_some = true;
                             buf.advance_mut(n);
neowutran commented 6 years ago

Did this modification fix the leak on your side?

I tested the modification against:

echo "GET https://{remote server}/{bigfile}" | vegeta attack 

It's still leaking a lot.

Neopallium commented 6 years ago

Are you testing the HTTP2 or HTTP1 protocol? I haven't looked into the HTTP2 code yet and haven't done any testing with HTTP2.

I am working on a better fix right now. An HTTP client can close its side of the TCP socket right after sending a request; the HTTP server should still process the request and send the response before closing the server side of the TCP socket.

Neopallium commented 6 years ago

FYI, I have created a fork https://github.com/Neopallium/actix-web where I will post an improved fix.

Neopallium commented 6 years ago

Also, make sure the TCP connections have been fully closed:

lsof -p `pgrep server-bin-name` | grep TCP

Neopallium commented 6 years ago

I pushed a better fix (c3ae9997fc126988e7aec80453886b824b9d7b36) that allows received HTTP requests to be processed before the server side is closed.

@neowutran The memory leaks you are seeing from:

echo "GET https://{remote server}/{bigfile}" | vegeta attack 

might be from the http response buffers growing to hold the '{bigfile}'. Can you send valgrind output?

Neopallium commented 6 years ago

FYI, here is a pull request for the disconnect fix: https://github.com/actix/actix-web/pull/483

@neowutran, I looked at your last valgrind results (https://neowutran.ovh:8080/valgrind6.err) and it looks like there is some leaking (or growing buffers) when sending large HTTP responses.

After doing some testing here, it seems there is more memory leaking in both HTTP1 and HTTP2 response writing. I don't know when I will have time to look into this more.

fafhrd91 commented 6 years ago

@neowutran could you try master? @Neopallium thanks! Great work!

I will check the memory leak this week.

neowutran commented 6 years ago

I can reproduce the leak on localhost, but most of the RAM used gets cleaned by the destructors when the application exits, so the valgrind output seems a bit less interesting.

I used this command to force the leak on localhost: wrk -c 15000 -t 2 -d 200 http://localhost:8080/{bigfile}

DoumanAsh commented 6 years ago

Hm, that's even more complex then... It seems we accumulate some objects... Could it be related to sockets? I'm not sure.