OpenFastPath / ofp

OpenFastPath project
BSD 3-Clause "New" or "Revised" License

If I run my application for the second time, i will always receive a SIGSEGV #288

Open LinArcX opened 9 months ago

LinArcX commented 9 months ago

I created a test application that has a structure like this:

    if (0 == odp_init_global(&instance, NULL, NULL)) {
        printf("odp_init_global: success!\n");

        if (0 == odp_init_local(instance, ODP_THREAD_CONTROL)) {
            printf("odp_init_local: success!\n");
            ofp_init_global_param(&app_init_params);
        }
        else {
            printf("Error: ODP local init failed.\n");
            odp_term_global(instance);
        }
    }
    else {
        printf("Error: ODP global init failed.\n");
    }

I put the above lines inside the constructor of my class FOO, and I do the following in the destructor:

    ofp_term_local();
    ofp_term_global();
    odp_term_local();
    if (m_instance) {
        odp_term_global(m_instance);
    }

When I ran my application for the first time, everything was OK. If I close my application and try to rerun it, it crashes, and this is the backtrace output in gdb:

(gdb) bt
#0  0xffffcd4c in ?? ()
#1  0xf798afd2 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#2  0xf75fd306 in clone () from /lib/i386-linux-gnu/libc.so.6

Steps before I run my application:

  1. I set hugepage number: sudo echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  2. I mount hugepages: sudo mount -t hugetlbfs pagesize=1GB /mnt/huge

I also thought that maybe, after the first run, some processes, file descriptors or something else were left on my system, but htop shows no leftover process.

Also, I tried to remove everything in these directories:

      sudo rm -r /mnt/huge/0/
      sudo rm -r /dev/shm/0/

But I still receive SIGSEGV on the second run. Did I miss some steps in the cleanup process?

It's worth mentioning that I'm developing my application inside WSL (Debian Buster). Could that be causing the issue? Are there any restrictions on WSL?

bogdanPricope commented 9 months ago

I have no idea about wsl....

Else, what's with:

    if (m_instance) {
        odp_term_global(m_instance);
    }

The instance is a number, actually a pid (getpid())

typedef uint64_t odp_instance_t;

LinArcX commented 9 months ago

The m_instance part is not my problem. As I said, on the first run everything is OK.

My concern is this part:

I also thought that maybe, after the first run, some processes, file descriptors or something else were left on my system, but htop shows no leftover process.

I want to know whether anything is left over on the system after an application finishes.

JereLeppanen commented 9 months ago

Are you able to run any of the examples or tests for a second time?

  1. I set hugepage number: sudo echo 512 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
  2. I mount hugepages: sudo mount -t hugetlbfs pagesize=1GB /mnt/huge

This may not be related to the problem at hand, but you are reserving 2M pages while mounting 1G pages.
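To make the two page sizes concrete: the paths quoted in this thread show one pool per page size under /sys/kernel/mm/hugepages/. The sketch below is a hypothetical helper (`hugepage_pool_path` is an illustration, not part of ODP or OFP) that builds the pool path for a page size given in kB, so the pool being reserved can be compared with the page size intended at mount time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of ODP/OFP: writes the sysfs path of the
 * hugepage pool for a page size given in kB into buf.
 * Returns 0 on success, -1 if the buffer is too small. */
int hugepage_pool_path(unsigned long page_kb, char *buf, size_t len)
{
    assert(buf != NULL);

    int n = snprintf(buf, len,
                     "/sys/kernel/mm/hugepages/hugepages-%lukB/nr_hugepages",
                     page_kb);

    /* snprintf reports the length it wanted to write; treat truncation
     * as an error rather than returning a partial path. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Called with 2048 it yields the 2M pool path used in the original report; called with 1048576 it yields the 1G pool path mentioned later in the thread.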


LinArcX commented 9 months ago

Are you able to run any of the examples or tests for a second time?

I couldn't even run webserver2 for the first time.

Maybe this is not related to the problem at hand, but you are reserving 2M pages, but mounting 1G pages.

Do you mean I should put a number here: /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages ?

Actually, I put 11 there and I still have this issue.

JereLeppanen commented 9 months ago

I couldn't even run webserver2 for the first time.

Some of the examples, including webserver2, appear to have bugs in thread creation. Thank you for reporting that.

How about for example test/cunit/ofp_test_init?

Do you mean I should put a number here: /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages ?

Might be better to use the default 2M page size. I usually do it like this:

    echo 1000 > /proc/sys/vm/nr_hugepages
    mkdir -p /mnt/huge
    mount -t hugetlbfs nodev /mnt/huge


LinArcX commented 9 months ago

I just found something that may be related to my problem. After calling odp_init_global(), odp_init_local() and odph_thread_create(), I never called odph_thread_join().

My application continues to run until, somewhere in my code, it tries to create another thread like this:

    pthread_attr_t tattr;
    if (pthread_attr_init(&tattr)) {
        throw Exception("Awww");
    }

After the pthread_attr_init(&tattr) call, the application crashes.

Could this be the reason for the crash?

bogdanPricope commented 9 months ago

I have tried 'webserver2' with @JereLeppanen's fix plus further changes (ofp_term_global(); odp_term_local(); odp_term_global(instance); 'linux_sigaction', etc.), but I cannot reproduce this error on my "Ubuntu 22.04 LTS".

Maybe it is related to WSL or to the application.

On pthread_attr_init(): remember to call pthread_attr_destroy(), and do not call pthread_attr_init() twice on the same object.
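The advice about pthread_attr_init() can be sketched as follows. This is a minimal illustration using only standard POSIX calls; `worker` and `run_worker_once` are hypothetical names, and it does not involve ODP or OFP at all.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body, used only for illustration. */
static void *worker(void *arg)
{
    int *flag = arg;
    *flag = 1;
    return NULL;
}

/* Initializes the attribute object once, creates and joins one thread, and
 * destroys the attribute object exactly once.
 * Returns 0 on success or the first pthread error code. */
int run_worker_once(int *flag)
{
    assert(flag != NULL);

    pthread_attr_t attr;
    pthread_t tid;

    int rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;              /* init failed, so there is nothing to destroy */

    rc = pthread_create(&tid, &attr, worker, flag);

    /* The attribute object is no longer needed once pthread_create()
     * has returned, whether it succeeded or not. */
    int rc_destroy = pthread_attr_destroy(&attr);

    if (rc != 0)
        return rc;

    rc = pthread_join(tid, NULL);
    if (rc != 0)
        return rc;

    return rc_destroy;
}
```

Each pthread_attr_init() is paired with exactly one pthread_attr_destroy(), and the same object is never initialized twice without a destroy in between.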

Else, out of curiosity, what exactly are you trying to do (what use case)?

You may also have a look at this more advanced implementation: https://github.com/NetInoSoftware/nfp

LinArcX commented 9 months ago

I am not talking about webserver2. My use case is very simple: I have a huge application into which I am trying to integrate ofp and odp, and I run into crashes in certain cases. Let me clarify the flow of the crash.

At the beginning of the application, before doing anything else, I set up ofp/odp like this:

    odp_init_global();
    odp_init_local();
    ofp_init_global();
    ofp_init_local();
    odph_thread_create();
    ...

My application starts to run and continues. As I said, somewhere in our application we call pthread_attr_init() like this:

    pthread_attr_t tattr;
    if (pthread_attr_init(&tattr)) {
        throw Exception("Awww");
    }

And exactly this place is where my application will crash.

My questions are clear:

  1. Should I call odph_thread_join() after odph_thread_create()?
  2. What is the relation between my crash and pthread_attr_init()? (I want to know why this happens internally. Why do ofp and odp cause this problem? If I remove ofp and odp from my application, I never see this crash.)

JannePeltonen commented 9 months ago

Hi,

Are you not calling ofp_init_local() in every ofp thread at start? If not, anything can happen.

should i call odph_thread_join() after odph_thread_create()?

It depends on what you are trying to do. If you do not want to wait for a thread to exit, then you should not call it.

what is the relation of my crash and pthread_attr_init()?

I have no idea.

why ofp and odp cause this problem?

Maybe the problem is not caused by ofp and odp but by your code that uses them somehow incorrectly?

                       Janne

JannePeltonen commented 9 months ago

Are you not calling ofp_init_local() in every ofp thread at start? If not, anything can happen.

LinArcX commented 9 months ago

How can I know how many ofp threads I have?

JannePeltonen commented 9 months ago

With "ofp thread" I meant a thread that you create and in which you call ofp. IOW, if you create a thread and intend to call ofp functions in it, the first ofp function you call in the thread must be ofp_init_local().

LinArcX commented 9 months ago

Oh, that's so hard.

What about setting this parameter: thr_params.start = default_event_dispatcher;? Won't it do the same thing?

bogdanPricope commented 9 months ago

'default_event_dispatcher' does indeed call ofp_init_local(). The question is: do you have other threads that use the ofp API? Those threads should be created with odph_thread_create() and should call ofp_init_local() at the beginning.

Note: odph_thread_create() calls odp_init_local() and odp_term_local() under the hood. This is why you should use this API for those threads.

LinArcX commented 9 months ago

@bogdanPricope awesome tips, thank you. Just one thing: I'm using processes instead of pthreads for thread_model: https://github.com/OpenDataPlane/odp/blob/master/helper/include/odp/helper/threads.h#L166

Should I still follow your approach? I mean, should I call odph_thread_create() and ofp_init_local() at the beginning of each thread?

bogdanPricope commented 9 months ago

For processes you should use odph_linux_process_fork() (or odph_linux_process_fork_n()). Under the hood it calls odp_init_local() in the child process. That means you should call ofp_init_local() when the child process starts and ofp_term_local() / odp_term_local() when the child ends.

LinArcX commented 9 months ago

I am using the same thr_common.instance that I used for odp_init_global() for my other thread as well, but I get this error:

[New Thread 0xd07fdb40 (LWP 21150)]
E 21 4136856064 thread.cpp:123] SUCCESS: odph_thread_create()
ERR: odp_init.c:611:odp_init_local(): Bad instance.
threads.c:56:run_thread(): Local init failed
[Thread 0xd17ffb40 (LWP 21148) exited]
ERR: odp_init.c:611:odp_init_local(): Bad instance.
threads.c:56:run_thread(): Local init failed
[Thread 0xd0ffeb40 (LWP 21149) exited]
[New Thread 0xcfbffb40 (LWP 21151)]
[New Thread 0xcfbffb40 (LWP 21151)]
E 27 4136856064 thread.cpp:123] SUCCESS: odph_thread_create()

And this is my code for another thread:

    if (0 == ofp_init_local()) {
        odph_thread_t thread_tbl[MAX_WORKERS];
        odph_thread_param_t thr_params;
        odph_thread_common_param_t thr_common;
        memset(thread_tbl, 0, sizeof(thread_tbl));
        /* Start dataplane dispatcher worker threads */
        odph_thread_param_init(&thr_params);
        thr_params.start = default_event_dispatcher;
        thr_params.arg = (void*)ofp_eth_vlan_processing;
        thr_params.thr_type = ODP_THREAD_WORKER;
        odph_thread_common_param_init(&thr_common);
        thr_common.instance = MySingletoonClass::getInstance()->ofpInstance();
        thr_common.cpumask = &cpumask;
        thr_common.share_param = 1;

        if (num_workers == odph_thread_create(thread_tbl, &thr_common, &thr_params, num_workers)) {
            OFP_ERR("SUCCESS: odph_thread_create() .\n");

        }
        else {
            OFP_ERR("Error: odph_thread_create() failed.\n");
        }
    }

This is my code for the main thread (beginning of my application):

    if (0 == odp_init_global(&m_instance, NULL, NULL)) {
        if (0 == odp_init_local(m_instance, ODP_THREAD_CONTROL)) {
            ofp_global_param_t app_init_params;
            ofp_init_global_param(&app_init_params);
            int num_workers = 1;
            char cpumaskstr[64];
            odp_cpumask_t cpumask;
            num_workers = odp_cpumask_default_worker(&cpumask, num_workers);
            if (odp_cpumask_to_str(&cpumask, cpumaskstr, sizeof(cpumaskstr)) < 0) {
                OFP_ERR("Error: Too small buffer provided to odp_cpumask_to_str");
            }
            OFP_INFO("Num worker threads: %i", num_workers);
            OFP_INFO("First CPU:          %i", odp_cpumask_first(&cpumask));
            OFP_INFO("CPU mask:           %s", cpumaskstr);

            char interface[25];
            strncpy(interface, "eth0", sizeof(interface)-1);

            char* interfaces[] = {interface};
            app_init_params.if_count = 1;
            app_init_params.if_names = interfaces;

            if (app_init_params.pktin_mode != ODP_PKTIN_MODE_SCHED) {
                app_init_params.pktin_mode = ODP_PKTIN_MODE_SCHED;
            }
            switch (app_init_params.sched_sync) {
            case ODP_SCHED_SYNC_PARALLEL:
                OFP_WARN("Warning: Packet order is not preserved with parallel RX queues\n");
                break;
            case ODP_SCHED_SYNC_ATOMIC:
                break;
            case ODP_SCHED_SYNC_ORDERED:
                if (app_init_params.pktout_mode != ODP_PKTOUT_MODE_QUEUE) {
                    OFP_WARN("Warning: Packet order is not preserved with ordered RX queues and direct TX queues.\n");
                }
                break;
            default:
                OFP_WARN("Warning: Unknown scheduling synchronization mode. Forcing atomic mode.\n");
                app_init_params.sched_sync = ODP_SCHED_SYNC_ATOMIC;
                break;
            }

            app_init_params.pkt_hook[OFP_HOOK_LOCAL] = fastpath_local_hook;

            if (0 == ofp_init_global(m_instance, &app_init_params)) {
                if (0 == ofp_init_local()) {
                    odph_thread_t thread_tbl[MAX_WORKERS];
                    odph_thread_param_t thr_params;
                    odph_thread_common_param_t thr_common;
                    memset(thread_tbl, 0, sizeof(thread_tbl));
                    /* Start dataplane dispatcher worker threads */
                    odph_thread_param_init(&thr_params);
                    thr_params.start = default_event_dispatcher;
                    thr_params.arg = (void*)ofp_eth_vlan_processing;
                    thr_params.thr_type = ODP_THREAD_WORKER;
                    odph_thread_common_param_init(&thr_common);
                    thr_common.instance = m_instance;
                    thr_common.cpumask = &cpumask;
                    thr_common.share_param = 1;
                    //thr_common.sync = 1;
                    thr_common.thread_model = 1;

                    if (num_workers == odph_thread_create(thread_tbl, &thr_common, &thr_params, num_workers)) {
                        // some internal process
                    }
                    else {
                        OFP_ERR("Error: odph_thread_create() failed.\n");
                    }
                }
                else {
                    OFP_ERR("Error: OFP local init failed.");
                }
            }
            else {
                OFP_ERR("Error: OFP global init failed.");
            }
        }
        else {
            OFP_ERR("Error: ODP local init failed.");
        }
    }
    else {
        OFP_ERR("Error: ODP global init failed.");
    }

bogdanPricope commented 9 months ago

Recap:

Question: How is this 'another thread' started? Is it a thread or a process, and if it is a process, when was it forked and with which API?

    int odp_init_local(odp_instance_t instance, odp_thread_type_t thr_type)
    {
        enum init_stage stage = NO_INIT;

        if (instance != (odp_instance_t)odp_global_ro.main_pid) {
            ODP_ERR("Bad instance.\n");
            goto init_fail;
        }

        .......

Either 'odp_global_ro.main_pid' is not initialized or 'instance' is invalid. You may try to print MySingletoonClass::getInstance()->ofpInstance() (cast it to pid_t or int) and see if it is valid.
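The suggested check can be sketched as a small stand-alone debugging aid. It assumes only what is stated in this thread: the instance is a uint64_t that holds a pid. The names `instance_t`, `instance_matches_pid` and `print_instance` are hypothetical and are not ODP API; in a real application the value passed in would be whatever odp_init_global() stored, and the expected pid would be that of the process that called it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Stand-in for the typedef quoted earlier in the thread
 * (typedef uint64_t odp_instance_t). */
typedef uint64_t instance_t;

/* Hypothetical helper: returns 1 if the instance value equals the given
 * pid, mirroring the comparison shown in odp_init_local(), else 0. */
int instance_matches_pid(instance_t instance, pid_t expected_pid)
{
    return instance == (instance_t)expected_pid;
}

/* Hypothetical helper: prints the instance as a pid next to the pid of the
 * calling process, so an invalid or uninitialized value is easy to spot. */
void print_instance(FILE *out, instance_t instance)
{
    assert(out != NULL);
    fprintf(out, "instance=%ld, current pid=%ld\n",
            (long)(pid_t)instance, (long)getpid());
}
```

If the printed instance is 0 or some unrelated number, the value handed to thr_common.instance was probably read before odp_init_global() filled it in, or it comes from a different object than the one that was initialized.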