FreeRTOS / FreeRTOS-Kernel

FreeRTOS kernel files only, submoduled into https://github.com/FreeRTOS/FreeRTOS and various other repos.
https://www.FreeRTOS.org
MIT License
2.76k stars 1.12k forks source link

[BUG] xTaskCreateStatic() crash due to prvAddNewTaskToReadyList() and lack of valid pxCurrentTCB #825

Closed cmorganBE closed 1 year ago

cmorganBE commented 1 year ago

Describe the bug prvAddNewTaskToReadyList() expects a valid pxCurrentTCB but this is null.

[New Thread 0x7ffff7fbc940 (LWP 697305)]
[New Thread 0x7ffff7fb4280 (LWP 697306)]

Thread 2 "system_tests" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff7fbe680 (LWP 697298)]
0x0000555555577349 in prvGetThreadFromTask (xTask=0x0) <-------------------------
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/portable/ThirdParty/GCC/Posix/port.c:92
92      StackType_t * pxTopOfStack = *( StackType_t ** ) xTask;
(gdb) bt
#0  0x0000555555577349 in prvGetThreadFromTask (xTask=0x0)
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/portable/ThirdParty/GCC/Posix/port.c:92
#1  0x00005555555777a1 in prvPortYieldFromISR ()
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/portable/ThirdParty/GCC/Posix/port.c:307
#2  0x00005555555777cd in vPortYield ()
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/portable/ThirdParty/GCC/Posix/port.c:317
#3  0x00005555555721a2 in prvAddNewTaskToReadyList (pxNewTCB=0x7ffff7facb80)
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/tasks.c:1785
#4  0x0000555555571d2d in xTaskCreateStatic (pxTaskCode=0x55555558f215 <State<true>::managementThread(void*)>, 
    pcName=0x55555569ccdf "State", ulStackDepth=4096, pvParameters=0x7ffff7facb70, uxPriority=5, 
    puxStackBuffer=0x7ffff7facc20, pxTaskBuffer=0x7ffff7facb80)
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/tasks.c:1217
#5  0x000055555558c3cc in State<true>::initialize (this=0x7ffff7facb70)
    at /home/cmorgan/projects/cmd001_1/system_tests/../main/State.h:136
#6  0x0000555555587d66 in Manager<true>::initialize (timeoutTimer=..., treatmentTimer=..., holdTimer=..., 
    reservoirPump=..., bladderValve=..., adc=..., reservoirPressureSensor=..., bladderPressureSensor=..., i2c=..., 
    ltc2943=..., pca9532=..., rtc=..., audio=..., gpioISRToTask=..., state=..., sensorData=...)
    at /home/cmorgan/projects/cmd001_1/system_tests/../main/Manager.h:223
#7  0x00005555555838b1 in SystemSim::initialize (this=0x7ffff7fac6d0)
    at /home/cmorgan/projects/cmd001_1/system_tests/SystemSim.h:50
#8  0x00005555555a7bcf in CATCH2_INTERNAL_TEST_0 ()
    at /home/cmorgan/projects/cmd001_1/system_tests/tests/test_405_bladder.cpp:11
#9  0x00005555556354d8 in Catch::(anonymous namespace)::TestInvokerAsFunction::invoke (this=0x5555557b89a0)
    at src/catch2/internal/catch_test_registry.cpp:58
#10 0x000055555561db45 in Catch::TestCaseHandle::invoke (this=0x7ffff0000cd0)
    at src/catch2/../catch2/catch_test_case_info.hpp:115
#11 0x000055555561c8bc in Catch::RunContext::invokeActiveTestCase (this=0x7ffff7fbd9d0)
    at src/catch2/internal/catch_run_context.cpp:545
#12 0x000055555561c5e0 in Catch::RunContext::runCurrentTest (this=0x7ffff7fbd9d0, redirectedCout="", 
    redirectedCerr="") at src/catch2/internal/catch_run_context.cpp:508
#13 0x000055555561ac10 in Catch::RunContext::runTest (this=0x7ffff7fbd9d0, testCase=...)
    at src/catch2/internal/catch_run_context.cpp:239
#14 0x00005555555c413d in Catch::(anonymous namespace)::TestGroup::execute (this=0x7ffff7fbd9c0)
    at src/catch2/catch_session.cpp:110
#15 0x00005555555c5622 in Catch::Session::runInternal (this=0x7ffff7fbdca0) at src/catch2/catch_session.cpp:332
#16 0x00005555555c512c in Catch::Session::run (this=0x7ffff7fbdca0) at src/catch2/catch_session.cpp:263
#17 0x000055555556d475 in Catch::Session::run<char> (this=0x7ffff7fbdca0, argc=1, argv=0x7fffffffe018)
    at /home/cmorgan/projects/cmd001_1/Catch2/src/catch2/../catch2/catch_session.hpp:41
#18 0x000055555556d3d0 in app_main (params=0x7fffffffdee0)
    at /home/cmorgan/projects/cmd001_1/system_tests/main.cpp:10
#19 0x000055555556d62d in app_main_wrapper (params=0x7fffffffdee0) at /home/cmorgan/projects/cmd001_1/sim/sim.c:13
#20 0x0000555555577a9d in prvWaitForStart (pvParams=0x7ffff7fbefe8)
    at /home/cmorgan/projects/cmd001_1/FreeRTOS/FreeRTOS/Source/portable/ThirdParty/GCC/Posix/port.c:477
#21 0x00007ffff788f6ba in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
--Type <RET> for more, q to quit, c to continue without paging--
#22 0x00007ffff791e0d0 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Target

Host

To Reproduce Call xTaskCreateStatic() from app_main().

Expected behavior xTaskCreateStatic() would work.

moninom1 commented 1 year ago

HI @cmorganBE , Thank you for reporting the bug.

Can you please help me know when you are calling xTaskCreateStatic from the app_main() are you passing pxTaskBuffer as NULL? That is why hitting the above issue?

From the details mentioned here, If neither puxStackBuffer or pxTaskBuffer are NULL then the task will be created, and the task's handle is returned. If either puxStackBuffer or pxTaskBuffer is NULL then the task will not be created and NULL will be returned.

Can you please give more details on how you are calling xTaskCreateStatic.

cmorganBE commented 1 year ago

@moninom1 let me modify the demo app to use the functions we are using and confirm it reproduces. If so I can push that to a fork to show an example, and if not it could help me figure out if its a usage error on our side. The same code is running on target hardware though... should have something by tomorrow

Skptak commented 1 year ago

@cmorganBE Hey, looking at your stack trace it looks like you're hitting an interrupt while creating that new task?

The first line of tasks.c:xTaskCreateStatic() is taskENTER_CRITICAL. This is used to stop interrupts from coming through:

    static void prvAddNewTaskToReadyList( TCB_t * pxNewTCB )
    {
        /* Ensure interrupts don't access the task lists while the lists are being
         * updated. */
        taskENTER_CRITICAL();

taskENTER_CRITICAL() then gets mapped to vPortEnterCritical

 void vPortEnterCritical( void )
{
    if( uxCriticalNesting == 0 )
    {
        vPortDisableInterrupts();
    }

    uxCriticalNesting++;
}

But looking at your posted stack trace it looks like your call to prvAddNewTaskToReadyList() is instead triggering a call to vPortYield()? Where performing this before getting to execute the code in prvAddNewTaskToReadyList() that would set pxCurrentTCB to be the new TCB could be the root of your problem

    static void prvAddNewTaskToReadyList( TCB_t * pxNewTCB )
    {
        /* Ensure interrupts don't access the task lists while the lists are being
         * updated. */
        taskENTER_CRITICAL();
        {
            uxCurrentNumberOfTasks++;

            if( pxCurrentTCB == NULL )
            {
                /* There are no other tasks, or all the other tasks are in
                 * the suspended state - make this the current task. */
                pxCurrentTCB = pxNewTCB;

Can you potentially step through your existing code in a debugger and just ensure that the correct order of functions is occuring?

cmorganBE commented 1 year ago

@Skptak note that I'm using the posix gcc port here. I'm hitting #826 when building here but once I'm able to build I'll adjust the demo to attempt to reproduce.

Looking through our code here I think this may be related to the system state being reused. We are using catch2 and each test case will create all threads and then they'd be shut down at the end of the test case. However, all of this is done from within a single app_main() that we call from a task we create from our simulation executable's main() like:

int main(int argc, char *argv[])
{
    COMMAND_LINE_ARGS command_line_args = {.argc = argc, .argv = argv};

   /* Do not include trace code when performing a code coverage analysis. */
    #if ( projCOVERAGE_TEST != 1 )
        {
            /* Initialise the trace recorder.  Use of the trace recorder is optional.
             * See http://www.FreeRTOS.org/trace for more information. */
            vTraceEnable( TRC_START );

            /* Start the trace recording - the recording is written to a file if
             * configASSERT() is called. */
            printf( "\r\nTrace started.\r\nThe trace will be dumped to disk if a call to configASSERT() fails.\r\n" );

            #if ( TRACE_ON_ENTER == 1 )
                printf( "\r\nThe trace will be dumped to disk if Enter is hit.\r\n" );
            #endif
            traceSTART();
        }
    #endif /* if ( projCOVERAGE_TEST != 1 ) */

        TaskHandle_t taskHandle;
        xTaskCreate( app_main_wrapper,             /* The function that implements the task. */
                     "app_main_wrapper",                      /* The text name assigned to the task - for debug only as it is not >
                     configMINIMAL_STACK_SIZE,        /* The size of the stack to allocate to the task. */
                     &command_line_args,               /* The parameter passed to the task - not used in this simple case. */
                     mainQUEUE_RECEIVE_TASK_PRIORITY, /* The priority assigned to the task. */
                     &taskHandle );                          /* The task handle is not required, so NULL is passed. */

    // Start the real time scheduler.
    vTaskStartScheduler();

    printf("vTaskStartScheduler() returned\n");

    printf("deleting taskHandle\n");
    vTaskDelete(taskHandle);

    printf("exiting main\n");
    return 0;
}

The failure occurs on the second test case that is run, the first run of xTaskCreateStatic() of the object is successful, but the next instance of the same object for the second test case has the failure as seen above.

cmorganBE commented 1 year ago

I guess on that topic, @Skptak, how should we be cleaning up all of the threads so we can re-run initialization from inside of the same process and start with a clean slate? There are actually a fair number of valgrind leaks due to threads that are ended (by calling vTaskDelete on themselves), but seemed to be stuck in a kind of cancelled state.

Is there a good location for a test for that? I'd be interested in writing some tests for the posix implementation that run under CI/CD and valgrind to replicate the issue so it can be investigated. We ended up having to use suppressions for the various leaks.

RichardBarry commented 1 year ago

As per the guidance when opening a git issue - please discuss this on the forums before discussing here, and open a bug report if that is the conclusion of the forum discussion - thanks.