IMRCLab / crazyswarm2

A Large Quadcopter Swarm
MIT License

thread.join() stuck in swarm.py #522

Open jvilinsky opened 4 months ago

jvilinsky commented 4 months ago

Hi, I'm running into a problem: when I run more than one Crazyflie, startup sometimes works and sometimes gets stuck on line 259, in the parallel_safe function in swarm.py. It seems as if the threads get stuck for some reason, so .join() never returns.

Function for reference:

    def parallel_safe(self, func, args_dict=None):
        """
        Execute a function for all Crazyflies in the swarm, in parallel.
        One thread per Crazyflie is started to execute the function. The
        threads are joined at the end and if one or more of the threads raised
        an exception this function will also raise an exception.

        For a more detailed description of the arguments, see `sequential()`

        :param func: The function to execute
        :param args_dict: Parameters to pass to the function
        """
        threads = []
        reporter = self.Reporter()

        for uri, scf in self._cfs.items():
            args = [func, reporter] + \
                self._process_args_dict(scf, uri, args_dict)

            thread = Thread(target=self._thread_function_wrapper, args=args)
            threads.append(thread)
            thread.start()

        for thread in threads:
            thread.join()

        if reporter.is_error_reported():
            first_error = reporter.errors[0]
            raise Exception('One or more threads raised an exception when '
                            'executing parallel task') from first_error
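
Editor's note: one way to see which thread is blocking is to join with a deadline instead of joining unconditionally. The following is only a sketch using the standard library; `join_with_deadline` and the thread names are hypothetical and are not part of cflib or Crazyswarm2.

```python
import threading
import time


def join_with_deadline(threads, timeout_s):
    """Join all threads, giving up after timeout_s seconds in total.

    Returns the names of the threads that are still alive at the deadline.
    """
    deadline = time.monotonic() + timeout_s
    for thread in threads:
        # Each join only gets whatever is left of the shared time budget.
        remaining = max(0.0, deadline - time.monotonic())
        thread.join(timeout=remaining)
    return [t.name for t in threads if t.is_alive()]


if __name__ == "__main__":
    # Stands in for a connection that never completes (assumption for the demo).
    never_set = threading.Event()

    workers = [
        threading.Thread(target=time.sleep, args=(0.05,), name="cf-ok", daemon=True),
        threading.Thread(target=never_set.wait, name="cf-stuck", daemon=True),
    ]
    for worker in workers:
        worker.start()

    stuck = join_with_deadline(workers, timeout_s=0.5)
    print("still running:", stuck)
```

A timed join does not stop the stuck thread; it only lets the caller regain control and report which worker did not finish.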

knmcguire commented 4 months ago

So, this seems to be an issue in the cflib backend of Crazyswarm2? This code is not part of the crazyswarm2 codebase itself

jvilinsky commented 4 months ago

Yes, sorry, it is used in crazyflie_server.py on line 213. It gets stuck right before the servers and subscriptions are created.

    # Now all crazyflies are initialized, open links!
    try:
        self.swarm.open_links()
    except Exception as e:
        # Close node if one of the Crazyflies can not be found
        self.get_logger().info("Error!: One or more Crazyflies can not be found. ")
        self.get_logger().info("Check if you got the right URIs, if they are turned on" +
                               " or if your script have proper access to a Crazyradio PA")
        exit()

Thanks for the quick reply!

knmcguire commented 4 months ago

So there aren't any error messages? For example, one saying that it is not able to connect to one of the URIs?

jvilinsky commented 4 months ago

No error messages, which is why I'm so stuck. I narrowed it down to a threading problem, as mentioned before: Thread.join() from the threading module blocks until the thread finishes, so if a thread never finishes, join() never returns. It seems like there might be an infinite loop somewhere. It could also be getting stuck at self._connect_event.wait() in the open_link function, which runs in the threads (line 90 of SyncCrazyflie.py in cflib.crazyflie).
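
Editor's note: the suspected mechanism can be reproduced in isolation. An `Event.wait()` call without a timeout blocks until some other thread calls `set()`; with a timeout it returns `False` on expiry. The sketch below uses a hypothetical `FakeLink` class to illustrate this. It is not cflib code and makes no claim about how cflib's `open_link` is actually implemented beyond the `wait()` call mentioned above.

```python
import threading


class FakeLink:
    """Hypothetical stand-in for a connection object; not cflib code."""

    def __init__(self):
        self._connect_event = threading.Event()

    def on_connected(self):
        # In a real system this would be called from a callback thread.
        self._connect_event.set()

    def open_link(self, timeout_s=None):
        # With timeout_s=None this blocks forever if on_connected() never runs.
        # With a timeout, wait() returns False on expiry instead of hanging.
        if not self._connect_event.wait(timeout=timeout_s):
            raise TimeoutError(f"no connection after {timeout_s} s")


if __name__ == "__main__":
    link = FakeLink()
    try:
        link.open_link(timeout_s=0.2)  # nobody calls on_connected()
    except TimeoutError as exc:
        print("open_link gave up:", exc)

    link2 = FakeLink()
    threading.Timer(0.05, link2.on_connected).start()
    link2.open_link(timeout_s=1.0)  # returns once the timer fires
    print("second link connected")
```

If a missed connected or failed callback is the cause, a bounded wait like this would turn the silent hang into an exception that the `try`/`except` around `open_links()` could report.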

knmcguire commented 4 months ago

Unfortunately, I'm not able to recreate your issue. In general, it takes time for the Crazyradio to download all the parameter/log TOCs from the Crazyflies before it reports that it is fully connected, and that time is multiplied by the number of Crazyflies you connect to.

But I've never seen it get stuck before. Which OS are you running it from? Python threading is messy, and if you were running this from a VM with limited resources, I would expect some issues.

jvilinsky commented 4 months ago

That makes sense. The setup I'm using is:

OS: Ubuntu 22.04.4 LTS x86_64
CPU: 12th Gen Intel i7-12700K
GPU: NVIDIA GeForce RTX 3070 Ti
Memory: 5159MiB / 31878MiB

knmcguire commented 4 months ago

Thanks for sharing the information!

This seems like a very capable computer, so I don't think that is the issue. Which version of Python do you have installed, and which version of the CFlib do you have?

jvilinsky commented 4 months ago

I have Python version 3.10.12 and, I think, CFlib version 0.1.25.1.

knmcguire commented 3 months ago

Alright.. that's also exactly what I have.

Unfortunately, we can't recreate it at the moment, so the best option for now is just to restart the server, however ugly that solution is. I haven't seen this happen in the CI either, so perhaps some timing issues are causing this as well.

I'll keep it open here so that others can pitch in and let us know if they also experience the same issue.
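
Editor's note: since the hang cannot be reproduced on demand, a stack dump taken while the process is stuck would show where each thread is blocked. Below is a minimal sketch using only the Python standard library; `dump_all_thread_stacks` and the thread name are hypothetical, and nothing here is specific to cflib or Crazyswarm2.

```python
import sys
import threading
import traceback


def dump_all_thread_stacks(out=sys.stderr):
    """Write the current stack of every Python thread to `out`."""
    names = {t.ident: t.name for t in threading.enumerate()}
    for ident, frame in sys._current_frames().items():
        print(f"--- thread {names.get(ident, ident)} ---", file=out)
        traceback.print_stack(frame, file=out)


if __name__ == "__main__":
    # A thread that blocks forever, standing in for a stuck worker.
    blocker = threading.Event()
    threading.Thread(target=blocker.wait, name="cf-stuck", daemon=True).start()

    # The dump lists each thread together with the line it is waiting on.
    dump_all_thread_stacks()
```

The standard library's `faulthandler.dump_traceback_later(timeout)` offers a similar dump without custom code: called near the start of a script, it writes all thread tracebacks to stderr if the process is still running after `timeout` seconds.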