SyneRBI / PETRIC

PET Image Reconstruction Challenge 2024
https://www.ccpsynerbi.ac.uk/events/petric/

intermittent segmentation faults caused by `partitioner.data_partition` #120

Open samdporter opened 3 weeks ago

samdporter commented 3 weeks ago

Unfortunately far too late to do anything about it now...

I'm seeing intermittent segmentation faults caused by the `partitioner.data_partition` function. It only appears when using the edge-gpu Docker image, and I hadn't seen it before today. That may just have been luck, though, as I can't see an obvious culprit in any recent commits.

I don't see this when I run locally.
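A minimal sketch of one way to get more information out of an intermittent crash like this, using Python's standard-library `faulthandler` module to dump the Python stack when the process receives a fatal signal such as SIGSEGV. The helper name `enable_crash_traceback` and the log file name are hypothetical, and nothing here is taken from the PETRIC or SIRF code:

```python
import faulthandler
import sys


def enable_crash_traceback(log_path=None):
    """Turn on faulthandler so a fatal signal dumps the Python stacks.

    log_path is a hypothetical file name chosen by the caller. If it is
    None, the dump goes to stderr. The returned file object (if any) must
    stay open for as long as faulthandler is enabled.
    """
    if log_path is None:
        faulthandler.enable(file=sys.stderr, all_threads=True)
        return None
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


if __name__ == "__main__":
    # Call this before the code that crashes, e.g. before partitioning.
    crash_log = enable_crash_traceback("segfault_traceback.log")
```

The same effect is available without code changes by starting the interpreter with `python -X faulthandler`. The dump only shows the Python-level call stack, so it narrows down which call was active rather than explaining the fault inside the native library.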

KrisThielemans commented 3 weeks ago

@samdporter can you give some more detail? How did you run the data_partition function? Ideally code snippet. Did you see GPU errors such as

cudaMalloc returned error no CUDA-capable device is detected (code 100), line(57)

samdporter commented 3 weeks ago

Hey Kris, the partition was used in the same way as in the example files (in fact I saw the same behaviour when using main_ISTA.py). The error was segmentation fault (core dumped), exactly the same as I've previously seen when using the partitioner without setting AcquisitionData.set_storage_scheme('memory'). This only ever occurred when using the partitioner and an edge-gpu docker container.

class Submission(ISTA):

    def __init__(self, data: Dataset, update_objective_interval=10):
        """
        Initialisation function, setting up data & (hyper)parameters.
        """
        # Very simple heuristic to determine the number of subsets
        self.num_subsets = calculate_subsets(data.acquired_data, min_counts_per_subset=2**20, max_num_subsets=16)
        update_interval = self.num_subsets
        # 10% decay per update interval
        decay_perc = 0.1
        decay = (1/(1-decay_perc) - 1)/update_interval
        beta = 0.5

        # error only ever occurs here
        _, _, obj_funs = partitioner.data_partition(data.acquired_data, data.additive_term,
                                                    data.mult_factors, self.num_subsets, mode='staggered',
                                                    initial_image=data.OSEM_image)

KrisThielemans commented 3 weeks ago

AcquisitionData.set_storage_scheme('memory') is currently required for the subsets. I'd have hoped it would generate a warning as opposed to a crash.

Can you confirm you had crashes with "memory" on?
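As a rough sketch of what "memory on" means in practice, the storage scheme can be set once before any call to `partitioner.data_partition`. The helper name `ensure_memory_storage` is hypothetical, and the snippet assumes that `AcquisitionData` is the class exposed by `sirf.STIR`; it falls back gracefully when SIRF is not installed:

```python
def ensure_memory_storage():
    """Select the 'memory' storage scheme for SIRF acquisition data.

    Returns True if the scheme was set, or False if SIRF could not be
    imported (assumption: AcquisitionData lives in sirf.STIR).
    """
    try:
        import sirf.STIR as STIR
    except ImportError:
        return False
    # Must run before acquisition data is created or partitioned.
    STIR.AcquisitionData.set_storage_scheme('memory')
    return True


if __name__ == "__main__":
    if not ensure_memory_storage():
        print("SIRF not available; storage scheme left unchanged")
```

Calling this at the top of the script, before the data is loaded, would rule out the storage scheme as the cause of a crash in the partitioner.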

KrisThielemans commented 2 weeks ago

@samdporter can you please confirm here that

samdporter commented 2 weeks ago