Open samdporter opened 3 weeks ago
@samdporter can you give some more detail? How did you run the data_partition function? Ideally with a code snippet. Did you see GPU errors such as the following?
cudaMalloc returned error no CUDA-capable device is detected (code 100), line(57)
Hey Kris,
The partitioner was used in the same way as in the example files (in fact I saw the same behaviour when using main_ISTA.py).
The error was "segmentation fault (core dumped)", exactly the same as I've previously seen when using the partitioner without setting AcquisitionData.set_storage_scheme('memory'). This only ever occurred when using the partitioner and an edge-gpu docker container.
    class Submission(ISTA):
        def __init__(self, data: Dataset, update_objective_interval=10):
            """
            Initialisation function, setting up data & (hyper)parameters.
            """
            # Very simple heuristic to determine the number of subsets
            self.num_subsets = calculate_subsets(data.acquired_data, min_counts_per_subset=2**20, max_num_subsets=16)
            update_interval = self.num_subsets
            # 10% decay per update interval
            decay_perc = 0.1
            decay = (1/(1-decay_perc) - 1)/update_interval
            beta = 0.5
            # error only ever occurs here
            _, _, obj_funs = partitioner.data_partition(data.acquired_data, data.additive_term,
                                                        data.mult_factors, self.num_subsets, mode='staggered',
                                                        initial_image=data.OSEM_image)
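As a side note on the decay arithmetic in the snippet above, here is a small worked check. It assumes the decay value feeds a step-size rule of the form step_k = step_0 / (1 + decay * k); that rule is not shown in the post, so step_size below is a hypothetical helper and num_subsets = 16 is just the maximum allowed by the heuristic.

```python
def step_size(initial_step, decay, iteration):
    # Assumed rule (not confirmed by the post): step_k = step_0 / (1 + decay * k)
    return initial_step / (1 + decay * iteration)


num_subsets = 16            # plays the role of update_interval in the snippet
decay_perc = 0.1            # "10% decay per update interval"
decay = (1 / (1 - decay_perc) - 1) / num_subsets

# Ratio of the step after one full update interval to the initial step
ratio = step_size(1.0, decay, num_subsets) / step_size(1.0, decay, 0)
print(ratio)  # approximately 0.9
```

Under this assumed rule the step shrinks by 10% over the first interval only; later intervals shrink it by slightly less than 10% each, since the rule is not geometric.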
AcquisitionData.set_storage_scheme('memory') is currently required for the subsets. I'd have hoped it would generate a warning as opposed to a crash.
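The kind of warning wished for here could be sketched as a small guard that runs before partitioning. This is only a sketch and not part of SIRF: ensure_memory_storage is a hypothetical helper, and it assumes the class passed in offers static get_storage_scheme() and set_storage_scheme() methods, as sirf.STIR.AcquisitionData is expected to. FakeAcquisitionData is a stand-in so the example runs without SIRF installed.

```python
import warnings


def ensure_memory_storage(acq_data_cls):
    """Warn and switch to in-memory storage if another scheme is active.

    acq_data_cls is assumed to offer static get_storage_scheme() and
    set_storage_scheme(); returns the scheme that was active on entry.
    """
    previous = acq_data_cls.get_storage_scheme()
    if previous != 'memory':
        warnings.warn(
            f"storage scheme is {previous!r}; switching to 'memory' before partitioning",
            RuntimeWarning,
        )
        acq_data_cls.set_storage_scheme('memory')
    return previous


class FakeAcquisitionData:
    """Stand-in for the real class so the sketch runs without SIRF."""
    _scheme = 'file'

    @staticmethod
    def get_storage_scheme():
        return FakeAcquisitionData._scheme

    @staticmethod
    def set_storage_scheme(scheme):
        FakeAcquisitionData._scheme = scheme


# Emits a RuntimeWarning and leaves the stand-in in 'memory' mode
ensure_memory_storage(FakeAcquisitionData)
```

With SIRF installed, the call would presumably be ensure_memory_storage(AcquisitionData), placed before partitioner.data_partition(...).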
Can you confirm you had crashes with "memory" on?
@samdporter can you please confirm here that
- you saw no cudaMalloc errors
- you saw the crash with main_ISTA, both on your edge-gpu docker image and when you submitted it (if there's an explicit job/tag you could refer to, that'd be great)

- I saw no cudaMalloc errors
- I saw the crash with main_ISTA on my edge-gpu docker image but never attempted to submit main_ISTA. I saw this issue using my algorithms on my edge-gpu docker image and when submitting. Here is the job tag for the most recent submission. (I have just resubmitted the job in a container running on my machine and it's working fine, which is a bit confusing).
Unfortunately far too late to do anything about it now...
I'm seeing intermittent segmentation faults caused by the partitioner.data_partition function. It's only apparent when using the edge-gpu docker image and I haven't seen it before today, but this could possibly have been down to luck as I can't see an obvious culprit in any recent commits. I don't see this when I run locally.
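Since the crash is intermittent and appears only in the container, one way to gather more detail is Python's standard faulthandler module, which writes the Python-level traceback when the process receives a fatal signal such as SIGSEGV. The sketch below is a suggestion rather than something used in this thread; enable_crash_tracebacks and the log path are hypothetical names.

```python
import faulthandler


def enable_crash_tracebacks(log_path):
    """Write Python tracebacks for all threads to log_path on a fatal signal."""
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    # The caller must keep this handle open for as long as the handler is enabled
    return log_file


# Hypothetical usage, before the call that crashes:
# crash_log = enable_crash_tracebacks("partitioner_crash.log")
# _, _, obj_funs = partitioner.data_partition(...)
```

Running the script with `python -X faulthandler` enables the same handler with output on stderr, without changing the code.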