Open alexsr opened 3 years ago
Same here.
When I run examples/classification_modelnet40.py, I get the same error at the same location: self.F[X.inverse_mapping(self.coordinate_map_key).long()]
My environment: CUDA 11.1, PyTorch 1.9, Ubuntu 18.04, ME 0.5.4.
To reproduce, just run examples/classification_modelnet40.py without changing anything. It fails when training reaches the 307th iteration.
pred_batched = [pred[row_idx] for row_idx in out.decomposition_permutations if row_idx.shape[0] != 0]
pred_batched = [p[inv_mappings] for p in pred_batched]
You can change the --network argument from the default minkfcnn to minksplatfcnn to avoid this error. Actually, ME.TensorField often causes GPU backward issues, as the author said, but I'm a newbie to this fancy project, so I have no idea how to modify the minkfcnn network to work with ME.SparseTensor input.
To illustrate my issue, here is the forward method of minkfcnn in classification_modelnet40.py:
def forward(self, x: ME.TensorField):
    x = self.mlp1(x)
    y = x.sparse()
    y = self.conv1(y)
    y1 = self.pool(y)
    y = self.conv2(y1)
    y2 = self.pool(y)
    y = self.conv3(y2)
    y3 = self.pool(y)
    y = self.conv4(y3)
    y4 = self.pool(y)
    x1 = y1.slice(x)
    x2 = y2.slice(x)
    x3 = y3.slice(x)
    x4 = y4.slice(x)
    x = ME.cat(x1, x2, x3, x4)
    y = self.conv5(x.sparse())
    x1 = self.global_max_pool(y)
    x2 = self.global_avg_pool(y)
    return self.final(ME.cat(x1, x2)).F
If TensorField is changed to SparseTensor, the ME.cat operation will not work. However, using the inverse mapping in slice(x) to convert an ME.SparseTensor into an ME.TensorField is not allowed in the training phase. So I have no idea how to fix this error in the ME.cat operation.
coords = ME.utils.batched_coordinates([c for c in coords_mapped], dtype=torch.int32)
Is the snippet about pred_batched necessary for the backward pass? (It is not accessed in prediction. Does it affect the backward process?)
I used pred_batched so that I had both the predictions and the inputs as batches. That way I could do visualization or whatever on the batches. The code snippet is just meant to show how to recover batches if necessary.
The outputs used for the loss (backward step) don't have to be in batch format.
Remember, the code snippet I posted here is just a quick fix I came up with for the issues I faced while trying to use TensorField, which did not work for me.
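For readers unsure what "recovering batches" means here, the following is a minimal, library-free sketch of the idea. It does not call MinkowskiEngine; the helpers quantize and split_by_batch are hypothetical stand-ins for the role that the inverse mappings and out.decomposition_permutations play in the snippet, and the toy points, labels, and voxel size are made up for illustration.

```python
from collections import defaultdict


def quantize(points, voxel_size):
    """Hypothetical helper: return the unique voxels of one sample and,
    for every input point, the row index of the voxel it fell into
    (the "inverse mapping")."""
    voxel_to_row = {}
    inverse = []
    for point in points:
        voxel = tuple(int(c // voxel_size) for c in point)
        row = voxel_to_row.setdefault(voxel, len(voxel_to_row))
        inverse.append(row)
    return list(voxel_to_row), inverse


def split_by_batch(batch_indices, rows):
    """Hypothetical helper: group flat per-voxel rows by their batch index."""
    groups = defaultdict(list)
    for b, row in zip(batch_indices, rows):
        groups[b].append(row)
    return [groups[b] for b in sorted(groups)]


# Two toy samples (assumed data), quantized with a voxel size of 0.5.
points_per_sample = [
    [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.2, 0.1, 0.1)],
    [(0.6, 0.6, 0.6), (0.7, 0.9, 0.6)],
]
quantized = [quantize(points, 0.5) for points in points_per_sample]
voxels0, inverse0 = quantized[0]
inverse_maps = [inverse for _, inverse in quantized]

# Flat network output: one prediction per voxel, tagged with its batch index.
batch_indices = [0, 0, 1]
voxel_preds = ["chair", "table", "floor"]

# Step 1: split the flat predictions into one list per sample.
pred_batched = split_by_batch(batch_indices, voxel_preds)
# Step 2: expand each sample's voxel predictions back to its original points.
point_preds = [
    [preds[i] for i in inverse]
    for preds, inverse in zip(pred_batched, inverse_maps)
]
```

Both steps happen after the forward pass and only rearrange the output, which is consistent with the point above that the loss can be computed on the flat, unbatched output.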
I fixed this error by using torch 1.10.0 with CUDA 11.3.
I have the same error
I had the same error too when using MinkowskiEngine at tag v0.5.4. I then checked out commit 02fc608bea4c0549b0a7b00ca1bf15dee4a0b228 and reinstalled MinkowskiEngine, and the error disappeared.
Can you release another version? The 0.5.4 release is misleading, and it wasted a lot of my time due to this error. I assumed the latest release would be relatively stable, so I chose the latest release, Minkowski Engine 0.5.4.
Hello, first and foremost, thank you for creating this library! I may have found a bug, but I also found a workaround / solution and thought I'd share it here.
TL;DR: When using TensorField while training, sometimes the back-prop through the inverse mapping fails. To get around this issue, use SparseTensor instead and only do inverse mapping for predictions using the map from sparse_collate.
Describe the bug
I am using MinkowskiEngine 0.5.4 with PyTorch 1.9.0 and CUDA 11.1 in my personal project on the ScanNet dataset. While training I have run into the following error a few times now:
This error did not occur while running predictions. I also looked at the issues #283 and #299. However, the current README states that the issues that might have caused these errors are resolved in the current version of the library. Running my script with torch.autograd.detect_anomaly() enabled, I was able to find the location that caused the error:
To Reproduce
Here are the relevant parts of my code that are used for collation, creation of the TensorField, the forward method of my model (MinkUNet34C) and the loss computation. The current state of my framework is quite complicated, therefore I cannot really share the complete code. But this bug should probably also occur when using the ScanNet demo for training.
Now, this works perfectly fine for most ScanNet scans. However, there are a few, e.g. scene_0029_00, where the crash occurs. It seems that there is an issue with backpropagation through IndexBackward.
I was able to get around this error by using SparseTensor, sparse_collate to get the mappings and inverse mappings, and doing the inverse mapping myself in case I needed the prediction output. These snippets are supposed to replace the equivalent snippets in the upper code sample:
Additional context
As an aside, the example in the Training Tutorial shows this code snippet:
when it should be:
The example in the function documentation is also not up to date here, as it should return unique_coords or use return_maps_only=True as an arg: