Hi @gvoysey, this might occur if the sampled batch contains a very small cloud. Typically, this could be that:

- your dataset contains a cloud with very few points, or
- the random sampling happened to select a region holding very few superpoints.

Both of these situations may lead to spurious graphs with only 1 node after calling `SampleSubNodes` and `SampleRadiusSubgraphs`. This can be problematic at any level of the partition (level-0 excluded) because, as of now, the code is not robust to these edge cases where single-node or empty graphs may be passed.
For deeper investigation, I suggest you save the `NAG` to disk in `OnTheFlyHorizontalEdgeFeatures` if one of the partition levels has no `edge_index` (obviously enough, you need to do so before the error-prone call to `_on_the_fly_horizontal_edge_features`). Even better, try to capture which cloud it comes from, to be able to reproduce this error consistently and investigate the problem more deeply.
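A minimal sketch of that instrumentation, to be placed just before the error-prone call (the dump path is arbitrary, and `NAG.save` is assumed to behave as in `src/data/nag.py`):

```python
# Sketch only: place inside OnTheFlyHorizontalEdgeFeatures, just before
# the call to _on_the_fly_horizontal_edge_features. Assumes NAG.save as
# in src/data/nag.py; the dump path is arbitrary
for i in range(1, nag.num_levels):
    if nag[i].edge_index is None or nag[i].edge_index.shape[1] == 0:
        # Save the offending NAG to disk for post-mortem inspection
        nag.save(f'/tmp/offending_nag_level_{i}.h5')
```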
We'll add that instrumentation and update. Is there a reliable way to detect this on the fly? If it happens during preprocessing, I'm curious why it doesn't arise every epoch, either.
This is likely a stochastic event happening at training time only, linked to the fact that `SampleSubNodes` and `OnTheFlyHorizontalEdgeFeatures` are designed for randomized batch construction. I suspect the issue is the conjunction of one of these and a specific cloud with outlying superpoints.

A reliable way of detecting it at train time is what I mentioned above. I suggest you try this first and find the cloud(s)/tile(s) from which this error occurred. Once isolated, it will be easier to investigate the issue.
Sounds good. My plan is to walk the preprocessed tree of *.h5 files with `src.data.nag:Nag.load(...)` -- does this approach seem like it has any gotchas? I remember there are some subtleties in handling `src.data.Data` w/r/t which keys get reinflated.
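Roughly something like this (sketch only; assumes `NAG.load` behaves as in `src/data/nag.py`, and `root` is a placeholder for our preprocessed directory):

```python
# Rough sketch of the scan: walk the preprocessed tree and flag any
# partition level with a missing or empty horizontal graph
from pathlib import Path

from src.data.nag import NAG

for path in sorted(Path(root).rglob('*.h5')):
    nag = NAG.load(str(path))
    for i in range(1, nag.num_levels):
        if nag[i].edge_index is None or nag[i].edge_index.shape[1] == 0:
            print(f'{path}: level {i} has no edge_index')
```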
I do not think this will return any `NAG` objects with empty `data.edge_index`. As mentioned above, the issue arises only at training time, due to the conjunction of some unfortunate samplings and superpoint graphs with few nodes / few neighbors. This is borne out by the fact that you train for many epochs before the error randomly occurs.

So, use `datamodule.train_dataloader()` to get a dataloader and loop over it multiple times instead:
```python
# Loop over as many epochs as you'd like, until you encounter the error.
# Assumes `datamodule` and `dataset` from your lightning setup are in scope
from src.data.nag import NAGBatch

num_trial_epochs = 100

for _ in range(num_trial_epochs):
    # Reset the dataloader at each epoch
    dataloader = datamodule.train_dataloader()
    for nag_list in dataloader:
        # Need to do this manually here because we are not using
        # lightning's training loop syntax
        nag = NAGBatch.from_nag_list([nag.cuda() for nag in nag_list])
        nag = dataset.on_device_transform(nag)
        # Test whatever
        for i in range(1, nag.num_levels):
            if nag[i].edge_index is None or nag[i].edge_index.shape[1] == 0:
                # Do something to store the data somewhere, e.g.:
                nag.save(f'/tmp/offending_nag_level_{i}.h5')
                # Ideally, you would be able to recover which
                # preprocessed file it came from, but I leave this up to you
```
Ah, OK! That looks like a closer replication of the error environment, while still being heaps faster!
Hi @gvoysey, have you solved this issue? May I close it?
I think you can close this. We weren't able to add safeguards in the superpoint code to catch and continue when `data.edge_index == None`, but we were able to fully train a model after adjusting the gradient accumulator scheduling and our preprocessing tiling strategy.

I'm not amazingly confident in this approach, since it's very much a rough heuristic that doesn't account for scene complexity, but we used a combination of rejecting small files and tiling large ones such that each lidar file in our train, val, and test sets contained a number of points in the closed interval `[150_000, 2_000_000]`.
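For illustration, the rejection half of that heuristic amounts to something like this (sketch only; reads point counts from LAS/LAZ headers with laspy, while the tiling itself was done with lastile):

```python
# Rough sketch of the file-rejection heuristic: keep only LAS/LAZ files
# whose point count falls in the closed interval [150_000, 2_000_000]
import laspy

MIN_POINTS, MAX_POINTS = 150_000, 2_000_000

def keep_tile(path):
    """Return True if `path` has an acceptable point count."""
    try:
        with laspy.open(path) as reader:
            n = reader.header.point_count
    except Exception:
        # Treat malformed files as rejects
        return False
    return MIN_POINTS <= n <= MAX_POINTS
```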
That let us train for ~800 epochs. Quality of results is TBD and depends on many factors, but it didn't crash!
I see, thanks for the feedback! Well, that circumvented the problem. If you ever happen to isolate a problematic file, I can try to look into it. In the meantime, I am closing this issue.
We're training on a large dataset (~70 GB, 2,493,270,064 points, 80/10/10 train/test/val split). We've got some pre-superpoint-transformer preprocessing in place -- primarily, using lastile and lasground_new to pretile the set so that each tile is between 74,934 and 6,674,441 points and is flattened, and rejecting las files that are too small or otherwise malformed.
Custom dataset YAML for `datamodule` and `experiment` as follows:

`datamodule`:

`experiment`:
When training with this dataset, we get between 100 and 200 epochs in and then hit the following crash:
So at some point at `_on_the_fly_horizontal_edge_features` or above, `data.edge_index` is `None`.