drprojects / superpoint_transformer

Official PyTorch implementation of Superpoint Transformer introduced in [ICCV'23] "Efficient 3D Semantic Segmentation with Superpoint Transformer" and SuperCluster introduced in [3DV'24 Oral] "Scalable 3D Panoptic Segmentation As Superpoint Graph Clustering"
MIT License

Is preprocessing parallelization possible? #132

Open gvoysey opened 4 days ago

gvoysey commented 4 days ago

We've been able to use concurrent.futures.ProcessPoolExecutor to handle some additional preprocessing that we perform before using superpoint transformer at all, and we get a huge speedup by doing independent work in parallel instead of serially per lasfile.

We're now training a model on roughly 10,000 lidar files, and are consequently spending a lot of time in dataset preprocessing before training begins (estimated at roughly 50 h). I think this is because, while the individual CPU and GPU preprocessing steps shown here are fast on their own (built against OpenMP or CUDA), files are still pushed through the preprocessing pipeline serially.

I've had a quick look at the pytorch-lightning docs, but I'm not sure I've found anything there that answers this question concretely -- do you know if there's a way to parallelize the preprocessing steps? Lasfiles are independent of each other, so there's no problem there, but I'm not sure how to express the equivalent of a ProcessPoolExecutor in torch lightning to handle an I/O-bound pipeline.
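For reference, here's roughly what our current per-file parallelization looks like (a minimal sketch; `preprocess_one` and `preprocess_all` are hypothetical stand-ins for our own pre-SPT step, not anything from superpoint_transformer):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from pathlib import Path


def preprocess_one(las_path):
    """Hypothetical per-file, CPU-bound preprocessing step.

    Stands in for whatever work we run on each lasfile before
    superpoint_transformer ever sees it.
    """
    # ... real work would read and transform the lasfile here ...
    return las_path.stem


def preprocess_all(las_dir, max_workers=8):
    """Process every .las file in a directory across worker processes."""
    paths = sorted(Path(las_dir).glob("*.las"))
    results = []
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(preprocess_one, p): p for p in paths}
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```

This works well for us precisely because each file's work is independent and CPU-bound; the question is whether the same pattern can apply to the SPT preprocessing pipeline.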

drprojects commented 3 days ago

Hi @gvoysey ! I understand your need to accelerate the preprocessing on your large dataset. I agree that in general the preprocessing takes longer than I would like (also true for the on-device transforms), especially compared to the speed of the forward pass on GPU. Yet, I think processing multiple tiles on a single machine will not work, for the simple reason that several preprocessing operations are already parallelized and try to use as much of your CPU/GPU resources as possible. While it might be possible to have multiple threads processing different tiles while sharing CPU resources, I think sharing GPU resources will be a pain and will likely break.

Here is a rough breakdown of the sensitive steps of the DALES preprocessing you mention:

pre_transform:
    SaveNodeIndex          # 
    DataTo                 # 
    GridSampling3D         # GPU-hungry 
    KNN                    # GPU-hungry
    DataTo                 # 
    GroundElevation        # 
    PointFeatures          # CPU-hungry-ish
    DataTo                 # 
    AdjacencyGraph         # 
    ConnectIsolated        # 
    DataTo                 # 
    AddKeysTo              # 
    CutPursuitPartition    # CPU-hungry, will try to use as many cores as possible
    NAGRemoveKeys          # 
    NAGTo                  # 
    SegmentFeatures        # GPU-hungry
    RadiusHorizontalGraph  # GPU-hungry
    NAGTo                  # 

Depending on your GPU/CPU memory and the size of your individual tiles, several of these operations are prone to clogging up all your resources. You might want to investigate whether the preprocessing of your tiles ever uses all your CPU or GPU memory. If it doesn't, you might want to increase your tile size so each tile makes better use of your resources. Suggested tools for such investigation: htop and the torch profiler.
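For a quick spot-check short of htop or the torch profiler, you can wrap an individual step and report its wall time and peak resident memory. A minimal, Unix-only sketch (stdlib only, not part of this repo; `report_usage` is a hypothetical helper):

```python
import resource
import sys
import time
from contextlib import contextmanager


@contextmanager
def report_usage(label):
    """Print wall time and peak RSS around a preprocessing step.

    A lightweight stand-in for htop / the torch profiler. Unix-only,
    since it relies on resource.getrusage. Note that ru_maxrss is the
    peak for the whole process so far, not just this step.
    """
    t0 = time.perf_counter()
    yield
    dt = time.perf_counter() - t0
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux, but in bytes on macOS
    peak_mib = peak / 1024 if sys.platform != "darwin" else peak / (1024 * 1024)
    print(f"{label}: {dt:.2f}s, peak RSS {peak_mib:.0f} MiB")
```

Usage would look like `with report_usage("GridSampling3D"): ...` around a single transform; for GPU memory, `torch.cuda.max_memory_allocated()` gives the analogous peak.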

Yet, I agree that when the GPU is busy, the CPU resources could be used to do something else... But coding a program that efficiently distributes both CPU and GPU resources across concurrent processes may be tough. At the moment, I do not know how/if this would be possible. It somewhat resembles what a torch DataLoader does: CPU workers asynchronously prepare batches while the model consumes them on GPU. So if you are dead set on this, you could have a look at how torch does it 😅 Still, our problem differs from that setup in several ways.
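For illustration, the producer/consumer pattern a DataLoader implements can be sketched with stdlib threads and a bounded queue (a toy sketch, not a drop-in for this pipeline; `prepare` and `consume` are hypothetical stand-ins for the CPU-side and GPU-side work):

```python
import queue
import threading


def producer_consumer(items, prepare, consume, n_workers=4):
    """Worker threads prepare items asynchronously while the main
    thread consumes them, DataLoader-style."""
    # Bounded queue: producers block when it is full, giving backpressure
    # so CPU workers cannot race arbitrarily far ahead of the consumer
    q = queue.Queue(maxsize=2 * n_workers)

    def worker(chunk):
        for item in chunk:
            q.put(prepare(item))  # CPU-side preparation

    # Round-robin split of the work across workers
    chunks = [items[i::n_workers] for i in range(n_workers)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()

    # The GPU-side work would happen here, one prepared item at a time
    results = [consume(q.get()) for _ in items]

    for t in threads:
        t.join()
    return results
```

The hard part, as noted above, is that in this pipeline the "prepare" stage itself wants the GPU, so the clean CPU-producer/GPU-consumer split does not map onto it directly.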

That being said, it would be possible to parallelize the preprocessing across multiple machines. This would require distributing the files to preprocess across several computers.
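A minimal way to do that split deterministically (a hypothetical helper, not part of this repo): every machine runs the same preprocessing script, each with its own rank, and only touches its own shard of the file list.

```python
def shard_files(paths, machine_rank, n_machines):
    """Deterministically assign each file to exactly one machine.

    Sorting first guarantees every machine sees the same ordering,
    so the shards are disjoint and cover all files.
    """
    return [p for i, p in enumerate(sorted(paths))
            if i % n_machines == machine_rank]
```

For example, machine 2 of 4 would call `shard_files(all_las_paths, 2, 4)` and preprocess only that subset; the resulting processed files can then be gathered onto the training machine.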

In any case, if you decide to explore this and find a solution, I would very gladly welcome a PR ! ❤️