PengchengShi1220 / NexToU

NexToU: Efficient Topology-Aware U-Net for Medical Image Segmentation
Apache License 2.0

Understanding patch processing in EViG modules #7

Closed danielIonita2022 closed 7 months ago

danielIonita2022 commented 7 months ago

Hello, Thank you for sharing the code.

While reading the paper, I understood that the Vision Graph module first splits an image into patches that are then treated as nodes in a graph. However, studying both the original ViG code and this repository, I couldn't figure out where the patches were created. Since this architecture is based on the nnU-Net framework, I saw patches mentioned in the data loader, but as far as I can tell that patch size is only used to crop the original image; the image is never split into multiple patches that are then sent forward through the network.

The only other place where I thought the splitting into patches could happen is in the position embedding, via the n parameter, but that only represents the number of voxels in the image (for 3D).

In the end, as I see from the code, the Grapher module treats each voxel as a node, finding its k nearest neighboring voxels based on their features, which I don't think is as useful as the patch approach discussed in both papers.

If you could please help me understand how these patches are processed or if there are any at all, it would help my research a lot! Thank you again and sorry for bothering!

PengchengShi1220 commented 7 months ago

Hello @danielIonita2022,

Thank you for your insightful question. It's great to see such deep engagement with the code and the underlying concepts.

Your observation about the handling of patches in the Vision Graph (ViG) module is astute. In ViG, the image is "split into patches" by the stem operation. This is conceptually similar to the OverlapPatchEmbed used in PVT, which you can examine in the PVT code. In these operations, strided convolutions are applied, for example a kernel of size 7 with stride 4, followed by three kernels of size 3, two with stride 2 and one with stride 1. Together these achieve a downsampling effect, which is consistent with the operations in ViG.
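To make the downsampling concrete, here is a minimal sketch (not code from this repository) that traces the spatial size through the kernel/stride schedule described above, assuming each convolution uses padding of kernel_size // 2, which is a common choice in such stems:

```python
# Illustrative sketch: tracing the spatial downsampling produced by a
# ViG/PVT-style convolutional stem. Assumes padding = kernel_size // 2.

def conv_out(size, kernel, stride, padding):
    """Standard conv output-size formula: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

def stem_trace(size, layers):
    """Apply a list of (kernel, stride) convs and record each spatial size."""
    sizes = [size]
    for k, s in layers:
        size = conv_out(size, k, s, padding=k // 2)
        sizes.append(size)
    return sizes

# Kernel/stride schedule described above: 7/4, then 3/2, 3/2, 3/1.
layers = [(7, 4), (3, 2), (3, 2), (3, 1)]
print(stem_trace(224, layers))  # [224, 56, 28, 14, 14]
```

The overall stride is 4 x 2 x 2 = 16, so each 16 x 16 region of the input ends up as one spatial position of the feature map, i.e. one "patch" token, without any explicit patch-splitting step.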

The patches in the context of these models should be understood as portions of the input image at each stage, with the embedding dimensions corresponding to the number of channels. Each stage has its distinct patch embedding.

Regarding your specific query about the Grapher module's treatment of voxels as nodes and the application of k-nearest neighbors based on features, your understanding is correct.
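A framework-free sketch of that idea (hypothetical code, not the repository's implementation, which operates on dense feature tensors): each voxel carries a feature vector, and its graph neighbors are the k voxels whose features are closest.

```python
# Illustrative sketch: building graph edges by treating each voxel as a
# node and connecting it to its k nearest neighbors in *feature space*,
# which is what the Grapher module does conceptually.
import math

def knn_edges(features, k):
    """features: list of per-voxel feature vectors. For each voxel i,
    return the indices of its k nearest neighbors by Euclidean feature
    distance (excluding i itself)."""
    edges = []
    for i, fi in enumerate(features):
        dists = [(math.dist(fi, fj), j)
                 for j, fj in enumerate(features) if j != i]
        dists.sort()
        edges.append([j for _, j in dists[:k]])
    return edges

# Four "voxels" with 2-D features; voxels 0/1 and 2/3 have similar features.
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(knn_edges(feats, k=1))  # [[1], [0], [3], [2]]
```

Note that neighbors are chosen by feature similarity, not spatial adjacency, so distant voxels with similar features can become connected.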

If you have further questions or need more clarification, feel free to ask.

Best, Pengcheng

danielIonita2022 commented 7 months ago

Thank you for the quick and detailed response!

I believe this clears up some of my confusion regarding the patches; all that remains is for me to take a deeper dive into the code to understand precisely what you described. I will reply again in the coming days if I get stuck; otherwise, this issue can be closed.

Best regards, Daniel Ionita