TRI-ML / prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)
MIT License

question about vit's fsdp wrapping policy #29

Closed · lukaemon closed this issue 1 month ago

lukaemon commented 1 month ago

In base_vision.py:

# Imports shown here for context; in the repo they live at the top of the module.
from functools import partial
from typing import Callable

from timm.models.vision_transformer import Block, VisionTransformer
from torch.distributed.fsdp.wrap import _module_wrap_policy, _or_policy, transformer_auto_wrap_policy

def get_fsdp_wrapping_policy(self) -> Callable:
    """Return a simple FSDP policy that wraps each ViT block and then the _entire_ featurizer."""
    vit_wrap_policy = partial(_module_wrap_policy, module_classes={VisionTransformer})
    transformer_block_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
    return partial(_or_policy, policies=[vit_wrap_policy, transformer_block_policy])

Since VisionTransformer is a superset of Block (every Block already lives inside the featurizer), why construct an _or_policy over both? Wouldn't wrapping the whole ViT be enough?
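
For context, here is a minimal sketch (not code from the repo) of how the two policies compose: _or_policy matches when any of its sub-policies matches, so every timm Block becomes its own FSDP unit and the full VisionTransformer gets wrapped as well.

# Standalone sketch of the combined policy's behavior, assuming timm's
# VisionTransformer/Block and PyTorch 2.x's (private) FSDP wrap helpers.
from functools import partial

from timm.models.vision_transformer import Block, VisionTransformer
from torch.distributed.fsdp.wrap import _module_wrap_policy, _or_policy, transformer_auto_wrap_policy

vit_wrap_policy = partial(_module_wrap_policy, module_classes={VisionTransformer})
block_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
policy = partial(_or_policy, policies=[vit_wrap_policy, block_policy])

vit = VisionTransformer()  # default ViT-B/16 config

# With recurse=False the policy decides whether to wrap the given module.
# _or_policy returns True if ANY sub-policy matches, so both the featurizer
# itself and each of its transformer blocks become separate FSDP units.
print(policy(module=vit, recurse=False, nonwrapped_numel=0))            # True (matches VisionTransformer)
print(policy(module=vit.blocks[0], recurse=False, nonwrapped_numel=0))  # True (matches Block)

With only the module-level policy, the second call would return False and the whole featurizer would be sharded as one flat FSDP unit.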

siddk commented 1 month ago

You’re totally right — this is a vestige of much earlier experiments where we thought that wrapping ViT blocks independently might shave more off the training memory footprint.

I’ll push a fix soon removing this!
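
For anyone skimming later, the simplified policy would presumably keep just the module-level wrap. A sketch under that assumption (not necessarily the committed fix):

# Hypothetical simplification: drop the vestigial per-Block policy and wrap
# the entire ViT featurizer as a single FSDP unit.
def get_fsdp_wrapping_policy(self) -> Callable:
    """Return an FSDP policy that wraps the _entire_ ViT featurizer."""
    return partial(_module_wrap_policy, module_classes={VisionTransformer})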

lukaemon commented 1 month ago

Thanks for the explanation. 👍