Update for StableVideoDiffusionPipeline

I've updated the code to be compatible with StableVideoDiffusionPipeline. It now handles the 5-dimensional input and applies token merging to the temporal attention.
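
To illustrate the idea, here is a minimal NumPy sketch, not the actual code from this update. It assumes the 5-dimensional input is laid out as (batch, frames, channels, height, width) and that temporal attention treats each spatial location as its own sequence of frame tokens. The helper names `to_temporal_tokens` and `merge_temporal_tokens` are hypothetical, and the merge is a simplified bipartite matching that averages the `r` most similar frame tokens into their best match.

```python
import numpy as np

def to_temporal_tokens(video):
    """Reshape (batch, frames, channels, height, width) into
    (batch * height * width, frames, channels), so that each spatial
    location becomes one sequence of frame tokens (assumed layout)."""
    b, f, c, h, w = video.shape
    return video.transpose(0, 3, 4, 1, 2).reshape(b * h * w, f, c)

def merge_temporal_tokens(tokens, r):
    """Simplified bipartite merge along the frame axis.

    Even-indexed frames form the source set and odd-indexed frames the
    destination set. The r source tokens most similar to a destination
    token are averaged into it. Frame order is not preserved in the
    output and there is no unmerge step; this is a sketch only.
    """
    n = tokens.shape[0]
    src, dst = tokens[:, 0::2], tokens[:, 1::2]
    r = min(r, src.shape[1])
    if r <= 0 or dst.shape[1] == 0:
        return tokens

    # Cosine similarity between every source and destination token.
    src_n = src / (np.linalg.norm(src, axis=-1, keepdims=True) + 1e-8)
    dst_n = dst / (np.linalg.norm(dst, axis=-1, keepdims=True) + 1e-8)
    sim = src_n @ dst_n.transpose(0, 2, 1)      # (n, n_src, n_dst)

    best_dst = sim.argmax(axis=-1)              # best match per source token
    best_sim = sim.max(axis=-1)
    order = np.argsort(-best_sim, axis=-1)      # most similar first
    merged_idx = order[:, :r]                   # source tokens to merge away
    kept_idx = np.sort(order[:, r:], axis=-1)   # source tokens to keep

    rows = np.arange(n)[:, None]
    target = best_dst[rows, merged_idx]         # (n, r) destination indices

    # Accumulate merged sources into their destinations, then average.
    out_dst = dst.astype(float)                 # astype returns a copy
    counts = np.ones((n, dst.shape[1], 1))
    np.add.at(out_dst, (rows, target), src[rows, merged_idx])
    np.add.at(counts, (rows, target), 1.0)
    out_dst /= counts

    return np.concatenate([src[rows, kept_idx], out_dst], axis=1)
```

For example, a `(2, 14, 8, 4, 4)` input becomes 32 sequences of 14 frame tokens, and merging with `r=3` leaves 11 tokens per sequence for the temporal attention to process.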

Temporal attention is not a major performance bottleneck, but it contains a lot of redundancy, and I wanted to see whether token merging would work there.