This is a long and hard one, but well worth it now that we have SSE support: vectorization. It's a hard topic and an area of active research, but SSE does a few things for us that we can (ab)use to gain performance.
Since the instructions can do integer arithmetic in packed form, we can use them to do exactly the same work in fewer instructions; an example is shown in this Wikipedia article.
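To make the "fewer instructions" point concrete, here's a rough sketch in C of what packed integer addition looks like. It assumes the target has SSE2 (the packed 32-bit integer add on 128-bit registers is an SSE2 instruction) and a compiler that ships `<emmintrin.h>`; the function name `add_i32` is made up for illustration and isn't from our code base. When SSE2 isn't available it just falls back to the plain loop.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics (assumed available on the target) */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for n 32-bit integers,
 * with wrap-around on overflow in both the packed and scalar paths. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;

#if defined(__SSE2__)
    /* A 128-bit register holds four 32-bit lanes, so a single packed add
     * does the work of four scalar adds. Unaligned loads/stores are used
     * so the caller doesn't have to guarantee 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
#endif

    /* Scalar tail for the leftover elements; also the whole loop when
     * SSE2 isn't available. Unsigned arithmetic keeps the wrap-around
     * behaviour consistent with the packed add. */
    for (; i < n; ++i)
        out[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```

For six elements this does one packed add for the first four and two scalar adds for the tail. An automatic vectorizer would have to do this transformation itself, plus prove that it's legal for the loop in question.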
I did some digging on this a while ago as well: LLVM's Polly tool has a lot of useful publications about polyhedral optimizations (which are used for vectorization).
Wikipedia gives a nice overview.
There's quite a lot of information about automatic vectorization out there. However, I haven't found any platform-specific papers, so actually applying the vectorization is the part that will most likely be difficult.