If I am reading the code correctly, in the SSE2 case Faster currently falls back to calling round()/floor() etc. on each individual lane via the fallback macro.
You may be able to use these methods instead: http://dss.stephanierct.com/DevBlog/?p=8
Or Agner Fog has a different method in his vector library: http://www.agner.org/optimize/vectorclass.zip
edit: Agner's functions are slower but can handle floating-point values that don't fit in an i32; the functions from the first link only handle values that do fit in an i32.
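
To make the i32 limitation concrete, here is a rough sketch of a truncate-and-correct floor using only SSE2 intrinsics from `std::arch`. It is not copied from either link and is not Faster's API; the name `floor4_sse2` is made up for illustration. It assumes every lane fits in an i32, and gives wrong results otherwise.

```rust
/// Hypothetical helper: floor of four f32 lanes using only SSE2.
/// Assumes every lane fits in an i32; larger values, NaN and infinity
/// are NOT handled correctly by this approach.
#[cfg(target_arch = "x86_64")]
pub fn floor4_sse2(x: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;

    let mut out = [0.0f32; 4];
    // SAFETY: SSE2 is part of the x86_64 baseline, and both pointers
    // refer to arrays of exactly four f32 values.
    unsafe {
        let v = _mm_loadu_ps(x.as_ptr());
        // Truncate toward zero by converting to i32 and back to f32.
        let t = _mm_cvtepi32_ps(_mm_cvttps_epi32(v));
        // Truncation rounds negative non-integers up, so those lanes
        // end up greater than the input and need a correction of 1.0.
        let too_big = _mm_cmpgt_ps(t, v);
        let fix = _mm_and_ps(too_big, _mm_set1_ps(1.0));
        _mm_storeu_ps(out.as_mut_ptr(), _mm_sub_ps(t, fix));
    }
    out
}

/// Scalar stand-in so the sketch still compiles on other architectures.
#[cfg(not(target_arch = "x86_64"))]
pub fn floor4_sse2(x: [f32; 4]) -> [f32; 4] {
    [x[0].floor(), x[1].floor(), x[2].floor(), x[3].floor()]
}
```

For example, `floor4_sse2([1.5, -1.5, -2.0, -0.5])` gives `[1.0, -2.0, -2.0, -1.0]`. The round trip through i32 is what makes this cheap, and also why it cannot cover the full f32 range.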