Flux diffusion model implementation using quantized fp8 matmul; the remaining layers use faster half-precision accumulation, which is ~2x faster on consumer devices.
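As a rough illustration of what that means (not the repo's actual code): a minimal sketch of a per-tensor-scaled fp8 matmul via `torch._scaled_mm` (a private, underscore-prefixed API), assuming PyTorch >= 2.4 on an Ada/Hopper GPU, plus the flag that enables reduced-precision accumulation for the non-quantized fp16 layers.

```python
import torch

# let fp16 matmuls accumulate in reduced precision (the faster
# "half precision accumulate" path for the non-quantized layers)
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

def quantize_fp8(t: torch.Tensor):
    """Per-tensor symmetric quantization to float8_e4m3fn."""
    scale = t.abs().max().float() / torch.finfo(torch.float8_e4m3fn).max
    return (t / scale).to(torch.float8_e4m3fn), scale

x = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16)

xq, sx = quantize_fp8(x)
wq, sw = quantize_fp8(w)

# _scaled_mm wants its second operand column-major; .t().contiguous().t()
# keeps the same values but flips the memory layout
y = torch._scaled_mm(
    xq, wq.t().contiguous().t(),
    scale_a=sx, scale_b=sw,
    out_dtype=torch.bfloat16,
)
```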
add benchmark numbers for rtx4000ada (non-sff) #14
benchmarks were run using the example from the readme (without the api). no init image was used; using an init image seemed to speed up generation on the rtx4000 at 1024x1024 by ~0.16 it/s (a timing sketch follows the config below).
the config-dev-offload-1-4080.json was used with the following modified keys:
"ae_dtype": "bfloat16", "text_enc_dtype": "bfloat16", "flow_quantization_dtype": "qfloat8", "text_enc_quantization_dtype": "qint4", "ae_quantization_dtype": "qfloat8", "compile_extras": false, "compile_blocks": false, "offload_text_encoder": true, "offload_vae": false, "offload_flow": false
offload_flow=true strangely caused an out-of-memory error when generating a second image.
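not verified against this repo's offload code, but a generic pattern that sometimes avoids OOM between generations: explicitly release cached allocator blocks after moving a module off the GPU.

```python
import gc
import torch

def offload(module: torch.nn.Module) -> None:
    """Move a module to CPU and release the GPU blocks it was using.

    Without empty_cache(), the caching allocator may keep the freed
    memory reserved, so a second generation can still hit an apparent
    out-of-memory error even though the weights are already on CPU.
    """
    module.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()
```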
all in all, the 4090 seems about 2.8x faster than the rtx4000ada (non-sff), which is in line with the difference in power consumption and other hardware specifications.