EddyRivasLab / easel

Sequence analysis library used by Eddy/Rivas lab code

Is eslREADBUFSIZE = 4096 optimal? #51

Closed by Augustin-Zidek 4 years ago

Augustin-Zidek commented 4 years ago

The default eslREADBUFSIZE block size of 4 kB leads to a large number of ftell and fread calls (about <db size> / 4096 calls).
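To put a rough number on that ratio, here is a minimal sketch of the arithmetic. `reads_needed` is a hypothetical helper written for this illustration, not an Easel function, and the example size assumes a 64 GiB file (64 × 2^30 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size reads needed to consume file_bytes bytes when each
 * read requests at most block_bytes bytes (ceiling division).
 * Hypothetical helper for illustration; not part of Easel. */
static uint64_t reads_needed(uint64_t file_bytes, uint64_t block_bytes)
{
    assert(block_bytes > 0);
    return file_bytes / block_bytes + (file_bytes % block_bytes != 0);
}
```

Under that assumption, `reads_needed(64ULL << 30, 4096)` is 16,777,216 reads, while a 64 kB block (`65536`) brings it down to 1,048,576.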

I experimented with a 64 GB database using jackhmmer with these settings:

jackhmmer -A out.a3m --noali --F1 0.0005 --F2 0.00005 --F3 0.000005 --incE 0.001 -E 0.001 --cpu 8 -N 1 test.fasta db.fasta

Here are the times when searching against the database on a local SSD:

Size       Time  Mc/sec
  4 kB ... 4:52  44631.81    <- this is the default
  8 kB ... 4:49  45136.58
 16 kB ... 4:48  45282.81
 32 kB ... 4:33  47722.62
 64 kB ... 4:33  47739.30    <- this has the best performance
128 kB ... 4:38  46836.82
256 kB ... 6:34  33130.65
512 kB ... 6:17  34641.72
  1 MB ... 5:05  42733.90

While my profiling is not very extensive, I think it is worth at least flagging as it could lead to a nice performance win in certain cases.
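For anyone who wants to repeat this kind of comparison on their own hardware, the sketch below reads a stream to the end in blocks of a chosen size and counts the fread calls. It is only an illustration of the block-size effect, not Easel's actual reader: `read_in_blocks` is a hypothetical name, and stdio's own buffering sits underneath fread, so the call count is not a system call count.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read an open stream to EOF in blocks of block_bytes.
 * Stores the number of fread() calls in *ncalls (if non-NULL), including
 * the final short or empty read. Returns total bytes read, or -1 on
 * allocation or read error. Hypothetical helper; not part of Easel. */
static long long read_in_blocks(FILE *fp, size_t block_bytes, long long *ncalls)
{
    assert(fp != NULL && block_bytes > 0);

    char *buf = malloc(block_bytes);
    if (buf == NULL) return -1;

    long long total = 0, calls = 0;
    size_t n;
    do {
        n = fread(buf, 1, block_bytes, fp);
        calls++;
        total += (long long) n;
    } while (n == block_bytes);      /* a short read means EOF or error */

    int failed = ferror(fp);
    free(buf);
    if (ncalls != NULL) *ncalls = calls;
    return failed ? -1 : total;
}
```

Timing a call to this function for each candidate block size on the same file would give a crude I/O-only comparison. It leaves out the search itself, so it cannot reproduce the jackhmmer timings in the table.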

cryptogenomicon commented 4 years ago

This is hardware and system dependent. I've tested the default setting extensively across many systems, and it is a good compromise.

Augustin-Zidek commented 4 years ago

Sounds good, thanks.

tcoates5 commented 1 year ago

A flag for setting this to suit the hardware in use would be quite nice: with AlphaFold, MSA construction is often a large portion of the compute time. On the hardware used by @Augustin-Zidek, that one setting gave a gain of more than 5% (4:52 down to 4:33 at 64 kB, about 6.5% less run time).
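As a rough idea of what such a runtime setting could look like, here is a minimal sketch that takes the buffer size from an environment variable and falls back to 4096. Everything named here is an assumption made for illustration: `READBUF_SIZE`, `readbuf_size_from`, `readbuf_size`, and the 1 GiB cap are invented, and nothing in this thread indicates that Easel or HMMER provide such an option.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define DEFAULT_READBUF_SIZE 4096u          /* the default discussed in this thread */
#define MAX_READBUF_SIZE     (1ULL << 30)   /* arbitrary 1 GiB cap (assumption) */

static_assert(DEFAULT_READBUF_SIZE <= MAX_READBUF_SIZE,
              "default must be within the cap");

/* Parse a buffer size in bytes from text. Falls back to the default when
 * text is NULL, empty, not a plain decimal number, zero, or above the cap.
 * Hypothetical helper; not part of Easel. */
static size_t readbuf_size_from(const char *text)
{
    if (text == NULL || *text == '\0') return DEFAULT_READBUF_SIZE;

    char *end = NULL;
    errno = 0;
    unsigned long long v = strtoull(text, &end, 10);
    if (errno != 0 || *end != '\0' || v == 0 || v > MAX_READBUF_SIZE)
        return DEFAULT_READBUF_SIZE;
    return (size_t) v;
}

/* Hypothetical runtime override: READBUF_SIZE is an invented variable name. */
static size_t readbuf_size(void)
{
    return readbuf_size_from(getenv("READBUF_SIZE"));
}
```

With this sketch, running a program with `READBUF_SIZE=65536` in the environment would select a 64 kB buffer, and any unparsable value would quietly keep the 4 kB default.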