[Python] large overhead of IsoFromFormula

lomereiter commented 7 years ago

Hello, I tried to evaluate suitability of this package for my purposes, and it turns out that for small molecules the majority of time is spent on just parsing the formula. I saw the comment in IsoFromFormula, where it justifies not calling the C function and parsing in Python instead, and it appears that maybe it wasn't a good shortcut to take.

I used the script below, and it consistently shows that 90% of time is spent on parsing (2-3 milliseconds) and not on the actual computation. I confirm that the calculation itself is impressively speedy just as promised :) (benchmarked it a bit against isotopePattern from our home-grown cpyMSpec package)

from IsoSpecPy import IsoSpecPy
from math import exp
from time import time

formula = "C63H89Co1N14O14P1"

for x in range(10):
    t = time()
    i = IsoSpecPy.IsoSpec.IsoFromFormula(formula, 0.99999)
    t1 = time()
    confs = i.getConfs()
    t2 = time()
    print "parsing: ", t1 - t
    print "calculation: ", t2 - t1, "(", (t2 - t1) / (t2 - t) * 100, "%)"
    print "---"

MatteoLacki commented 7 years ago

Hi,

you're right - we will definitely have a look into that by Thursday at most. It seems even more pressing when you run:

from IsoSpecPy import IsoSpecPy
from math import exp
from time import time
import numpy as np

formula = "C63H89Co1N14O14P1"

def getConfsNumpy(raw):
    masses, logProbs, configurations = raw
    rows_no = len(masses)
    cols_no = len(configurations)/len(masses)
    masses  = np.array(list(masses))
    logProbs= np.array(list(logProbs))
    configurations = np.array(list(configurations)).reshape((rows_no,cols_no))
    return masses, logProbs, configurations

for x in range(10):
    t0 = time()
    i = IsoSpecPy.IsoSpec.IsoFromFormula(formula, 0.99999)
    t1 = time()
    confs = i.getConfs()
    t2 = time()
    confs_raw = i.getConfsRaw()
    t3 = time()
    confs_np  = getConfsNumpy(i.getConfsRaw())
    t4 = time()
    t_parse         = t1 - t0
    t_getConfs      = t2 - t1
    t_getConfs_raw  = t3 - t2
    t_getConfs_np   = t4 - t3
    print "parsing: ", t_parse
    print "calculation default: ", t_getConfs, "(", t_getConfs / t_parse * 100, "%)"
    print "calculation raw:     ", t_getConfs_raw, "(", t_getConfs_raw / t_parse * 100, "%)"
    print "calculation numpy:   ", t_getConfs_np, "(", t_getConfs_np / t_parse * 100, "%)"
    print "---"

i.e. the actual time needed to obtain the calculations is around 1% of the parsing time for small molecules. I also included a code for getting numpy objects, which is also a bit naive, but it was the quickest thing we could get in terms of necessary coding time.

Thanks for your remark!

Best wishes,

michalsta commented 7 years ago

Actually, I think that the problem is an embarrassingly slow GetConfs() function (which is slow enough that it seems you thought that this is where the isotopic computations happen, while in fact they are performed as a part of the IsoFromFormula function). All that the GetConfs*() functions do is allow the user to access the results (previously computed in the C++ backend) through CFFI. GetConfsRaw() just returns the raw CFFI object and should be used where performance matters (works in O(1) IIRC), while GetConfs is a convenience function which wraps the configurations in a nice(-ish) Python structure of list of tuples. Sadly, it seems that constructing Python list in Python is about 10x slower than calculating the distribution itself in C++... So yes, if speed matters, please don't use GetConfs(), use GetConfsRaw() instead.

This is something we should document better... Maybe rename the function to GetConfsSlowAndFancy or something like that...

A better way to benchmark the speed of just the parsing would be to call IsoSpecPy.IsoSpec.IsoFromFormula(formula, 0.0) This returns an empty distribution and basically benchmarks formula parsing (and the construction of marginal distributions which will happen regardless, but this should be fast). Which does prove the point too: there's definitely room for improvement there. @MatteoLacki : can you test it further (and fix the docs)?

Having said that, it's something we'll have a look into for 2.0 (which will feature further speed improvements). Which will hopefully(*) be out in about a month.

(*) meaning that I hope to force myself to do it by actually making that statement in public ;)

lomereiter commented 7 years ago

@michalsta thanks for your insight! I ran line_profiler which shows that most of the time is indeed spent on just parsing, which is unsurprising given the number of ffi.string calls it makes...


In [1]: %load_ext line_profiler

In [2]: from IsoSpecPy import IsoSpecPy

In [3]: %lprun -f IsoSpecPy.IsoSpec.IsoFromFormula for _ in range(1000): IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.99999)
Timer unit: 1e-06 s

Total time: 7.81143 s
File: /home/lomereiter/miniconda2/envs/py3/lib/python3.5/site-packages/IsoSpecPy/IsoSpecPy.py
Function: IsoFromFormula at line 283

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   283                                               @staticmethod
   284                                               def IsoFromFormula(formula, cutoff, tabSize = 1000, hashSize = 1000, classId = None, method = 'layered', step = 0.25, trim = True):
   285                                                   # It's much easier to just parse it in python than to use the C parsing function
   286                                                   # and retrieve back into Python the relevant object sizes
   287      1000        19511     19.5      0.2          symbols = re.findall("\D+", formula)
   288      1000        18550     18.6      0.2          atom_counts = [int(x) for x in re.findall("\d+", formula)]
   289                                           
   290      1000         3209      3.2      0.0          if not len(symbols) == len(atom_counts):
   291                                                       raise ValueError("Invalid formula")
   292                                           
   293      1000         2612      2.6      0.0          indexes = [[x for x in xrange(isoFFI.clib.NUMBER_OF_ISOTOPIC_ENTRIES)
   294                                                                   if isoFFI.ffi.string(isoFFI.clib.elem_table_symbol[x]) == symbol.encode('latin1')]
   295      1000      7459601   7459.6     95.5                      for symbol in symbols]
   296                                           
   297      1000         9423      9.4      0.1          if any([len(x) == 0 for x in indexes]):
   298                                                       raise ValueError("Invalid formula")
   299                                           
   300      1000        41988     42.0      0.5          masses  = [[isoFFI.clib.elem_table_mass[idx] for idx in idxs] for idxs in indexes]
   301      1000        40348     40.3      0.5          probs   = [[isoFFI.clib.elem_table_probability[idx] for idx in idxs] for idxs in indexes]
   302                                           
   303      1000         2805      2.8      0.0          if classId == None:
   304      1000       213386    213.4      2.7              return IsoSpec(atom_counts, masses, probs, cutoff, tabSize, hashSize, step, trim, method)
   305                                                   else:
   306                                                       return classId(atom_counts, masses, probs, cutoff, tabSize, hashSize, trim)

michalsta commented 7 years ago

OK, it does seem that the culprit is the cffi string method (and bit of laziness on my part too ;). I've moved stuff around so that extracting element masses and so on from the periodic table in C library happens once at module load, and not on every method call and it works much faster now:

Before:

%time  for _ in range(1000): IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.0)
CPU times: user 5.19 s, sys: 1 ms, total: 5.19 s
Wall time: 4.83 s

After:

%time  for _ in range(1000): IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.0)
CPU times: user 70 ms, sys: 0 ns, total: 70 ms
Wall time: 65.7 ms

A little bit unexpected, as the CFFI is supposed to have practically zero overhead... not in this case it seems. Oh well. I've pushed out a 1.0.5 to pip including the fix. Thanks for helping us spot this.

Happy hacking and feel free to contact us if you have any other issues/questions ;)

michalsta commented 7 years ago

By the way, seeing as your software seems (at an admittedly short glance) similar to IsoSpec (in the fact that it's a thin Python wrapper using CFFI around a C++11 library) you might want to consider interfacing directly to the C++ library instead of using the Python wrapper. The interface is slightly messy, but we can help with that ;)

lomereiter commented 7 years ago

Ok, Python package is now within 2x of C++ version, according to the profiler. Now that's reasonable :)

MatteoLacki / IsoSpec

[Python] large overhead of IsoFromFormula #1