Closed lomereiter closed 7 years ago
Hi,
you're right - we will definitely have a look into that by Thursday at the latest. It seems even more pressing when you run:
from IsoSpecPy import IsoSpecPy
from math import exp
from time import time
import numpy as np
formula = "C63H89Co1N14O14P1"
def getConfsNumpy(raw):
    # raw is the (masses, logProbs, configurations) triple from getConfsRaw()
    masses, logProbs, configurations = raw
    rows_no = len(masses)
    # configurations is a flat buffer: rows_no rows of cols_no isotope counts
    cols_no = len(configurations) // len(masses)
    masses = np.array(list(masses))
    logProbs = np.array(list(logProbs))
    configurations = np.array(list(configurations)).reshape((rows_no, cols_no))
    return masses, logProbs, configurations
for x in range(10):
    t0 = time()
    i = IsoSpecPy.IsoSpec.IsoFromFormula(formula, 0.99999)
    t1 = time()
    confs = i.getConfs()
    t2 = time()
    confs_raw = i.getConfsRaw()
    t3 = time()
    confs_np = getConfsNumpy(i.getConfsRaw())
    t4 = time()
    t_parse = t1 - t0
    t_getConfs = t2 - t1
    t_getConfs_raw = t3 - t2
    t_getConfs_np = t4 - t3
    print "parsing: ", t_parse
    print "calculation default: ", t_getConfs, "(", t_getConfs / t_parse * 100, "%)"
    print "calculation raw: ", t_getConfs_raw, "(", t_getConfs_raw / t_parse * 100, "%)"
    print "calculation numpy: ", t_getConfs_np, "(", t_getConfs_np / t_parse * 100, "%)"
    print "---"
i.e. the actual time needed to obtain the calculations is around 1% of the parsing time for small molecules. I also included some code for getting NumPy arrays; it is a bit naive too, but it was the quickest thing we could write.
Thanks for your remark!
Best wishes,
Actually, I think the problem is an embarrassingly slow getConfs() function (slow enough that it seems you thought this is where the isotopic computations happen, while in fact they are performed as part of the IsoFromFormula function). All that the getConfs*() functions do is let the user access the results (previously computed in the C++ backend) through CFFI. getConfsRaw() just returns the raw CFFI object and should be used where performance matters (it works in O(1), IIRC), while getConfs() is a convenience function that wraps the configurations in a nice(-ish) Python structure: a list of tuples. Sadly, it seems that constructing a Python list in Python is about 10x slower than calculating the distribution itself in C++... So yes, if speed matters, please don't use getConfs(); use getConfsRaw() instead.
This is something we should document better... Maybe rename the function to GetConfsSlowAndFancy or something like that...
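To illustrate why the convenience wrapper costs more than handing back the raw object, here is a minimal sketch that does not depend on IsoSpecPy. A flat, row-major buffer of isotope counts (simulated with the standard `array` module as a stand-in for the raw CFFI object) is regrouped into a list of tuples, creating one Python object per row and per entry. The name `flat_to_rows` and the sample numbers are hypothetical, not part of the library.

```python
from array import array


def flat_to_rows(flat, n_rows):
    """Regroup a flat, row-major buffer into a list of n_rows tuples."""
    if n_rows == 0:
        return []
    n_cols, remainder = divmod(len(flat), n_rows)
    if remainder:
        raise ValueError("buffer length is not a multiple of n_rows")
    # One tuple per configuration: this per-row work is what makes a
    # wrapped result O(rows * cols), while returning the buffer is O(1).
    return [tuple(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


# Simulated raw result: 3 configurations with 4 isotope counts each.
flat = array("i", [6, 0, 12, 0,
                   5, 1, 12, 0,
                   6, 0, 11, 1])
rows = flat_to_rows(flat, 3)
```

If the rows are only ever consumed numerically, keeping the flat buffer and reshaping it once avoids building these intermediate tuples.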
A better way to benchmark the speed of just the parsing would be to call
IsoSpecPy.IsoSpec.IsoFromFormula(formula, 0.0)
This returns an empty distribution and basically benchmarks formula parsing (and the construction of marginal distributions which will happen regardless, but this should be fast). Which does prove the point too: there's definitely room for improvement there. @MatteoLacki : can you test it further (and fix the docs)?
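A minimal sketch of how such a parsing-only benchmark could be run with the standard `timeit` module. The helper name `best_time_per_call` is hypothetical, and a stand-in workload is used so the snippet runs without IsoSpecPy installed; the IsoSpecPy call in the comment is the one quoted in this thread.

```python
import timeit


def best_time_per_call(func, number=1000, repeat=5):
    """Return the best-of-`repeat` average time in seconds for one func() call."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    # The minimum is the run least disturbed by other processes.
    return min(totals) / number


# Stand-in workload (an assumption, only here to keep the sketch runnable).
per_call = best_time_per_call(lambda: sum(range(100)), number=100, repeat=3)

# With IsoSpecPy 1.0.x the measured callable would instead be something like:
#   lambda: IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.0)
```

Taking the minimum over several repeats, rather than a single wall-clock reading, makes millisecond-scale timings less sensitive to noise.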
Having said that, it's something we'll have a look into for 2.0, which will feature further speed improvements and will hopefully(*) be out in about a month.
(*) meaning that I hope to force myself to do it by actually making that statement in public ;)
@michalsta thanks for your insight! I ran line_profiler which shows that most of the time is indeed spent on just parsing, which is unsurprising given the number of ffi.string calls it makes...
In [1]: %load_ext line_profiler
In [2]: from IsoSpecPy import IsoSpecPy
In [3]: %lprun -f IsoSpecPy.IsoSpec.IsoFromFormula for _ in range(1000): IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.99999)
Timer unit: 1e-06 s
Total time: 7.81143 s
File: /home/lomereiter/miniconda2/envs/py3/lib/python3.5/site-packages/IsoSpecPy/IsoSpecPy.py
Function: IsoFromFormula at line 283
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   283                                           @staticmethod
   284                                           def IsoFromFormula(formula, cutoff, tabSize = 1000, hashSize = 1000, classId = None, method = 'layered', step = 0.25, trim = True):
   285                                               # It's much easier to just parse it in python than to use the C parsing function
   286                                               # and retrieve back into Python the relevant object sizes
   287      1000        19511     19.5      0.2      symbols = re.findall("\D+", formula)
   288      1000        18550     18.6      0.2      atom_counts = [int(x) for x in re.findall("\d+", formula)]
   289
   290      1000         3209      3.2      0.0      if not len(symbols) == len(atom_counts):
   291                                                   raise ValueError("Invalid formula")
   292
   293      1000         2612      2.6      0.0      indexes = [[x for x in xrange(isoFFI.clib.NUMBER_OF_ISOTOPIC_ENTRIES)
   294                                                               if isoFFI.ffi.string(isoFFI.clib.elem_table_symbol[x]) == symbol.encode('latin1')]
   295      1000      7459601   7459.6     95.5                      for symbol in symbols]
   296
   297      1000         9423      9.4      0.1      if any([len(x) == 0 for x in indexes]):
   298                                                   raise ValueError("Invalid formula")
   299
   300      1000        41988     42.0      0.5      masses = [[isoFFI.clib.elem_table_mass[idx] for idx in idxs] for idxs in indexes]
   301      1000        40348     40.3      0.5      probs = [[isoFFI.clib.elem_table_probability[idx] for idx in idxs] for idxs in indexes]
   302
   303      1000         2805      2.8      0.0      if classId == None:
   304      1000       213386    213.4      2.7          return IsoSpec(atom_counts, masses, probs, cutoff, tabSize, hashSize, step, trim, method)
   305                                               else:
   306                                                   return classId(atom_counts, masses, probs, cutoff, tabSize, hashSize, trim)
OK, it does seem that the culprit is the cffi string method (and a bit of laziness on my part too ;). I've moved things around so that extracting element masses and so on from the periodic table in the C library happens once at module load rather than on every method call, and it works much faster now:
Before:
%time for _ in range(1000): IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.0)
CPU times: user 5.19 s, sys: 1 ms, total: 5.19 s
Wall time: 4.83 s
After:
%time for _ in range(1000): IsoSpecPy.IsoSpec.IsoFromFormula("C100H200O50P5N10", 0.0)
CPU times: user 70 ms, sys: 0 ns, total: 70 ms
Wall time: 65.7 ms
A little bit unexpected, as CFFI is supposed to have practically zero overhead... not in this case, it seems. Oh well. I've pushed out a 1.0.5 to pip including the fix. Thanks for helping us spot this.
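The idea behind the fix can be sketched as follows: scan the element table once at import time, build a symbol-to-indexes dictionary, and let the parser do plain dictionary lookups. This is an illustration under stated assumptions, not the actual 1.0.5 code: `ELEM_TABLE_SYMBOL` is a tiny hypothetical stand-in for the C library's table (one entry per isotope), and `parse_formula` is a hypothetical name.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the C library's element table: one entry per isotope.
ELEM_TABLE_SYMBOL = ["H", "H", "C", "C", "N", "N", "O", "O", "O"]

# Built once at module load: symbol -> indexes of its isotopes in the table.
_by_symbol = defaultdict(list)
for _idx, _sym in enumerate(ELEM_TABLE_SYMBOL):
    _by_symbol[_sym].append(_idx)
SYMBOL_TO_INDEXES = dict(_by_symbol)


def parse_formula(formula):
    """Return (atom_counts, indexes) for a formula such as 'C2H6O1'."""
    symbols = re.findall(r"\D+", formula)
    atom_counts = [int(x) for x in re.findall(r"\d+", formula)]
    if len(symbols) != len(atom_counts):
        raise ValueError("Invalid formula")
    try:
        # Dictionary lookup per symbol instead of a full table scan per symbol.
        indexes = [SYMBOL_TO_INDEXES[s] for s in symbols]
    except KeyError:
        raise ValueError("Invalid formula")
    return atom_counts, indexes
```

With the table scanned only once, each call costs time proportional to the number of distinct symbols in the formula rather than to the size of the isotope table.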
Happy hacking and feel free to contact us if you have any other issues/questions ;)
By the way, seeing as your software seems (at an admittedly short glance) similar to IsoSpec (in that it's a thin Python wrapper using CFFI around a C++11 library), you might want to consider interfacing directly to the C++ library instead of using the Python wrapper. The interface is slightly messy, but we can help with that ;)
OK, the Python package is now within 2x of the C++ version, according to the profiler. Now that's reasonable :)
Hello, I tried to evaluate suitability of this package for my purposes, and it turns out that for small molecules the majority of time is spent on just parsing the formula. I saw the comment in IsoFromFormula, where it justifies not calling the C function and parsing in Python instead, and it appears that maybe it wasn't a good shortcut to take. I used the script below, and it consistently shows that 90% of time is spent on parsing (2-3 milliseconds) and not on the actual computation. I confirm that the calculation itself is impressively speedy just as promised :) (benchmarked it a bit against isotopePattern from our home-grown cpyMSpec package)