compomics / moFF

A modest Feature Finder (moFF) to extract MS1 intensities from Thermo raw file
Apache License 2.0
33 stars 11 forks source link

mzML Speedup #13

Closed mafreitas closed 7 years ago

mafreitas commented 7 years ago

I have been testing the code with mzML. A significant speedup can be gained by indexing the mxML spectra first and then performing XIC only over those ranges. I can share the code along with a couple other possible issues if you like.

For example in moff.py

   else:
        #mzML work only with --inputraw option
        loc  = raw_name
        rt_list = []
        runid_list = []

        if ('MZML' in raw_name.upper()):
            flag_mzml = True

            run_temp = pymzml.run.Reader(raw_name)

            for spectrum in run_temp:
                if spectrum['ms level'] == 1:
                    rt_list.append(spectrum['scan start time'])
                    runid_list.append(spectrum['id'])

            #print rt_list

and

def pyMZML_xic_out(name, ppmPrecision, minRT, maxRT, MZValue,run, runid_list,rt_list):
    #run = pymzml.run.Reader(name, MS1_Precision=ppmPrecision, MSn_Precision=ppmPrecision)
    timeDependentIntensities = []

    minpos = bisect.bisect_left(rt_list, minRT)
    maxpos = bisect.bisect_left(rt_list, maxRT)

    #print minpos, maxpos, rt_list[minpos], rt_list[maxpos], runid_list[minpos], runid_list[maxpos]

    for specpos in range(minpos,maxpos):
        specid = runid_list[specpos]

        spectrum = run[specid]

    #for spectrum in run:

        if spectrum['scan start time'] > maxRT:
            break
        if spectrum['ms level'] == 1 and spectrum['scan start time'] > minRT and spectrum['scan start time'] < maxRT:
            #print minRT, maxRT, spectrum['scan start time']
            lower_index = bisect.bisect(spectrum.peaks, (float(MZValue - ppmPrecision * MZValue), None))
            upper_index = bisect.bisect(spectrum.peaks, (float(MZValue + ppmPrecision * MZValue), None))
            maxI = 0.0
            for sp in spectrum.peaks[lower_index: upper_index]:
                if sp[1] > maxI:
                    maxI = sp[1]
            if maxI > 0:
                timeDependentIntensities.append([spectrum['scan start time'], maxI])

    if len(timeDependentIntensities) > 5:
        return (pd.DataFrame(timeDependentIntensities, columns=['rt', 'intensity']), 1)
    else:
        return (pd.DataFrame(timeDependentIntensities, columns=['rt', 'intensity']), -1)

also on or around line 315. The incrementor for the variable c is after the continue statement. If it is not an artifact it should be corrected.

        except (IndexError, ValueError, TypeError):
            log.info('peptide at line %i -->  MZ: %4.4f RT: %4.4f ', (offset_index +c +2), row['mz'], time_w)
            log.warning("\t size is not enough to detect the max peak line : %i", c)
            continue
            c += 1
Maux82 commented 7 years ago

Hi, Thank you very much for this code to speed up the XiC computation using pymzML. The idea to store all rt and the spectrum_Id for all he MS1 scans is great, but it would make sense to do this just one time defore the splitting of the of inputdata in different threads.

How does this behaive with the multithreding ? Do you read and save all the scan just one time before the multithreding part or in each thread ?

In the next days I will update the code of if you prefer you can also pull a request with your modification and I will merge it.

mafreitas commented 7 years ago

I am not the best with github, but I did try to create a pull request with a working version of the code that I have modified. It reads the code once before splitting. The speedup is significant. If done correctly, the need to use conditionals to check for MS level and RT window could also be eliminated.

I have tested with multiprocessing using 4 threads and works fine.

Michael Freitas mike.freitas@gmail.com

My Free / Busy Calendar can be found here: http://doodle.com/mikefreitas

Please support my ride to end cancer by donating to Pelotonia using the links below. Pelotonia Rider ID: MF0044 http://pelotonia.org/mikefreitas https://www.mypelotonia.org/donate_cart.jsp?MemberID=151635

On Wed, Jun 21, 2017 at 5:21 PM, Maux82 notifications@github.com wrote:

Hi, Thank you very much for this code to speed up the XiC computation using pymzML. The idea to store all rt and the spectrum_Id for all he MS1 scans is great, but it would make sense to do this just one time defore the splitting of the of inputdata in different threads.

How does this behaive with the multithreding ? Do you read and save all the scan just one time before the multithreding part or in each thread ?

In the next days I will update the code of if you prefer you can also pull a request with your modification and I will merge it.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/compomics/moFF/issues/13#issuecomment-310208866, or mute the thread https://github.com/notifications/unsubscribe-auth/ADTcJRUzvPUKU3XK1l_5yn0yndf1sulnks5sGYlUgaJpZM4OBgf9 .

Maux82 commented 7 years ago

Thanks very much for you contribution.

I have merged your PR in the master branch :)