MannLabs / alphatims

An open-source Python package for efficient accession and visualization of Bruker TimsTOF raw data from the Mann Labs at the Max Planck Institute of Biochemistry.
https://doi.org/10.1016/j.mcpro.2021.100149
Apache License 2.0

np.uint16 sufficient for intensities? #155

Closed liquidcarbon closed 2 years ago

liquidcarbon commented 2 years ago

Hello, do I understand correctly that you are putting intensities into np.uint16, which only goes up to 65535? I believe it should be np.uint32...

https://github.com/MannLabs/alphatims/blob/12b9a137dae8a13a8febebd6d62dde69c3f3e9e6/alphatims/bruker.py#L503
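For context, a minimal illustration (not from the original report) of what a silent overflow would look like if raw values did exceed the uint16 range: `astype` wraps modulo 2**16 rather than raising.

```python
import numpy as np

# np.uint16 holds 0..65535; larger values silently wrap modulo 2**16 on astype
raw = np.array([500, 65535, 70000, 200000], dtype=np.uint32)
clipped = raw.astype(np.uint16)
print(clipped.tolist())  # [500, 65535, 4464, 3392]
```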

Consider this dataset that you can get from UCSD MASSIVE:

import os
import pandas as pd
from ftplib import FTP

# data: ftp://massive.ucsd.edu/MSV000084402/

def start_ftp():
    sample_path = 'MSV000084402/raw/SRM1950_20min_88_01_6950.d'
    ftp = FTP('massive.ucsd.edu')
    ftp.login()
    ftp.cwd(sample_path)  # change into the remote .d folder
    return ftp

ftp = start_ftp()

os.makedirs('lipidomics.d', exist_ok=True)
with open('lipidomics.d/analysis.tdf_bin', "wb") as f:
    ftp.retrbinary("RETR analysis.tdf_bin", f.write)

with open('lipidomics.d/analysis.tdf', "wb") as f:
    ftp.retrbinary("RETR analysis.tdf", f.write)

# pip install opentimspy opentims_bruker_bridge
import opentims_bruker_bridge
from opentimspy.opentims import OpenTIMS

TIMS_FOLDER = 'lipidomics.d'
D = OpenTIMS(TIMS_FOLDER)

frames = pd.DataFrame(D.frames)
ms1_frame_numbers = frames.query("(MsMsType == 0) & (ScanMode != 0)").Id.to_numpy()
print(f'{D}, {len(frames)} frames, {len(ms1_frame_numbers)} MS1 frames')

MIN_INTENSITY = 500
ms1 = (
    pd.DataFrame(D.query(frames=ms1_frame_numbers[:500]))
    .query(f'intensity > {MIN_INTENSITY}')
    .reset_index(drop=True)
)
ms1[ms1.intensity > 65535].shape

Returns 1470 rows. There are definitely intensities in the hundreds of thousands - outside np.uint16 range.

swillems commented 2 years ago

Dear Alexander,

Thanks for checking this! This issue is actually quite complicated.

The short answer: The detector of a TimsTOF Pro is a "multi-channel plate ion detector coupled to a 10-bit digitizer" (https://doi.org/10.1074/mcp.TIR118.000900). A simple interpretation is thus that the intensity measurements of a single detector event are easily represented in a uint16.
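The arithmetic behind that short answer, spelled out: a 10-bit digitizer yields at most 2**10 - 1 = 1023 counts per detector event, far below the uint16 ceiling of 65535, with roughly 64x headroom.

```python
import numpy as np

digitizer_bits = 10
max_single_event = 2**digitizer_bits - 1   # 1023 counts per detector event
uint16_max = np.iinfo(np.uint16).max       # 65535
print(max_single_event, uint16_max, uint16_max // max_single_event)
```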

The long answer: As AlphaTims is provided first-and-foremost as a tool to assess the actual raw data without any (pre)processing, the interpretation of intensity values might differ from other tools.

To illustrate that a uint16 is indeed sufficient to handle the raw data, I ensured the intensities of your sample were retrieved as np.uint32 by modifying lines 433 and 503. This did not actually yield different results.

As a second test, I inserted the following 2 lines in an editable version of AlphaTims between line 290 and line 291 to check the data that is retrieved directly from the "analysis.tdf_bin" in which the actual raw data is stored:

    intensities = buffer[scan_count + 1::2] #alphatims.bruker.line 290
    if np.max(intensities) > 60000:
        print(str(np.max(intensities))+"\n")
    last_scan = len(intensities) - np.sum(scan_indices[1:])  #alphatims.bruker.line 291

Notice that the buffer from line 278 is actually np.uint32 and that the intensities are only downcast to uint16 later. This does not print any large intensities for your MassIVE file MSV000084402, indicating that large values are indeed not physically detected.
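As a general pattern (this is not the actual AlphaTims code; `buffer` here is a synthetic stand-in for the decompressed frame buffer), such a downcast can be guarded so that any out-of-range value fails loudly instead of wrapping silently:

```python
import numpy as np

def downcast_checked(values, dtype=np.uint16):
    """Downcast an integer array, raising if any value exceeds the target range."""
    limit = np.iinfo(dtype).max
    if values.max(initial=0) > limit:
        raise OverflowError(f"max value {values.max()} exceeds {np.dtype(dtype).name} limit {limit}")
    return values.astype(dtype)

buffer = np.array([120, 4800, 60001], dtype=np.uint32)  # stand-in for the raw buffer
intensities = downcast_checked(buffer)
print(intensities.dtype)  # uint16
```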

Without elaborating too much: what you are seeing with other tools (which are all Bruker SDK based) are processed intensities. Especially in the case of detector oversaturation, this is an important step for downstream analysis. At MannLabs we are developing tools that can do similar processing, but this is not (yet) incorporated in AlphaTims. For now I will raise a warning to make users aware of this fact.

Best, Sander

liquidcarbon commented 2 years ago

@swillems thanks for your response and pointing out that paper! Until last night, I thought I was approaching some decent level of understanding of TIMS data, and now I'm back to thoroughly confused. :)

1) I'd like to better understand raw vs. processed intensities. In the same dataset, looking at analysis.tdf, we also see large numbers for intensities. Are the intensities stored in the Frames table not raw either? How/when does this processing happen?

import sqlite3

conn = sqlite3.connect(os.path.join(TIMS_FOLDER, 'analysis.tdf'))
q = '''
SELECT * FROM Frames
ORDER BY MaxIntensity DESC
LIMIT 5
'''
pd.read_sql(q, conn)

image

2) Looking at the actual MaxIntensity numbers, I'm suspicious of saturation -- why would several frames yield precisely the same intensities for the highest peak? I tried both alphatims and opentimspy on one of those near-saturated frames (it's C27H45+ from cholesterol, exact mass 369.351578). The intensities are different (though proportionally different), and m/z's differ by one TOF unit. Which data do I trust?

image

swillems commented 2 years ago

> @swillems thanks for your response and pointing out that paper! Until last night, I thought I was approaching some decent level of understanding of TIMS data, and now I'm back to thoroughly confused. :)

I know the feeling. While this perhaps seems trivial, it is a lot more complex than one might expect. If you really want to know all the fine details, you might want to contact Bruker directly, as I am not directly involved with their software/drivers/embedded PCs/...

> 1. I'd like to understand better raw vs processed intensities. In the same dataset, looking at analysis.tdf we also see large numbers for intensities. Are the intensities stored in the Frames table not raw either? How/when does this processing happen?
>
> conn = sqlite3.connect(os.path.join(TIMS_FOLDER, 'analysis.tdf'))
> q = '''
> SELECT * FROM Frames
> ORDER BY MaxIntensity DESC
> LIMIT 5
> '''
> pd.read_sql(q, conn)
>
> image

Unfortunately, what exactly counts as "raw" data is not properly defined. In any case, the numbers in the Frames table of the analysis.tdf are fully consistent with Bruker's SDK and form the basis for almost all downstream analysis tools.

However, the "binary" data present in the analysis.tdf_bin are indeed not necessarily identical. In AlphaTims, we opted to use the binary data as our starting point rather than the values as obtained from the Bruker SDK.

> 2. Looking at the actual MaxIntensity numbers, I'm suspicious of saturation -- why would several frames yield precisely the same intensities for the highest peak? I tried both alphatims and opentimspy on one of those near-saturated frames (it's C27H45+ from cholesterol, exact mass 369.351578). The intensities are different (though proportionally different), and m/z's differ by one TOF unit. Which data do I trust?
>
> image

I also think there is a lot of saturation. If you look at the accumulation times of those high-intensity frames, you will see that they are a lot shorter than for most other frames; intuitively, I think these might even act as a correction/multiplication factor. Since you mentioned you would like to contribute to AlphaTims, I would very much appreciate a more in-depth analysis or even an implementation to provide more flexible intensity values.

As far as which to "trust": I think that is highly dependent on your application. Currently almost all downstream analysis tools are compatible with the Bruker SDK, and you should have few issues using that as a high-level API. If you would like a more low-level API that purely focuses on speed, accession and visualization of the binary data, AlphaTims should be of great use as well.

liquidcarbon commented 2 years ago

Thank you Sander! This recent paper is an excellent one and it clarified some things for me: https://www.biorxiv.org/content/10.1101/2021.10.18.464737v1.full

Fig. 6 was particularly revealing, and our data looks similar:

image

liquidcarbon commented 2 years ago

What are the best practices in your group on interpreting intensities / feature calling? Do you rely on analysis.tdf_bin values or on Bruker SDK values?

Here's an illustration of why the multiple competing definitions of intensity bother me. Same NIST SRM 1950 lipidomics dataset as before. I extracted a "chromatomobilogram" (XICOM) of the cholesterol + H+ - water peak, over the same window of retention times and ion mobilities. As you can see, the raw data (left) contains several minor features that become much less prominent (almost disappearing) on the right.

image

Code for the plot:

import alphatims.bruker
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from opentimspy.opentims import OpenTIMS

atims = alphatims.bruker.TimsTOF(TIMS_FOLDER)
otims = OpenTIMS(TIMS_FOLDER)

fig, (ax1, ax2) = plt.subplots(figsize=(24,8), ncols=2)
plot_params = dict(
    kind='scatter',
    marker='x',
    lw=1.5,
    s=15,
    alpha=0.8,
    cmap='turbo',
    sharex=False,
)
(
#     atims[4000:,300:800,0,157376:157378]
    atims[9250:9950,360:450,0,157376:157378]
    .sort_values('intensity_values')
    .plot(
        ax=ax1, x='frame_indices', y='scan_indices',
        xlim=(9250, 9950), ylim=(360, 450),
        c='intensity_values',
        **plot_params,
    )
)
(
    pd.DataFrame(otims.query(frames=np.arange(9251,9950,3), columns=('frame','scan','tof','intensity')))
    .query('tof > 157375 & tof < 157378')
    .sort_values('intensity')
    .plot(
        ax=ax2, x='frame', y='scan',
        xlim=(9250, 9950), ylim=(360, 450),
        c='intensity',
        **plot_params,
    )
)
ax1.set_title('AlphaTIMS (analysis.tdf_bin raw intensities)')
ax2.set_title('OpenTIMS (Bruker SDK)')
plt.suptitle('Comparing AlphaTIMS & OpenTIMS XICOMs for $[C_{27}H_{45}]^+$')
plt.show()

UPDATE: as you correctly pointed out, the alphatims-to-opentims scaling factor appears to be the accumulation time, which is available in the Frames SQLite table.
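To make that correction concrete, here is a sketch with synthetic numbers (the DataFrames are stand-ins for the Frames table and a peak table; normalizing to a nominal 100 ms accumulation is an assumption, not necessarily the convention AlphaTims adopted):

```python
import pandas as pd

NOMINAL_ACCUMULATION_MS = 100.0  # assumption: scale intensities to a nominal 100 ms accumulation

# synthetic stand-ins for the Frames SQLite table and a queried peak table
frames = pd.DataFrame({"Id": [1, 2, 3], "AccumulationTime": [100.0, 50.0, 25.0]})
peaks = pd.DataFrame({"frame": [1, 2, 3], "intensity": [1000, 1000, 1000]})

# join each peak to its frame's accumulation time, then rescale
merged = peaks.merge(frames, left_on="frame", right_on="Id")
merged["corrected_intensity"] = (
    merged["intensity"] * NOMINAL_ACCUMULATION_MS / merged["AccumulationTime"]
)
print(merged[["frame", "intensity", "corrected_intensity"]])
```

A frame accumulated for half the nominal time thus gets its intensities doubled, which matches the proportional differences observed between the two tools above.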

swillems commented 2 years ago

I just had a call with Bruker about this, and indeed the accumulation time can/should be used as a correction factor. I'll include it in the next release of AlphaTims. For now, I hope it is not too difficult to correct for manually...

swillems commented 2 years ago

As a quick fix, this is now available on the development branch (install with e.g. pip install "git+https://github.com/MannLabs/alphatims.git@develop#egg=alphatims[plotting]" --upgrade).

In brief: Whenever a pd.DataFrame is returned, it now includes a "corrected_intensity_values" column.

Note that the corrected intensity values cannot be used for slicing and are not displayed in the GUI directly (but they are available in the table view if it is turned on).

swillems commented 2 years ago

Dear @liquidcarbon. This fix has now been merged into the master branch and has been released in version 0.3.1. If this issue is not resolved properly, feel free to reopen this issue.