evanyeyeye / rainbow

Read chromatography and mass spectrometry binary files.
GNU General Public License v3.0
28 stars 14 forks source link

Data File from MassHunter ICP-MS #25

Open houriganUCSC opened 3 months ago

houriganUCSC commented 3 months ago

Evan and Eugene,

I was wondering if you all had ever experimented with opening data from an Agilent 7900 or 8900 inductively coupled plasma mass spectrometer. I am trying to access the time-series data from both the pulse and analog detectors for modeling detector linearity.

Thanks for all of your hard work on rainbow-api. What a feat!

For the data files coming off our 8900, we do not have a file called 'MSMassCal.bin'

Instead we have: MSScan_XSpecific.xsd - much like the MSScan.xsd this file likely contains a description of the complex types in the .bin file of the same name

MSScan_XSpecific.bin

MSTS_XSpecific.xml This file contains integer masses, element names, and accumulation times for each isotope.

I think I can modify your code to open these files but I was wonder if you wanted first crack at it. I am happy to provide data files.

Thanks for your work.

ekwan commented 3 months ago

We are working on decoding some QTOF data right now. Seems to be a similar issue. We would welcome your help or files!

houriganUCSC commented 3 months ago

Should I just drop the directory here? I will start with a simple example with only one "tune mode". It turns out that *.d directories can have multiple tune modes in them for analyses that utilize a variety of reaction / collision gases. The attached directory is actually from a GC-ICP-MS system (Agilent 8890 couple to an Agilent 8900 "triple-quadrupole" ICP-MS). There are no reaction gases used in this file. I can send more complex files later

001SMPL.d.zip

ekwan commented 3 months ago

Thanks. We probably won't be able to look at this for some time, but please feel free to make a pull request if you figure it out in the meantime!

houriganUCSC commented 3 months ago

It looks like there are num_times (source: MSTS.xml) entries comprising num_masses (source: MSTS_XSpecific.xml) of the complex data type, "IonRecordType" (source: MSScan_XSpecific.xsd) starting at an offset of 0x48 bytes.

f = open(os.path.join(path, "MSScan_XSpecific.bin"), 'rb')
f.seek(0x48) # start offset
XSpecific_info = np.empty([num_times, num_masses], dtype=object)
    for i in range(num_times):
        for j in range(num_masses):
            # "IonRecordType" is always the name of the root "complex" type.
            result = read_complextype(f, complextypes_dict, "IonRecordType")
            XSpecific_info[i,j] = result
f.close()

Of course, what these data mean is another huge question.

In addition, the MSSCan.xsd file does not have a "UncompressedByteCount" value in the complex data type. So that is prevents the decompression part of the code from working.

houriganUCSC commented 3 months ago

I've figured out the MSProfile.bin structure.

  1. There is no lzw compression. 2 The Offset for each cycle are stored in the SpectrumParameterType as read from MSScan.bin.
  2. The PointCount for each cycle is stored in the SpectrumParameterType as read from MSScan.bin.
  3. Each point yields 4 data items:
    • 4 bytes: Integer(?) mass
    • 8 bytes: double-precision float reported value
    • 8 bytes: double-precision float pulse counter value
    • 8 bytes: double-precision float analog detector value

However, the data are distributed as

I have created a different function that implements this parsing algorithm "parse_icpmsdata(*args)". Once I have this all working cleanly in the format that you all use, I will submit a pull request.

ekwan commented 3 months ago

Wow, cool! Please also include some documentation and tests for this. I have a feeling this format generalizes behind your particular instrument. Thank you for your hard work!

On Wed, May 22, 2024 at 8:24 PM houriganUCSC @.***> wrote:

I've figured out the MSProfile.bin structure.

  1. There is no lzw compression. 2 The Offset for each cycle are stored in the SpectrumParameterType as read from MSScan.bin.
  2. The PointCount for each cycle is stored in the SpectrumParameterType as read from MSScan.bin.
  3. Each point yields 4 data items:

    • 4 bytes: Integer(?) mass
    • 8 bytes: double-precision float reported value
    • 8 bytes: double-precision float pulse counter value
    • 8 bytes: double-precision float analog detector value

However, the data are distributed as

  • Masses: PointCount * 4 bytes
  • Reported values: PointCount * 8 bytes
  • Pulse values: PointCount * 8 bytes
  • Analog values: PointCount * 8 bytes

I have created a different function that implements this parsing algorithm "parse_icpmsdata(*args)". Once I have this all working cleanly in the format that you all use, I will submit a pull request.

— Reply to this email directly, view it on GitHub https://github.com/evanyeyeye/rainbow/issues/25#issuecomment-2125982682, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABOBNIRSRO7HYLPTUWVQKLDZDUZMDAVCNFSM6AAAAABH7AMLF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDCMRVHE4DENRYGI . You are receiving this because you commented.Message ID: @.***>

houriganUCSC commented 3 months ago

There are definitely some additional details to be worked out. The files that I have been testing come from an Agilent 7700 single quadrupole ICP-MS and an Agilent 8900 triple quadrupole ICP-MS with a collision reaction cell. I am working in files in time-resolved analysis mode with only one measurement per isotope. The data can be collected with multiple channels per peak. I will look at that today. In addition, a single *.d file can contain data from one or more tunes. Here a "tune" refers to different instrument settings related to working with a collision / reaction gas in the cell between the two mass-resolving quadrupoles. Each tune file can have a different set of isotopes (mz). I will work on this stuff today.

Here's the logic that I added to masshunter.py to detect ICP-MS files

    acqdata_files = set(os.listdir(acqdata_path))
    if {"MSTS.xml", "MSScan.xsd", "MSScan.bin"} <= acqdata_files:
        if "MSScan_XSpecific.bin" in acqdata_files:
            datafiles.append(parse_icpmsdata(acqdata_path, prec))
        elif "MSProfile.bin" in acqdata_files:
            datafiles.append(parse_msdata(acqdata_path, prec))

Here is the current parser. It would probably be cleaner to embed the logic above in parse_msdata. I haven't converted the parsed data into the DataFile object yet. I think most folks would just ignore the raw pulse and analog values and read in the reported values (rep_vals) which are the ones that appear in the csv file.

My specific interest in parsing these files is to understand the coevolution of the raw pulse and analog values during analytical session. We use these data to model drift and non-linear response of our secondary electron multiplier detectors.

def parse_icpmsdata(path, prec=0):
    """
    Parses Masshunter MS data.

    IMPORTANT: This method only supports parsing MSProfile.bin.  It has not been tested for multiple tune modes or 
    multiple measurements per mz

    The following files are used (in order of listing):
        - MSTS.xml -> Number of retention times.
        - MSTS_XSpecifc.xml -> Number of masses
        - MSScan.xsd -> File structure of MSScan.bin.
        - MSScan.bin -> Offsets and PointCounts info for MSProfile.bin.
        - MSScan_XSpecific.xsd -> File structure of MSScan_XSpecific.bin
        - MSSCan_XSpecific.bin -> Integer masses and detector modes for scans
        - MSProfile.bin -> Actual data values.

    Learn more about this file format :ref:`here <hrms>`.

    Args:
        path (str): Path to the AcqData subdirectory.
        prec (int, optional): Number of decimals to round mz values.

    Returns:
        DataFile containing Masshunter MS data.

    """
    # MSTS.xml: Extract number of retention times.
    # In future work, this step could potentially be skipped by reading
    #    MSScan.bin until EOF and counting the offsets.
    # This would remove reliance on MSTS.xml being in the directory.
    tree = etree.parse(os.path.join(path, "MSTS.xml"))
    root = tree.getroot()
    num_times = 0
    for time_seg in root.findall("TimeSegment"):
        num_times += int(time_seg.find("NumOfScans").text)

    # MSTS_XSpecific.xml: Extract number of masses.
    tree = etree.parse(os.path.join(path, "MSTS_XSpecific.xml"))
    root = tree.getroot()
    num_masses = 0
    for ir in root.findall("IonRecord"):
        for mass in ir.findall("Masses"):
            num_masses += 1

    # MSScan.xsd: Extract the file structure of MSScan.bin. 
    # There are "simple" types can be directly translated into number types.
    # But there are "complex" types that are made up of other
    #     "simple" and "complex" types.
    # These are stored in a dictionary to enable recursive parsing. 
    tree = etree.parse(os.path.join(path, "MSScan.xsd"))
    root = tree.getroot()
    namespace = tree.xpath('namespace-uri(.)') 
    complextypes_dict = {}
    for complextype in root.findall(f"{{{namespace}}}complexType"):
        innertypes = []
        for element in complextype[0].findall(f"{{{namespace}}}element"):
            innertypes.append((element.get('name'), element.get('type')))
        complextypes_dict[complextype.get('name')] = innertypes

    # MSScan.bin: Extract information about MSProfile.bin.
    # For each retention time, this includes:
    #   - the retention time itself
    #   - number of masses recorded at the retention time
    #   - starting offset of data in MSProfile.bin
    #   - length in bytes of the uncompressed data

    # Future work could determine if the data blocks for each retention time
    # Unlike msdata, icpms data is not lzw compressed

    f = open(os.path.join(path, "MSScan.bin"), 'rb')
    f.seek(0x58) # start offset
    f.seek(struct.unpack('<I', f.read(4))[0])
    data_info = np.empty(num_times, dtype=object)
    for i in range(num_times):
        # "ScanRecordType" is always the name of the root "complex" type.
        scan_info = read_complextype(f, complextypes_dict, "ScanRecordType")
        spectrum_info = scan_info['SpectrumParamValues']
        data_info[i] = (
            scan_info['ScanTime'],
            spectrum_info['PointCount'],
            spectrum_info['SpectrumOffset'],
            spectrum_info['ByteCount'])
    f.close()

    # MSScan_XSpecific.xsd: Extract the file structure of MSScan.bin.
    # There are "simple" types can be directly translated into number types.
    # But there are "complex" types that are made up of other
    #     "simple" and "complex" types.
    # These are stored in a dictionary to enable recursive parsing.
    tree = etree.parse(os.path.join(path, "MSScan_XSpecific.xsd"))
    root = tree.getroot()
    namespace = tree.xpath('namespace-uri(.)')
    complextypes_dict = {}
    for complextype in root.findall(f"{{{namespace}}}complexType"):
        innertypes = []
        for element in complextype[0].findall(f"{{{namespace}}}element"):
            innertypes.append((element.get('name'), element.get('type')))
        complextypes_dict[complextype.get('name')] = innertypes

    # MSScan_XSpecific.bin: Extract information about XValues and IonDetectorMode
    # For for each retention time, each mass includes:
    #   - XValue integer mass for PA Factor
    #   - IonDetectorMode: detector configuration - 1 = P or A, 2 Forced Analog
    f = open(os.path.join(path, "MSScan_XSpecific.bin"), 'rb')
    f.seek(0x48) # start offset
    XSpecific_info = np.empty([num_times, num_masses], dtype=object)
    for i in range(num_times):
        for j in range(num_masses):
        # "ScanRecordType" is always the name of the root "complex" type.
            result = read_complextype(f, complextypes_dict, "IonRecordType")
            XSpecific_info[i,j] = result
    f.close()

    # MSProfile.bin: Extract the data values.
    # The raw bytes are uncompressed
    # This code is demostrate to work for files from both the Agillent 7700
    # (single quad) and the Agilent 8900 (triple quadrupole).  Note: that the test
    # files have only one tune mode (reaction gas) and are TRA data with one sample per
    # mz. Other files may have more tune modes and / or more samples per mz.
    # Future work necessary to sort out these complexities.
    f = open(os.path.join(path, "MSProfile.bin"), 'rb')
    times = np.empty(num_times)
    num_mz_per_time = np.empty(num_times)
    mz_list = []
    masses = np.empty([num_times, num_masses], dtype=int)
    rep_vals = np.empty([num_times, num_masses], dtype=float)
    pul_vals = np.empty([num_times, num_masses], dtype=float)
    ana_vals = np.empty([num_times, num_masses], dtype=float)
    for i in range(num_times):
        time, num_mz, offset, comp_len = data_info[i]
        times[i] = time
        num_mz_per_time[i] = num_mz

        # Decompress the bytes for the current retention time.
        f.seek(offset)
        # Read num_mz of 4 byte mass value
        masses[i,:] = [val[0] for val in struct.iter_unpack('<I', f.read(num_mz * 4))]

        # read num_mz of double-precision float reported values
        rep_vals[i,:] = [val[0] for val in struct.iter_unpack('<d', f.read(num_mz * 8))]

        # read num_mz of double-precision float raw pulse count values (note: -1 indicates over-range value)
        pul_vals[i,:] = [val[0] for val in struct.iter_unpack('<d', f.read(num_mz * 8))]

        # read num_mz of double-precision float analog values.  If the pulse counter is over-range, the analog value
        #  is multiplied by the PAReference value in the MethPAFactor.xml for that mz value to produce
        # and extrapolated analog-detector intensity measurement in counts per second. 
        ana_vals[i,:] = [val[0] for val in struct.iter_unpack('<d', f.read(num_mz * 8))]
houriganUCSC commented 1 week ago

Evan and Eugene,

I have completed a version of the code that I am going to add to my detector cross-calibration modeling software. I am adding you two to the copyrights in my code:

"Copyright (c) 2022, Evan Shi and Eugene Kwan. Rainbow-api, Revision 0cb50254"

In addition I am adding a shout out to you both in the acknowledgments of a Journal of Analytical Atomic Spectrometry technical note that I am submitting next week. I'd like to give you more credit in the manuscript; the easiest way to do this would be to reference an archive of the rainbow-api code at Zenodo which provides a DOI for each code submission. Let me know if that is something that you are interested in.

Thanks for all your hard work

ekwan commented 1 week ago

Thank you! That all sounds very reasonable and nice!

On Sun, Aug 11, 2024 at 12:22 AM houriganUCSC @.***> wrote:

Evan and Eugene,

I have completed a version of the code that I am going to add to my detector cross-calibration modeling software. I am adding you two to the copyrights in my code:

"Copyright (c) 2022, Evan Shi and Eugene Kwan. Rainbow-api, Revision 0cb5025 https://github.com/evanyeyeye/rainbow/commit/0cb50254b0043a4d14bdce8e42d548ce89eaf5d8 "

In addition I am adding a shout out to you both in the acknowledgments of a Journal of Analytical Atomic Spectrometry technical note that I am submitting next week. I'd like to give you more credit in the manuscript; the easiest way to do this would be to reference an archive of the rainbow-api code at Zenodo which provides a DOI for each code submission. Let me know if that is something that you are interested in.

Thanks for all your hard work

— Reply to this email directly, view it on GitHub https://github.com/evanyeyeye/rainbow/issues/25#issuecomment-2282299299, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABOBNIWN2NFJJFZ5UII56F3ZQ2HBNAVCNFSM6AAAAABH7AMLF6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEOBSGI4TSMRZHE . You are receiving this because you commented.Message ID: @.***>