evanyeyeye / rainbow

Read chromatography and mass spectrometry binary files.
GNU General Public License v3.0
29 stars 15 forks source link

Waters 4-bytes format #1

Closed mdussere closed 8 months ago

mdussere commented 1 year ago

I get a "The 4-bytes format is not supprted" error when trying to read my waters FUNxxx.DAT files.

I have read your doc and code, and I would like to implement some 'parse_funcdat4' function. Do you have any clue ? Did you find some documentation from the constructor or is it pure reverse engineering ?

Thanks in advance and congratulation for the good work. The code is really nice.

ekwan commented 1 year ago

Can you please upload your problematic file? And no, sadly, there's no information about the file formats that's publicly available. My colleague may be able to look at the problem, but he's quite busy so you may need to look into it yourself...

mdussere commented 1 year ago

I have this batch for example

20210920 B-012.raw.zip

mdussere commented 1 year ago

In my example, I have (259 X 3) "pairs" of values of 4 bytes each.

I tried to print the binary code and it looks like the first byte is pretty much constant 10001010 or 10001011. Maybe I'm missing something because I print 28 bits instead of 32 bits

100010100111111001011100101 100010101010001101001001100 100010110101111010111010001
100010101011000000101010001 100010101100000010111100110 100010110010011011000111000
100010101100101111001111000 100010101010100111100111101 100010110100111011000010011
100010101100001001011111011 100010101100100110001111000 100010110110110010001110011
100010101010100101111000001 100010101010100101001100011 100010110100100010100000010
100010101001011111100011101 100010101000101010000000110 100010110010001101011101011
100010101011001010111111011 100010101001001110011000101 100010110011101010111001010
100010101010001100100100100 100010101010100111000011010 100010110011001011100111100
100010101001100000101001111 100010101010101010110111011 100010110011001101010000011
100010101010111111010011000 100010101000010001001000011 100010110001011101101111001
100010101001111001111101100 100010101010000111111001101 100010110101000001101000000
100010101001101011110011000 100010101010000010010101010 100010110100100101000001001
100010101000111000101000000 100010101100001011001010001 100010110110000000110101011
100010101010001000010110111 100010101100010111101111001 100010110101110001011001000
100010101011011000000101101 100010101010011100111001010 100010110101100001111100100
100010101100011100101101011 100010101100011010010011100 100010110101101110101101100
100010101010111011010011111 100010101010101100110101100 100010110101111011011110100
100010101011101000101110111 100010101100100011001000011 100010110101101010001011101
100010101010111111110101101 100010101010110001011100010 100010110101011000111000111
100010101000000001001011110 100010101010011000001010110 100010110101100110000011011
100010101010011001101011111 100010101011101110110000101 100010110101000101101100100
100010101011001111111100111 100010101011101010100010000 100010110010111010100000100
100010100111011100001110000 100010101000100001111000101 100010110100111011110101001
100010101011110101000111011 100010101100001010000000011 100010110110100000100001011
100010101010001100010010010 100010101000111101011110001 100010110110000101100001010
100010101011110011100011001 100010101011100111101111111 100010110110010001111011101
100010101010101101011101100 100010101011011010001110111 100010110110001001011111010
mdussere commented 1 year ago

I would say the "pair" is in fact only the abundance value

Maybe something like baseval 2^(powerval - 20) or baseval 2^(powerval - 8).

Do you have access to waters chromatography software to check the values ?

     100110111010110011010001110
     100110111110110110001010011
     101000100010011101010111101
     101000100100000101100000100
     101000100110101010110110101
     101000101010000001100100000
     101000101001111010110001000
     101000101011100001010111010
     101000101111000000011011010
     101000110001100011011000110
     101000110001011110000011011
     101000110011001100101111011
     101000110010110000011010001
     101000110110011001010000010
     101000111001001101100111010
     101000111101110010100011000
     101000111100010101010011100
     101010100000010010100110100
     101010100001100111000010001
     101010100011000001101000110
     101010100011100001010111101
     101010100100100000100010001
     101010100101101111111001110
     101010100110110101000010101
     101010101000100011110010011
     101010101000111011010110010
     101010101011001111010000100
     101010101011111100000101100
     101010101101111001001010100
     101010101110011001101110001
     101010101110011001001001000
     101010101111011100100001101
     101010110001011001111111111
     101010110000110011011101101
     101010110001101000011100000
     101010110000110100111100011
     101010110010001010100100000
     101010110010000100100011110
     101010110010100110011111101
     101010110100000011100110011
     101010110100101100101100001
     101010110100010110110001011
     101010110011111011100000100
mdussere commented 1 year ago

I tried on another example of data which seems very flat (many full-zero pairs) and I think one matching formula could be: baseval * 8^(powerval - 12)

| powerval +--+      baseval       |
|----------+--+--------------------|
 0000001001 01 11110101000000010010   -> 1003538 * 8^(9-12) = 1960.0
 0000001010 01 00000110001010101011   -> 25259 * 8^(10-12) = 394.7

 0000001000 01 10010110001110110101   -> 615349 * 8^(8-12) = 150.2
 0000000111 01 10010110001110110101   -> 615349 * 8^(7-12) = 18.8

 0000001010 01 01010100111100001100   -> 347916 * 8^(10-12) = 5436.1
 0000001001 01 01111011000100111000   -> 504120 * 8^(9-12) = 984.6

 0000000110 01 11101000100010011011   -> 952475  * 8^(6-12) = 3.6
|----------+--+--------------------|

I think the powerval is applied on a base 8 because when the powerval increase of 1 the baseval seems to decrease by a factor from 3 to 7 (the opposite when decreasing). It could be a base 10 but it would be at odd with the other X-bytes formats baseval * 10^(powerval -10). Base 16 could be also a candidate.

I will try to confirm when I get some screenshots of the chromatograms corresponding to the data I have.

00000010010111110101000000010010
00000010100100000110001010101011
00000010010111010110111001111011
00000010010111110111111011101111
00000010100101001011000001110000
00000000000000000000000000000000
00000000000000000000000000000000
00000010100101000100111000111100
00000000000000000000000000000000
00000010010100000010111000100111
00000010100101010100111100001100
00000010010101111011000100111000
00000000000000000000000000000000
00000001100111101000100010011011
00000010010101010000111011111101
00000000000000000000000000000000
00000000000000000000000000000000
00000000000000000000000000000000
00000001110110110111000101010110
00000001100110110111000101010110
00000000000000000000000000000000
00000010000110010110001110110101
00000001110110010110001110110101
00000000000000000000000000000000
00000010000110000000001100101011
00000010000101011100110111011100
00000010000100111001100010001101
00000000000000000000000000000000
00000000000000000000000000000000
00000010000110001001111101001110
00000001110110001001111101001110
00000000000000000000000000000000
00000010000111000000000010000101
00000000000000000000000000000000
00000010100101000111111010000011
00000010100100111001010010001100
00000000000000000000000000000000
00000010010111000000111100011111
00000000000000000000000000000000
00000000000000000000000000000000
00000000000000000000000000000000
00000000000000000000000000000000
ekwan commented 1 year ago

Great work! Please keep us posted!

mdussere commented 1 year ago

Hi @ekwan,

I think I got the formula right and produced a patch.

(baseval / 1024)  * 2^(powerval-10)

I know its the same thing as baseval * 2^(powerval-20) but I like the concept of integer part / decimal part of the baseval that evolves between 2048 an 1024.

| powerval  + - +  baseval int.dec     |
|-----------+---+----------------------|
 0000001001  0  11111010100  0000010010   -> (2052114 / 1024)  * 2^(9-10) = 1002.001
 1111111111  0  11111111111  1111111111   -> (2097151 / 1024)  * 2^(1023-10) = 1.798e+308

I don't have the rights to push a dev branch on the repo. Can you open me some limited writing rights to push my dev branch.

This is the patch I would like to propose

commit b109b0864d2852ea274b32f7a377cd0c75f326bc
Author: Michael Dussere <michael.dussere@fujitsu.fr>
Date:   Wed Nov 9 22:58:19 2022 +0100

    Add Waters .DAT 4-bytes format parsing

diff --git a/rainbow/waters/masslynx.py b/rainbow/waters/masslynx.py
index 95a9ed5..c2d6681 100644
--- a/rainbow/waters/masslynx.py
+++ b/rainbow/waters/masslynx.py
@@ -107,7 +107,7 @@ def parse_function(path, prec=0, polarity=None, calib=None):
     #     and _FUNC .DAT data format from the _FUNC .IDX file.
     # A "pair" refers to a data pair of mz-intensity or wavelength-absorbance.
     times, pair_counts, bytes_per_pair = parse_funcidx(path[:-3] + 'IDX')
-    if bytes_per_pair not in {2, 6, 8}:
+    if bytes_per_pair not in {2, 4, 6, 8}:
         raise Exception("The {bytes_per_pair}-bytes format is not supported.")

     # Extract the ylabels and data values from the _FUNC .DAT file. 
@@ -116,6 +116,8 @@ def parse_function(path, prec=0, polarity=None, calib=None):
         parse_funcdat = parse_funcdat6
     elif bytes_per_pair == 8:
         parse_funcdat = parse_funcdat8
+    elif bytes_per_pair == 4:
+        parse_funcdat = parse_funcdat4
     ylabels, data = parse_funcdat(path, pair_counts, prec, calib) 

     # Spectra without an assigned polarity always contain UV data.
@@ -201,6 +203,65 @@ def parse_funcdat2(path, pair_counts, prec=0, calib=None):

     return ylabels, data

+
+def parse_funcdat4(path, pair_counts, prec=0, calib=None):
+    """
+    Parses a Waters _FUNC .DAT file with the 4-bytes format.
+
+    This format may contain MS or UV data.
+
+    Learn more about this file format :ref:`here <funcdat6>`.
+
+    Args:
+        path (str): Path to the Waters _FUNC .DAT file.
+        pair_counts (np.ndarray):
+            1D array with the number of data pairs at each retention time.
+        prec (int, optional): Number of decimals to round ylabels.
+        calib (list, optional): Float calibration values of the spectrum.
+
+    Returns:
+        1D numpy array with ylabels. 2D numpy array with data values \
+            where the rows correspond to retention times and \
+            the columns correspond to ylabels.
+
+    """
+    # Extract the mz values from the _FUNCTNS.INF file.
+    # This code makes the assumption that in this format the
+    #     number of mz values is constant at each retention time.
+    inf_path = os.path.join(os.path.dirname(path), '_FUNCTNS.INF')
+    func_index = int(re.findall("\d+", os.path.basename(path))[0]) - 1
+    mzs = parse_funcinf(inf_path)[func_index]
+    ylabels = mzs[mzs != 0.0]
+
+    # Read most significant 4 bytes from each segment into `raw_values`.
+    with open(path, 'rb') as f:
+        raw_bytes = f.read()
+
+    # Calculate the `values` from each 4-byte segment.
+    num_datapairs = np.sum(pair_counts)
+    raw_values = np.ndarray(num_datapairs, '<I', raw_bytes)
+    
+    val_powers = raw_values >> 22             # get the first 10 bits (drop 22 on 32)
+    val_separator = (raw_values >> 21) & 0x1  # get bits at position 11
+    val_bases = raw_values & 0x1FFFFF         # keep the last 21 bits (11-32)
+    val_bases_float = val_bases / 0x400       # out of the 21 bits, the 11 first are the integer part and the 10 last are the "decimal" part
+                                              # bit at position 12 being always '1' it means that the float values
+                                              # is always between 1024 &nd 2048 (factor of 2)
+    min_val_bases = np.min(val_bases[val_bases > 0])
+    max_val_bases = np.max(val_bases)
+    val_base_ampl = max_val_bases  - 0xFFFFF
+
+    values  = val_bases_float * (2. ** np.subtract(val_powers,  10, dtype=np.int32))
+
+    # Note: We have not come across a sample with more than 1 mz value.
+    # This may need to be reshaped differently in the future.
+    data = values.reshape((pair_counts.size, ylabels.size))
+
+    del val_bases, val_powers, raw_values, raw_bytes
+
+    return ylabels, data
+
+
 def parse_funcinf(path):
     """
     Parses a Waters _FUNCTNS.INF file. 
ekwan commented 1 year ago

Cool, great work! Can you please make a pull request? Maybe @evanyeyeye can review?

mdussere commented 1 year ago

For me to make a pull request, I think I must previously create a dev branch and push it to the github origin. I don't have access rights to do that.

ekwan commented 1 year ago

I think the usual way is to fork the repository, make the changes in your fork, and then make a pull request from there.

evanyeyeye commented 1 year ago

Hi @mdussere. Sorry for taking so long to get back to you. Awesome work!

You need access rights to push to the repository, but not to make a pull request. You can follow the official guide here to make a pull request, so that I can credit you for your work.

Please let me know if you have any questions. Thanks for your contribution!

ekwan commented 8 months ago

Closed by PR #10.