ethanbass / chromConverter

Parsers for chromatography data in R (HPLC-DAD/UV, GC-FID, MS)
https://ethanbass.github.io/chromConverter/
GNU General Public License v3.0
25 stars 3 forks source link

Reading shimadzu .lcd files #29

Open kco-hereon opened 4 months ago

kco-hereon commented 4 months ago

Dear Ethan, I tried to use your code for reading the raw data of our Shimadzu HPLC, thanks for that code!

I am not a programmer and I am mainly in Python and not in R. Here are some results from our (mine and my colleaque using R) last days working on this, I wonder whether you would like to include the issues we found for your R code.

  1. We are both using a new Mac with a M1 chip: the standard installation, some dependencies in Miniconda are not provided for this processor architecture! So, we could not use your code directly but did also not spend time to figure out the details of which dependency is failing. My colleaque used an older PC instead.
  2. Instead I am trying to translate the codes for reading shimadzu-.lcd files in Python. This code is now (most likely) working for my data and I can also extract the data of our fluorometer in addition to that of the PDA.

I needed to change mainly two things:

  1. your line 147 in read_shimadzu_lcd.R, mat <- matrix(NA, nrow = fsize/(n_lambdas*1.5), ncol = n_lambdas) This is about the size of the data stream which depends on the number of wavelength from the PDA and the total time of the HPLC run. A simple factor 1.5 does not work for my data. Instead, I first scan the PDA raw data stream for the start bits of each header of the data set and sum them up. Second, I now found the entry in a stream that contains the number of datasets and can simply be read out.

  2. your line 249 in function decode_shimadzu_block: buffer[[2]] <- twos_complement(substr(bin, 5, nchar(bin))), This line cuts off the first 4 bits of the bit string that finally contains the number of the difference to the former value. It worked this way for my PDA data, but could not reproduce the results of the fluoremeter at some positions and distorted the signal. I needed some time to understand this but at the end the funstion simply failed when the value for the difference is a large number and mpre bytes are needed to decode it. At the end I simple reduced the cut and are using the bits from position 3. This seemed to work! My question here is: did you find the number '5' simply by trial and error, or was there a reason?

If there is interest from your side, I can spend some time to described more details, e.g. where to find the fluorescence data and how to read it or the file size in the .lcd file. Best Rüdiger

ethanbass commented 4 months ago

Hi Rüdiger,

Thanks for your feedback. I would definitely be interested in hearing more about your innovations and potentially incorporating them into the package. It would be nice to add support for the fluorescence data!

Re: Issue 1, This was just a workaround because I couldn't figure out where the length of the stream was encoded. It worked with all the data files I had access to, but I'm not surprised that it didn't generalize well to every file. I would definitely be interested in fixing this if you can tell me how the length of the stream is encoded. It sounds like maybe this discrepancy is due to a difference in the way the fluorometer stream is encoded compared to the PDA stream.

Re: Issue 2. To be honest, I can't remember off the top of my head where the 5 came from here. I am a little short on time right now, but I will try to figure it out and get back to you. In general though I can tell you that pretty much everything in the Shimadzu parser was worked out by trial and error because there is no publicly available information (as far as I'm aware) on how these files are encoded. It seems likely to me that this discrepancy may be due to a difference in how the fluorometry stream and the PDA stream are encoded.

Regarding your comment about the M1 processor, I'm not sure what issue you're running into, but I can tell you that the package is definitely functional on M1 macs, because I am actually doing most of the development of the package on an M1 mac. I'm guessing there is some other issue with your miniconda installation that is causing the installation to fail. To be honest, the python dependencies have been quite a headache and it really makes me wish that reticulate worked more smoothly in the context of a package. Unfortunately, for the Shimadzu LCD parser, the python bindings are pretty necessary since as far as I know there is no equivalent to olefile in R for handling the OLE files.

Best, Ethan

kco-hereon commented 4 months ago

Hi Ethan, here are the two Python function I can use to get the number of time sections in the PDA raw data!

the file is read in with olefile.OleFileIO !

def get_nodataset_pda_old(file):
    stream = file.openstream("PDA 3D Raw Data/3D Raw Data").read()
    s=stream[0:3]
    count=0
    for i in range(len(stream)):
        if s==stream[i:i+3]:
            count +=1
    return count

def get_nodataset_pda(file):
    stream = str(file.openstream("PDA 3D Raw Data/3D Data Item").read().decode('utf-8'))[0:1000]
    num=stream[stream.find('<CN')+4:stream.find('</CN')]

In the first case, I just use the first 4 bytes of the PDA Raw Data stream which is repeated before each data block, and simply count how often it appears. In my case it was 3564 times.

Then I screen my data file (mainly by eye) and found the number '3564' in the stream "PDA 3D Raw Data/3D Data Item" which makes perfectly sense. This is an XML-type of stream similar to the one from which your code extracts the start and endtime. Unfortunately, I can not read it without an error with my XML-parser, which I can easily for many other xml streams in the same data file. I simply made a workaround and use a string operation to get the number, but this will fail in case the number would have a different length. But from this you know where to find it.

For the data of the fluorometer: these instruments are connected in an analog way to the main Shimadszu instrument. The first thing to know is at which channel it is connected to. Most likely its an early one, like in may case its Channel 1. When screening the streams of the data file there are several streams for a high number of channels, but looking on the size of each stream (many are empty) I found the data in "LSS Raw Data/Chromatogram Ch1"!

I could not find any additional information in other Channel 1 streams, e.g. for the length of the data set, which is different from the PDA data set as the instruments works with a different frequency. Luckily, the data format of the stream and its decoding is the same than for the PDA data. I can use your block decoding scheme. The differences to the PDA are logical: its only a single data set (the time series of the fluorescence of one excitation/emission channel). While each data set of the PDA data for each PDA spectrum consists of two data block, the fluorescence data have much more data blocks (in my case 18). But here we do not need a fixed order when reading the data in. I simply use a loop over all the blocks til the end of the data stream.

For the time axis I am assuming that the start and end times are the same than for the PDA.

I have not find out how the scale of the values need to be adjusted. I am getting very large values of up to 10^6 in the peaks, so I divided by 10^6.

I can directly compare the results with the data in the Shimadzu software and I am in the same range but about a factor 4 too low, while the setting of the instrument is at Gain 4, but its not exactly factor 4.

However here are my python functions for this:

def read_shimadzu_fluor_raw(file, n_lambdas=None):
    pos=0
    stream = file.openstream("LSS Raw Data/Chromatogram Ch1").read()
    [mat, no_data] = decode_shimadzu_fluor_block(stream,pos)
    return mat, no_data

def decode_shimadzu_fluor_block(fid, pos):
    pos=pos+8
    n_lambda = struct.unpack('<h', fid[pos:pos+2])[0]
    pos=pos+4
    block_length = struct.unpack('<h', fid[pos:pos+2])[0]
    #print(n_lambda, block_length)
    pos=pos+12
    signal = [0] * (n_lambda)
    count = 0
    bufer = [0, 0, 0, 0]
    while pos<len(fid):
        n_bytes = struct.unpack('<h', fid[pos:pos+2])[0]
        #print(n_bytes)
        pos=pos+2
        start = pos
        #print('nbytes',n_bytes)
        while pos < start + n_bytes:
            bufer[2] = format(struct.unpack('B', fid[pos:pos+1])[0], '02x')
            hex1 = int(str(bufer[2])[0],8)
            pos=pos+1
            if hex1 == 0:
                bufer[1] = int(bufer[2],16)
            elif hex1 == 1:
                bin1 = format(int(bufer[2], 16),'08b')
                bufer[1] = twos(bin1[4:8])
            elif hex1 > 1:
                no=hex1 // 2
                if hex1>3:
                    q1=[]
                    for  i in range(no):
                        q1.append(format(struct.unpack('B', fid[pos+i:pos+1+i])[0],'02x'))
                    q1=''.join(q1)
                    bufer[3]=q1
                else:
                    bufer[3] = format(struct.unpack('B', fid[pos:pos+no])[0], '02x')
                    #print('test',count,hex1,bufer)
                pos=pos+no
                bin1 = bufer[2]+bufer[3]
                #print(count, hex1,bin1)
                bin1=format(int(bin1,16),'08b')
                #print(bin1)

                if hex1 % 2 == 0:
                    bufer[1] = int(bin1[2:len(bin1)], 2)   
                else:
                    bufer[1] = twos(bin1[2:len(bin1)])    

            bufer[0] += bufer[1]
            signal[count] = bufer[0]/1000000   
            count += 1
        end = struct.unpack('<h', fid[pos:pos+2])[0]
        #print(end,pos+2)
        pos=pos+2
        bufer[0] = 0
    return signal, len(signal)

For the cutting of the binary string at position 5 when reading the data: When looking on the maximum byte length of each data value in the delta-encode string: in case of the PDA data, this is only 3, in case of the fluorometer this is 7! The resulting bit-strings for the PDA have always zeros in position 3->5, i.e. cutting at position 5 does not change the integer value of this bit string. This situation changes when the bit length gets longer. So, in case of your code most values of the fluorometer are decoded correctly, just not when the bit length is >4 to 5.

Hope this is helpful. Let me know when there are further questions. I am now stopping for some vacation. My next step is to get information about the instruments calibration factors and spectral libraries. The spectral library is a SPC file, no clue yet how to deal with that!

Rüdiger

ethanbass commented 4 months ago

Thanks Rüdiger -- this looks great! I wonder if you'd be willing to share one or two test files from your instrument? I'm not sure I have an analog stream in any of the files I currently have access to. I'd definitely be interested to hear about what you find if you make any headway with the spectral libraries. Thanks again and I hope you have a nice vacation! Ethan

kco-hereon commented 4 months ago

Archiv.zip

Hi Ethan, here are one original .lcd file from our pigment measurements and a .mat file with the data my code produces from it (very simplistic!)

Rüdiger

ethanbass commented 4 months ago

Thanks!

charumeenah commented 1 month ago

We are also working on parsing the .lcd file from Shimadzu LC-40. Using the above python code, we have extracted LSS Raw Data - Chromatogram ch1 data. Thanks for it! We are now trying to read and parse the peak table and display parameter streams ['Chromatogram Parameters', 'Display Parameter-1-1'], ['LSS Data Processing', 'PT-LC.1.1.AD.2.CH#1']. We have no clue how to proceed with decoding these streams. Any lead would be appreciated!

-Charu

ethanbass commented 1 month ago

Personally I haven't really looked into these streams too much -- I was mostly interested in being able to extract the data from the DAD detector -- but I would curious to hear what you figure out.

charumeenah commented 1 month ago

Here’s the file with all streams that I have got from OleFile module in Python. In case if anyone has any idea on how to decode it (the peak table stream - ['LSS Data Processing', 'PT-LC.1.1.AD.2.CH#1']), I’d appreciate the help. sample (1).txt