Open kco-hereon opened 8 months ago
Hi Rüdiger,
Thanks for your feedback. I would definitely be interested in hearing more about your innovations and potentially incorporating them into the package. It would be nice to add support for the fluorescence data!
Re: Issue 1, This was just a workaround because I couldn't figure out where the length of the stream was encoded. It worked with all the data files I had access to, but I'm not surprised that it didn't generalize well to every file. I would definitely be interested in fixing this if you can tell me how the length of the stream is encoded. It sounds like maybe this discrepancy is due to a difference in the way the fluorometer stream is encoded compared to the PDA stream.
Re: Issue 2. To be honest, I can't remember off the top of my head where the 5 came from here. I am a little short on time right now, but I will try to figure it out and get back to you. In general though I can tell you that pretty much everything in the Shimadzu parser was worked out by trial and error because there is no publicly available information (as far as I'm aware) on how these files are encoded. It seems likely to me that this discrepancy may be due to a difference in how the fluorometry stream and the PDA stream are encoded.
Regarding your comment about the M1 processor, I'm not sure what issue you're running into, but I can tell you that the package is definitely functional on M1 Macs, because I am actually doing most of the development of the package on an M1 Mac. I'm guessing there is some other issue with your miniconda installation that is causing the installation to fail. To be honest, the Python dependencies have been quite a headache and it really makes me wish that reticulate worked more smoothly in the context of a package. Unfortunately, for the Shimadzu LCD parser, the Python bindings are pretty necessary since as far as I know there is no equivalent to olefile in R for handling the OLE files.
Best, Ethan
Hi Ethan, here are the two Python functions I use to get the number of time sections in the PDA raw data.
The file is read in with olefile.OleFileIO.
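# `file` is an open olefile.OleFileIO object in both functions.

def get_nodataset_pda_old(file):
    # Count how often the leading 3-byte marker recurs in the raw PDA stream.
    stream = file.openstream("PDA 3D Raw Data/3D Raw Data").read()
    s = stream[0:3]
    count = 0
    for i in range(len(stream)):
        if s == stream[i:i+3]:
            count += 1
    return count

def get_nodataset_pda(file):
    # Read the count from the <CN> element near the start of the XML-like stream.
    stream = str(file.openstream("PDA 3D Raw Data/3D Data Item").read().decode('utf-8'))[0:1000]
    num = stream[stream.find('<CN')+4:stream.find('</CN')]
    # The posted snippet ended here without returning the value.
    return int(num)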
In the first case, I just use the first bytes of the PDA raw data stream (stream[0:3] in the code, i.e. 3 bytes), which are repeated before each data block, and simply count how often they appear. In my case it was 3564 times.
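The marker count can also be written with bytes.find, which avoids slicing the stream at every offset. This is only a sketch: the function name and the marker_len parameter are illustrative, and the 3-byte marker length is taken from the slicing in the first function. Like the original loop, it counts overlapping matches.

```python
def count_block_markers(stream: bytes, marker_len: int = 3) -> int:
    """Count occurrences (overlapping included) of the stream's leading bytes."""
    marker = stream[:marker_len]
    if not marker:
        return 0
    count = 0
    pos = stream.find(marker)
    while pos != -1:
        count += 1
        # Restart one byte further on so overlapping matches are still counted.
        pos = stream.find(marker, pos + 1)
    return count
```

Whether every match really is a block header is not guaranteed, since the same bytes could occur by chance inside the data.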
Then I screened my data file (mainly by eye) and found the number '3564' in the stream "PDA 3D Raw Data/3D Data Item", which makes perfect sense. This is an XML-type stream similar to the one from which your code extracts the start and end times. Unfortunately, my XML parser cannot read it without an error, although it reads many other XML streams in the same data file without trouble. As a workaround I use a string operation to get the number, but this will fail if the number has a different length. At least this shows where to find it.
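A regular expression is one way to make the string workaround less dependent on exact offsets. The sketch below assumes the element looks like `<CN>3564</CN>`, which is what the slicing in get_nodataset_pda implies; the function name is hypothetical and the real stream may differ.

```python
import re

def extract_dataset_count(raw: bytes, limit: int = 1000) -> int:
    """Return the integer inside the first <CN>...</CN> element of the stream."""
    # Only the start of the stream is searched, as in the original workaround.
    text = raw[:limit].decode("utf-8", errors="replace")
    match = re.search(r"<CN>\s*(\d+)\s*</CN>", text)
    if match is None:
        raise ValueError("no <CN> element with a numeric value found")
    return int(match.group(1))
```

With an olefile object this would be called on file.openstream("PDA 3D Raw Data/3D Data Item").read().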
For the fluorometer data: these instruments are connected to the main Shimadzu instrument through an analog input. The first thing to know is which channel the detector is connected to. Most likely it is an early one; in my case it is Channel 1. The data file contains streams for a large number of channels, but by looking at the size of each stream (many are empty) I found the data in "LSS Raw Data/Chromatogram Ch1"!
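Screening the streams by size can be done programmatically. The helper below is a sketch with a hypothetical name; it only relies on the listdir() and get_size() methods of an open olefile.OleFileIO object and lists the non-empty streams, largest first.

```python
def nonempty_streams(ole):
    """Return (path, size) for every non-empty stream, largest first."""
    found = []
    for parts in ole.listdir():
        # listdir() yields each stream as a list of path components.
        size = ole.get_size(parts)
        if size > 0:
            found.append(("/".join(parts), size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Typical use would be nonempty_streams(olefile.OleFileIO(path)) after importing olefile, with path pointing to an .lcd file.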
I could not find any additional information in the other Channel 1 streams, e.g. the length of the data set, which differs from that of the PDA data set because the instruments work at different frequencies. Luckily, the data format of the stream and its decoding are the same as for the PDA data, so I can use your block decoding scheme. The differences from the PDA are logical: there is only a single data set (the time series of the fluorescence of one excitation/emission channel). While each PDA data set (one per PDA spectrum) consists of two data blocks, the fluorescence data have many more data blocks (in my case 18). But here we do not need a fixed order when reading the data in. I simply loop over all the blocks until the end of the data stream.
For the time axis I am assuming that the start and end times are the same as for the PDA.
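Under that assumption, and the further assumption that the detector samples at a constant rate, the time axis is just an evenly spaced grid between the start and end times. A minimal sketch (the function name is illustrative):

```python
def time_axis(start: float, end: float, n_points: int) -> list:
    """Evenly spaced time values from start to end, both inclusive."""
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    if n_points == 1:
        return [start]
    step = (end - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]
```

Here n_points would be the length of the decoded fluorescence signal, and start and end the times read from the PDA metadata.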
I have not found out how the scale of the values needs to be adjusted. I am getting very large values of up to 10^6 in the peaks, so I divided by 10^6.
I can directly compare the results with the data in the Shimadzu software: I am in the same range but about a factor of 4 too low. The instrument is set to Gain 4, but the difference is not exactly a factor of 4.
Here are my Python functions for this:
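import struct

# Note: twos() (two's-complement decoding of a bit string) is called below
# but was not included in the posted snippet.

def read_shimadzu_fluor_raw(file, n_lambdas=None):
    # `file` is an open olefile.OleFileIO object; n_lambdas is not used.
    pos = 0
    stream = file.openstream("LSS Raw Data/Chromatogram Ch1").read()
    [mat, no_data] = decode_shimadzu_fluor_block(stream, pos)
    return mat, no_data

def decode_shimadzu_fluor_block(fid, pos):
    # Header: number of values at offset 8, block length at offset 12,
    # data segments start at offset 24 (all relative to pos).
    pos = pos + 8
    n_lambda = struct.unpack('<h', fid[pos:pos+2])[0]
    pos = pos + 4
    block_length = struct.unpack('<h', fid[pos:pos+2])[0]
    pos = pos + 12
    signal = [0] * (n_lambda)
    count = 0
    # bufer[0]: running value, bufer[1]: current delta,
    # bufer[2] and bufer[3]: hex strings of the bytes just read
    bufer = [0, 0, 0, 0]
    # Loop over all data segments until the end of the stream.
    while pos < len(fid):
        n_bytes = struct.unpack('<h', fid[pos:pos+2])[0]
        pos = pos + 2
        start = pos
        while pos < start + n_bytes:
            bufer[2] = format(struct.unpack('B', fid[pos:pos+1])[0], '02x')
            # The first hex digit selects the decoding branch.
            # Parsed in base 8 as posted; int() raises ValueError for digits 8 to f.
            hex1 = int(str(bufer[2])[0], 8)
            pos = pos + 1
            if hex1 == 0:
                bufer[1] = int(bufer[2], 16)
            elif hex1 == 1:
                bin1 = format(int(bufer[2], 16), '08b')
                bufer[1] = twos(bin1[4:8])
            elif hex1 > 1:
                no = hex1 // 2  # number of additional bytes to read
                if hex1 > 3:
                    q1 = []
                    for i in range(no):
                        q1.append(format(struct.unpack('B', fid[pos+i:pos+1+i])[0], '02x'))
                    q1 = ''.join(q1)
                    bufer[3] = q1
                else:
                    bufer[3] = format(struct.unpack('B', fid[pos:pos+no])[0], '02x')
                pos = pos + no
                bin1 = bufer[2] + bufer[3]
                bin1 = format(int(bin1, 16), '08b')
                # Drop only the first 2 bits; even hex1 is unsigned, odd is signed.
                if hex1 % 2 == 0:
                    bufer[1] = int(bin1[2:len(bin1)], 2)
                else:
                    bufer[1] = twos(bin1[2:len(bin1)])
            # Values are delta-encoded: add the delta to the running value.
            bufer[0] += bufer[1]
            signal[count] = bufer[0]/1000000
            count += 1
        # Two trailing bytes close each segment; the running value is reset.
        end = struct.unpack('<h', fid[pos:pos+2])[0]
        pos = pos + 2
        bufer[0] = 0
    return signal, len(signal)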
Regarding the cut of the binary string at position 5 when reading the data: look at the maximum byte length of each value in the delta-encoded string. For the PDA data this is only 3; for the fluorometer it is 7! The resulting bit strings for the PDA always have zeros in positions 3 to 5, so cutting at position 5 does not change the integer value of the bit string. This changes when the bit string gets longer. So with your code most fluorometer values are decoded correctly, just not when the bit length is greater than 4 to 5.
Hope this is helpful. Let me know if there are further questions. I am now leaving for some vacation. My next step is to get information about the instrument's calibration factors and spectral libraries. The spectral library is an SPC file; no clue yet how to deal with that!
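The effect of the cut position can be illustrated with two made-up two-byte values. The sketch below mirrors the string handling in decode_shimadzu_fluor_block (hex string to unpadded bit string, then drop the leading bits); payload_after_cut is a hypothetical helper, and twos is only a guess at what the helper of that name does, since it was not posted. A cut of 2 characters corresponds to position 3 in the 1-based R code, a cut of 4 to position 5.

```python
def twos(bits: str) -> int:
    """Interpret a bit string as a two's-complement signed integer."""
    value = int(bits, 2)
    if bits[0] == "1":
        value -= 1 << len(bits)
    return value

def payload_after_cut(hex_bytes: str, cut: int) -> int:
    """Drop the first `cut` bits of the value and read the rest as unsigned."""
    bits = format(int(hex_bytes, 16), "08b")
    return int(bits[cut:], 2)

# Small delta: bits 3 and 4 are zero, so both cuts agree.
small = (payload_after_cut("2005", 2), payload_after_cut("2005", 4))
# Larger delta: bits 3 and 4 carry data, so the longer cut loses it.
large = (payload_after_cut("2c05", 2), payload_after_cut("2c05", 4))
print(small, large)  # (5, 5) (3077, 5)
```

The byte values "2005" and "2c05" are invented for the illustration and do not come from a real file.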
Rüdiger
Thanks Rüdiger -- this looks great! I wonder if you'd be willing to share one or two test files from your instrument? I'm not sure I have an analog stream in any of the files I currently have access to. I'd definitely be interested to hear about what you find if you make any headway with the spectral libraries. Thanks again and I hope you have a nice vacation! Ethan
Hi Ethan, here is one original .lcd file from our pigment measurements and a .mat file with the data my code produces from it (very simplistic!)
Rüdiger
Thanks!
We are also working on parsing the .lcd file from Shimadzu LC-40. Using the above python code, we have extracted LSS Raw Data - Chromatogram ch1 data. Thanks for it! We are now trying to read and parse the peak table and display parameter streams ['Chromatogram Parameters', 'Display Parameter-1-1'], ['LSS Data Processing', 'PT-LC.1.1.AD.2.CH#1']. We have no clue how to proceed with decoding these streams. Any lead would be appreciated!
-Charu
Personally I haven't really looked into these streams too much -- I was mostly interested in being able to extract the data from the DAD detector -- but I would be curious to hear what you figure out.
Here’s the file with all the streams that I got from the OleFile module in Python. If anyone has any idea how to decode it (the peak table stream - ['LSS Data Processing', 'PT-LC.1.1.AD.2.CH#1']), I’d appreciate the help. sample (1).txt
Dear Ethan, Thank you for your pioneering work enabling analysis of chromatography data. I would like to use chromConverter to parse .lcd files from our Shimadzu HPLC such as this file as follows:
data = read_shimadzu_lcd(path, format_out = "data.frame", data_format = "long", read_metadata = TRUE)
However, I get this error: Error in seq_len(n_lambda) : argument must be coercible to non-negative integer
Could you please advise? Thanks, Andy
Hi Andy, Thanks for reporting this. I was able to reproduce the error. I should have time to look into this more later in the week and hopefully track down where the problem is. Will keep you posted. Best, Ethan
Dear Ethan, Thanks so much for your message. I am delighted that you were able to reproduce the error, and my fingers are crossed that there is a straightforward solution. Please let me know what I can do to help. I am a decent R programmer, but understanding the complexities of chromConverter would be challenging for me. That is quite an R package that you wrote! Best, Andy
Hi Andy, I had a look at your file and the PDA stream seems to be empty? What kind of detector does your instrument have? Also do you have a screenshot (or better yet, a text file) you could share showing what the chromatogram is supposed to look like? Ethan
Hi Ethan, Thanks so much for looking into the .lcd file conversion issue. We have a Shimadzu HPLC with refractive index (RID-10A) and UV/VIS (SPD-20A) detectors. Below is an image and description of our machine. Is it unexpected for the .lcd files to have empty PDA streams?
Our HPLC setup: https://github.com/actolonen/Analysis_Lab/tree/main/HPLC
thanks! andy
-- https://www.andrewtolonen.com http://www.andrewtolonen.com
Ahh ok, that makes sense. It's not unexpected if you don't have a PDA detector; it's just that the only parser I've written so far is for the PDA stream. Luckily I think those streams use the same encoding. Does the shape of this chromatogram look right to you? I think there is a scaling factor encoded somewhere in the file -- I'm not yet sure where.
image.png https://github.com/user-attachments/assets/f857db7b-2995-4bd7-98e9-5d61b8f5177a
Do you perhaps have a screenshot of how the two streams (the refractive index and UV) look for the file you shared with me? Or are you expecting two streams? So far I've only been able to find one stream in your file.
Hi Ethan, Yes, that chromatogram profile looks exactly like I would expect! I could get you an .lcd file and the PDF showing the chromatogram and peaks as calculated by LabSolutions. Would that help? best, andy
Yes, that would be great if it's not too much trouble. Also are you expecting there to be more than one data stream in this file?
Hi Andy, I pushed an update to the master branch that should be able to read the 2D chromatograms from your files. Please let me know if you find any issues. I believe there is a scaling factor which I have not yet been able to locate in the files, so the scale of the chromatograms may not be correct. Ethan
Hi Ethan,
I grabbed the new version of chromConverter from GitHub and ran it on a batch of .lcd files from our HPLC. Amazing! The chromatograms produced by read_shimadzu_lcd() now match those from the Shimadzu software:
https://github.com/actolonen/Analysis_Lab/tree/main/HPLC/ChromConverter
Two things:
peak.ratio = max(data.ls$Intensity) / max(data.cc$Intensity)
In my samples, the maximum peak height in the Lab Solutions chromatograms was always 0.3% of that in the chromConverter chromatograms.
best, andy
Wonderful! The scaling factor is encoded somewhere in the file, but I haven't yet been able to figure out where it is. I hope with some more digging I can find where this value is encoded and scale the chromatograms accordingly. In another file I have from another instrument it is 0.1% so the 0.3% scaling factor is not consistent between instruments.
Regarding the two detectors, I suspect that the function should be able to provide the data from both streams, but it would be great if you can update me on that. Also If you could provide me another example file with both data streams that would be great!
Ethan
@actolonen Is there any chance that the signal is actually scaled by a factor of 1/1000 (.001)? I found a field that I think would make sense as the scaling factor, but it would imply that the chromatogram should be scaled by .001 rather than .003. Ethan
Hi Andy,
I just pushed a version with support for reading more of the metadata from LCD files and it also scales chromatograms by what I think is the scaling factor (.001 in your case). You should be able to toggle the scaling off by specifying scale = FALSE.
Ethan
Hi Ethan, Fantastic! I just grabbed chromConverter 0.6.3 from github and can't wait to try it out. I am eager to test it on .lcd files that include data from both our UV and RI detectors, but the person running samples is on vacation. I will let you know how it works. Also, the 0.001 scaling factor makes sense to me. We normalize all the peak areas using standards, so the scaling factor shouldn't be critical so long as it is consistent across samples. best, andy
Hi Ethan, Following our success reading .lcd files with data from a single detector, I got a set of multi-channel HPLC files that contain chromatogram data from the two detectors on our HPLC. Detector A is UV/VIS SPD-20A and Detector B is refractive index RID-10A. Detector A has two channels: channel 1 is at 260 nm and channel 2 is at 210 nm.
Here are the .lcd files: https://github.com/actolonen/Analysis_Lab/tree/main/HPLC/ChromConverter/Files_LabSolutions/Files_aug24
I ran read_shimadzu_lcd() as follows:
data = read_shimadzu_lcd( path = inputfile, format_out = "data.frame", what = "chromatogram", data_format = "long", read_metadata = TRUE);
This gives the following error.
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 3359, 3360
This looks like a simple error that nrows doesn't equal ncols, but I don't know how to troubleshoot this with .lcd files. Could you please advise?
best, andy
Hi Andy,
I'm having trouble reproducing this error with the files you provided. Can you double check what version of chromConverter you're currently running? Also maybe you can run traceback() after the error to see what is precipitating it. The first two chromatograms in all of your files are 3359 rows while the third is 3360 rows, but I am not receiving the error.
Thanks! Ethan
By the way, the new version I'm working on in the dev branch should be faster for reading the Shimadzu LCD files and also has better behavior for handling multiple chromatograms. For example, it will return them as a single data.frame instead of as a list of data.frames when data_format == long.
Also would it be alright with you if I include one of your multi-channel shimadzu files as a test file in my chromConverterExtraTests repository?
Hi Ethan, I confirm that chromConverter works great on our multi-detector .lcd files: https://github.com/actolonen/Analysis_Lab/blob/main/HPLC/ChromConverter/2024.08_test_chromConverter.html
My error just was due to the chromatograms from the different detectors having different numbers of lines.
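The row-count mismatch (3359 vs 3360) only matters when chromatograms are placed side by side as columns. Stacked in long format, traces of different lengths combine without trouble. The Python sketch below illustrates the idea only; it is not how chromConverter is implemented, and the function and detector names are made up.

```python
def stack_long(chromatograms: dict) -> list:
    """Stack {name: (times, intensities)} into (name, time, intensity) rows."""
    rows = []
    for name, (times, intensities) in chromatograms.items():
        # Each trace must be internally consistent, but traces may differ in length.
        if len(times) != len(intensities):
            raise ValueError(f"{name}: times and intensities differ in length")
        rows.extend((name, t, y) for t, y in zip(times, intensities))
    return rows

rows = stack_long({
    "UV_260nm": ([0.0, 0.1, 0.2], [1.0, 2.0, 1.5]),
    "RID": ([0.0, 0.1, 0.2, 0.3], [0.2, 0.4, 0.3, 0.1]),
})
```

The result has one row per data point (7 here), with the detector name as an identifying column.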
I would be delighted if you include one of our multi-channel .lcd files in your chromConverterExtraTests repo.
thanks! andy
Excellent. Thanks Andy!
I still don't understand why the intensities are off. I think the values exported in Shimadzu are being rounded or smoothed somehow but I can't figure out how. It's strange, because the other Shimadzu files I have access to are exact.
Hi Ethan, Just as a quick update: we are routinely using chromConverter to extract chromatograms from .lcd files using our three detectors (RID, UV-210 nm, UV-260 nm). Thanks so much for your great work! The issue of the Lab Solutions peak scaling factor is still obscure. However, we include a set of standard solutions at different concentrations in each plate that we use to quantify compound concentrations. So, my impression is that the scaling factor doesn't matter. Do you agree? best, andy
Hi Andy, Thanks for the update. I'm very glad to hear that you're finding the package useful. And yes, I agree that the scaling factor doesn't really matter for practical purposes. It is still nagging at me a little, but I am pretty stumped for the time being. all best, Ethan
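The reason a constant scaling factor drops out can be checked numerically: if standards and samples are multiplied by the same factor, a linear calibration gives the same concentration. The sketch below uses invented numbers and a plain least-squares line; it assumes a linear detector response and is not taken from either workflow in this thread.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def quantify(sample_area, std_conc, std_area, scale=1.0):
    """Concentration of a sample from standards, with all areas multiplied by scale."""
    slope, intercept = fit_line(std_conc, [a * scale for a in std_area])
    return (sample_area * scale - intercept) / slope

std_conc = [1.0, 2.0, 4.0]        # invented standard concentrations
std_area = [100.0, 200.0, 400.0]  # invented peak areas
unscaled = quantify(250.0, std_conc, std_area)
scaled = quantify(250.0, std_conc, std_area, scale=0.001)
```

Both calls give the same concentration up to floating-point rounding, as long as the same factor applies to every file in the batch.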
Dear Ethan, I tried to use your code for reading the raw data of our Shimadzu HPLC, thanks for that code!
I am not a programmer and I work mainly in Python, not in R. Here are some results from the last few days that my colleague (who uses R) and I spent working on this. I wonder whether you would like to take the issues we found into account in your R code.
I mainly needed to change two things:
your line 147 in read_shimadzu_lcd.R, mat <- matrix(NA, nrow = fsize/(n_lambdas*1.5), ncol = n_lambdas) This is about the size of the data stream, which depends on the number of wavelengths from the PDA and the total time of the HPLC run. A simple factor of 1.5 does not work for my data. Instead, I first scanned the PDA raw data stream for the start bytes of each data set header and counted them. Second, I have now found the entry in a stream that contains the number of data sets, which can simply be read out.
your line 249 in function decode_shimadzu_block: buffer[[2]] <- twos_complement(substr(bin, 5, nchar(bin))), This line cuts off the first 4 bits of the bit string that finally contains the difference to the previous value. It worked this way for my PDA data, but it could not reproduce the fluorometer results at some positions and distorted the signal. It took me some time to understand this, but in the end the function simply fails when the difference is a large number and more bytes are needed to encode it. I simply reduced the cut and now use the bits from position 3 onward. This seems to work! My question here is: did you find the number '5' simply by trial and error, or was there a reason?
If there is interest from your side, I can spend some time describing more details, e.g. where to find the fluorescence data and how to read it, or the file size in the .lcd file. Best Rüdiger