bovee / entab

* -> TSV
MIT License

bug in agilent .uv parser #28

Closed: ethanbass closed this issue 2 years ago

ethanbass commented 2 years ago

I was investigating the UV parser more and I think there are still some problems. For example, I was trying to import a UV file from my lab and it looks pretty good for about the first 15 minutes, but then the baseline starts going all over the place. Any idea what might be going on? I'm attaching a picture of the entab-imported file in black and the CSV I exported from ChemStation in blue. [image: entab import (black) overlaid with the ChemStation CSV export (blue)]

The example file that ships with entab doesn't look too good either: [image: plot of the bundled example file]

Below is the code to reproduce what I did in R. You can find the file I tried to convert and the CSV version here https://cornell.box.com/v/example-DAD-files . Thanks! Ethan

library(entab)
path <- "~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/ETHAN_01_19_21 2021-01-20 00-27-52/679.D/dad1.uv"
r <- as.data.frame(Reader(path))
# pivot the long (time, wavelength, intensity) data to wide format
ch.entab <- data.frame(tidyr::pivot_wider(r, id_cols = "time",
                        names_from = "wavelength", values_from = "intensity"))

ch.csv <- read.csv("~/Library/CloudStorage/Box-Box/kessler-data/lactuca/botrytis_experiment/data/lettuce_roots/export3D/EXPORT3D_ETHAN_01_19_21 2021-01-20 00-27-52/679.CSV",
                   row.names = 1, header=TRUE,
                   fileEncoding="utf-16",check.names = FALSE)
par(mfrow=c(1,1))
matplot(ch.entab$time, ch.entab[, "X280"], type = "l", ylim = c(-100, 800))
matplot(ch.entab$time, ch.csv[, "280.00000"], type = "l", add = TRUE, lty = 2, col = "blue")
abline(v = 15, col = "red", lty = 3)

library("tidyverse")
example_file <- as.data.frame(Reader("~/entab/entab/tests/data/carotenoid_extract.d/dad1.uv"))
df <- data.frame(tidyr::pivot_wider(example_file, id_cols = "time",
                 names_from = "wavelength", values_from = "intensity"))
matplot(df$time, df$X280, type = "l")
ethanbass commented 2 years ago

I tried the Aston parser. It works beautifully! [image: chromatogram parsed with Aston]

bovee commented 2 years ago

I'm not sure if this was the issue (I haven't checked the graphs yet), but there's definitely a bug where it was pulling an unsigned int instead of a signed one (fixed in 7b751f51b5327fba1c3781ff1d28ead7fa37d760). I vaguely remember a bug like this happening in Aston a long time ago too, so it's possible there's still something else.
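The kind of signed/unsigned mix-up described above can be illustrated with a minimal Python sketch (the field width and byte order here are illustrative assumptions, not entab's actual file layout): reading a little-endian 32-bit field with the unsigned format code turns a small negative detector value into an enormous positive one, which would throw a baseline wildly off.

```python
import struct

# A detector value of -5 counts, stored as a little-endian 32-bit signed int.
raw = struct.pack("<i", -5)

as_signed = struct.unpack("<i", raw)[0]    # correct interpretation
as_unsigned = struct.unpack("<I", raw)[0]  # the buggy interpretation

print(as_signed)    # -5
print(as_unsigned)  # 4294967291
```

Because the misread only changes values that happen to be negative, a trace parsed this way can look fine wherever the signal stays positive and then jump around once the baseline dips below zero.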

ethanbass commented 2 years ago

Thanks for looking into this. Your example file now seems to be reading correctly, but my file 679.D still has the crazy shifting baseline in both versions (CLI and entab-R). Also, in the R version there seems to be a newly introduced bug where some values that appear to be retention times are making it into the wavelength column (this doesn't happen in the CLI version).

Also, I don't have benchmarks, but it seems like something you did slowed down the R version considerably. I'm not sure if this could be related to the retention times appearing with the wavelengths. The slowdown only seems to affect the chemstation UV parser. The masshunter parser, for example, is working beautifully from what I can tell.

bovee commented 2 years ago

I think the R slowness/bad data is unrelated to the UV parsing, but might be from 622c0363ee1b0332240b10846426c00570e7e9a2 ? It's extremely weird.

Thank you for the UV data BTW! I took a quick look and I think there are still two things going on:

  1. The values between Aston and Entab start the same, but go off track after the first record so there's a parsing bug around file lengths I'll try to track down.

  2. Both of their values are (very slightly) different from the CSV. I think there's a multiplier or offset in the header that they need to be corrected by?
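A header-derived correction of the sort guessed at in point 2 might look like the following Python sketch. The constants and function name here are hypothetical placeholders for illustration, not values or APIs from an actual .uv header:

```python
# Hypothetical: binary chromatography formats often store intensities as
# raw integer counts that must be rescaled by a multiplier (and possibly
# shifted by an offset) read from the file header. These constants are
# made up for illustration only.
HEADER_MULTIPLIER = 1 / 2048   # hypothetical scale factor from the header
HEADER_OFFSET = 0.0            # hypothetical baseline offset

def correct_intensity(raw_count: int) -> float:
    """Convert a raw integer count to corrected intensity units."""
    return raw_count * HEADER_MULTIPLIER + HEADER_OFFSET

print(correct_intensity(2048))   # 1.0
print(correct_intensity(-1024))  # -0.5
```

A missing correction like this would explain values that are consistently slightly off from the vendor's CSV export while still having the right overall shape.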

bovee commented 2 years ago

I refactored the UV parser a bit in 14059d2acb181b1fe6cd09e0f661bca303934392 and I think both of these issues should be fixed (and there should be metadata available on these files now).

I'm still not sure what's happening with the R bindings, but I can futz with it. You might also try deleting the current ones before reinstalling?

ethanbass commented 2 years ago

Awesome, this is great!!! I tried removing the R package before reinstalling as you suggested, and it seems to have helped dramatically with the speed. This also seems to have improved the issue I mentioned with retention times appearing in the wavelengths column (about 9/10 times). The weird part (!?) is that this behavior still happens about one tenth of the time if I repeatedly run the Reader on the same file. :monocle_face: (This seems to be independent of the file used.) Also, I'm pretty confident that the speed issue is somewhat related to this behavior: it runs much slower on the runs where it ends up producing the wrong values.

ethanbass commented 2 years ago

Also, re: metadata, I'm not quite sure what kind of metadata there should be, or how to access it?

bovee commented 2 years ago

I opened a new bug (#29) for the retention time crossover issue to track that on its own since it's weird and I don't fully understand it.

Some of the file parsers read additional metadata (e.g. sample name, operator name, etc.) if the file contains it and I've figured out the format; you can access it with the -m flag on the CLI or, in R, with Reader(path)$metadata().

ethanbass commented 2 years ago

sounds good!!