MassBank / RMassBank

Playground for experiments on the official http://bioconductor.org/packages/devel/bioc/html/RMassBank.html
Other
12 stars 15 forks source link

Added check whether peaks are in scan range #290

Closed pstahlhofen closed 3 years ago

pstahlhofen commented 3 years ago

All peaks outside the scan range are written to a file called outliers.csv Furhtermore, a warning is generated if some of the peaks are outside the range. This addresses #255

sneumann commented 3 years ago

Hi, can you comment what happens if I have spectrum1 with outliers a,b,c and another spectrum2 with outliers x,y,z ? Will I get one file with "x,y,z,a,b,c" ? I also would recommend a better filename. If one day we have outliers because of intensity (or RT or whatever) then outliers.csv is not specific enough. Or we'd need to encode the reason for the outlying. What about outliers-scanrange ? Also if you print WARNING: There were ... I recommend to put that into a warning().

meowcat commented 3 years ago

I don't think it's ideal to write to a fixed "outliers.csv" without the user's knowledge. Could we handle this instead with the archivename parameter, the same way as for _Failpeaks.csv?

If an archivename is specified, failpeaks are written to archivename_Failpeaks.csv If no archivename is specified, the failpeaks are not written to disk.

https://github.com/MassBank/RMassBank/blob/1ea06508d1736b11070dec945631a08cee57d079/R/leMsMs.r#L1350-L1352

So maybe accept an archivename parameter and write to archivename_outliers.csv?

This is still not super-ideal, because findMsMsHR.mass is a low-level rather than a workspace-level function, and wouldn't usually need to know of such a high-level property. But this has precedent in the cpdID argument which is also not strictly needed in this function.

Then, consider writing to archivename_cpdID_outliers.csv?

@sneumann your opinion?

meowcat commented 3 years ago

https://github.com/MassBank/RMassBank/blob/ccb320b2df64c4daab2e0c61e8bed0aa40aff9ba/R/leMsmsRaw.R#L348-L349

In some older files, I believe mzXML and mzData, probably also mzML that aren't made with ProteoWizard, these two columns are NA . E.g.,

list.files(system.file("spectra.Glucolesquerellin", package = "RMassBankData"), full.names = T)[2] -> t
f <- openMSfile(t)
h <- header(f)
h$scanWindowLowerLimit

Can you check for that and only do your check if those columns are not NA?

pstahlhofen commented 3 years ago

Okay, sorry for doing my minor adjustments after just reading Steffen's first remark about nrow and not looking into the other stuff yet. I will go through this once again and take care of all your suggestions. Currently all the outliers would just be stashed into one file. @sneumann Should I rather create seperate files for different spectra? And @meowcat I agree that using the archive name is sensible.

pstahlhofen commented 3 years ago

After discussing the matter with @meowcat we decided that the additional csv-file for the outliers is not strictly necessary after all. Hence, I simply returned to printing a warning. When writing the log-wrapper functionality, I can add a message for each parsed compound, so the source of the problem could be detected by the order of messages.

meowcat commented 3 years ago

Looks good to me now