MassBank / RMassBank

Playground for experiments on the official http://bioconductor.org/packages/devel/bioc/html/RMassBank.html
Other
12 stars 15 forks source link

Processing of tentatives/unknowns #42

Closed sneumann closed 1 year ago

sneumann commented 10 years ago

Issue by schymane from Monday Oct 28, 2013 at 13:42 GMT Originally opened as https://github.com/sneumann/RMassBank/issues/34


This relies on other discussions to set the tagging. TBC.

schymane commented 9 years ago

Talk about this after BioC release

schymane commented 9 years ago

Erik and I also discussed this today, a summary: Michele and I both feel that MetFrag in RMassBank is out of the scope, certainly for the moment. More fundamental things that need to be addressed first to enable RMassBank to even create tentative/unknown spectra suitable for upload to MassBank. Assuming the user knows why they want the spectra and how they identified it we need the right comment fields, the correct way to represent the results, what do we do without structures, how to skip the recalibration and formula annotation without a formula available ... (but still do this if we do have a formula but not structure). Case 1: we have a structure but it's not certain => this is only a comment in the MassBank record Case 2: we aren't exactly sure about the structure but we have a formula => can still perform most aspects of the workflow Case 3: we have an exact mass with a cool spectrum that we want to share with the world, but we can't identify it Strategy: RMassBank works off a compound list where the user gives SMILES (best case) or for tentatives/unknowns either a formula or mass along with a comment field and RMassBank takes this information and prepares the records as best it can (note, we probably need to deal with mixed lists...) Once we have the records there is absolutely nothing stopping anyone doing MetFrag on them ... we actually almost have all the bits and pieces ready but this is a different connection/scope and in my opinion the job of another package and not RMassBank at the moment.

ermueller commented 8 years ago

So, this is theoretically finished, but it needs cleanup and stuff like that. Docs are already in there, but this is all subject to change, since I already typed up a list for myself on where to clean up some code.

You'll all get a mail today with everything, I'll just test this a bit more and then type up some instructions.

Also, add a vignette for tentatives/unknowns? I feel that we have to. But I'd make that a separate issue.

EDIT: Just to clarify, I'll push the "unclean" version for now - the cleanup stuff would take quite some time.

schymane commented 8 years ago

Great! I agree that we need a vignette and that it should be a separate issue. Look forward to testing it ;)

ermueller commented 8 years ago

I hope you all read this, since my uni mailserver half-crashed 5 minutes ago:

I forgot the attachment in my mail, but you can install the package from git from my master branch. https://github.com/ermueller/RMassBank

schymane commented 8 years ago

Minor feedback to your email point 3: @ermueller a) If you want to process unknowns you need an additional column "m/z" with your target m/z. => Michele and I would be very happy if you take "mz" instead of "m/z" to avoid ugly conversion issues downstream.

ermueller commented 8 years ago

No problem, I'll rework that. Can both be valid? :)

Also, I'll probably autodetect what you can do with a certain compound list and give the user an output when the list is loaded. Something like "This compoundlist can be used for the following retrieval methods: standard, tentative"

Also, I'll save that somewhere internally and make workflow retrieval methods unable if the compoundlist loaded is not appropriate. Sounds good?

schymane commented 8 years ago

I’m just trying to wrap my head around your way of building it, because I envisaged doing it all together. Two conceptually challenging bits: We’ll have to transfer the recalibration to the tentatives and unknowns from the standard workflow We have the problem that we now have several runs for one dataset and this is a teeny tiny problem at the record generation stage because the list file is overwritten, see #145. I’m out of time for now but can get you more feedback tonight or tomorrow.

ermueller commented 8 years ago

1) Transfer recalibration for unknowns (without annotation)? How would you calculate (and adjust) dppm for peaks if there are no formulas? Recalibration should work for tentatives.

2) Several runs for one dataset? I'm not sure I understand. Different in what respect? I can def program a multiprocessing in for pretty much every option, as long as the resulting records would have a different accession.

EDIT: Oh yeah, now I get it, you want the "level" column added and used for every compound differently, right? I wanted to do that, too - just wanted a more general solution first. I'm incredibly sorry that it turned out to be a hassle for you. :( I'll try my best to implement that asap.

schymane commented 8 years ago

Yeah so I was envisaging that you read out from the compound list what the case is and just do what you can (smiles = normal workflow, no smiles but formula = tentative, no smiles+no formula= unknown. No smiles, no formula, no mz = error). Because I have these all mixed together in my one compound list and each one has a unique ID, so there are no accession clashes. Whether they are actually processed one after the other (normal, then tentative, then unknown) or not is actually irrelevant (to the end user at least - Michele agreed that it makes sense to process separately). The recalibration function can be calculated on the "knowns". This can then be used to adjust the masses in the subsequent tentatives and/or unknowns - this must be possible because even the masses of the fail peaks can be adjusted with the calibration. We don't want to recalculate the recalibration on tentatives (and we can't recalculate it on unknowns, like you said). But, I still like the idea that one has to define explicitly whether to include tentatives or unknowns. The default workflow should certainly stay with the knowns (so the normal user gets an error if smiles are missing, for example). From the user point of view, it's much easier if it's all bunched together in one compound list, one file list (the files are organised by machine runs and e.g. for us are mixed knowns, tentatives and unknowns) and I get one infolist and failpeak list in the end to check, not 3. Of course for the unknown we'd have to have no fail peaks... right? Because they'd all fail. Does this make sense?

ermueller commented 8 years ago

I still like the idea that one has to define explicitly whether to include tentatives or unknowns.

If we're defining explicitely:

You could do the level numbers thing or you could write "tentative" or "unknown" in the last column of a compound list. This has the advantage that in a level 3 run where SMILES is present BUT tentative a COMMENT could be auto-added that says "Tentative spectrum".

Minor detail: If it's not the level numbers but "known", "tentative" or "unknown" (easier to understand for the user), what should the column name be?

Writing an auto-detect would also be no big deal, really, but then we couldn't do the level 3 autocomment since we wouldn't know whether the SMILES is correct or just tentative.

Whether they are actually processed one after the other (normal, then tentative, then unknown) or not is actually irrelevant (to the end user at least - Michele agreed that it makes sense to process separately).

It definitely makes sense, but RMassBank is not structured to do several runs. Making the 3 runs separate is a solution that would work, but it'd probably take a bit longer and in the end would be just as complicated to program. I'd rather process on a compound by compound basis, since it does that already - I already changed quite a few internal functions to be able to discern between known, tentative and unknown, so I might as well walk the whole nine yards and do that for the rest as well.

I can undo the added "retrieval" parameter to msmsWorkflow, mbWorkflow and msmsRead and everything works internally. Then the only change for the user would be in the compound list.

The recalibration function can be calculated on the "knowns".

Ahhhh, that makes sense. But, tentative with SMILES has to be explicitely stated then. Can't get around that?

tsufz commented 8 years ago

I also like the idea to tag the entries with the known, tentative and unknown tags. Might is possible to use both? Levels and tags? The levels are more advanced, but the tags are a 'quick and dirty' thing. Don't overload the users with requirements.

I agree with Erik to process in parallel if possible. I dunno what are the performance differences of separate and parallel processing. But from users view it is more reasonable to follow only one loop in the frontend. In the backend a stacked processing might be reasonable.

schymane commented 8 years ago

I already have the levels in my compound list and also already that level 3 case you mention :). I see two possibilities: you leave the user free to type what they want, then the records are a mess because it's just text. Or we specify exact words and levels and build the real comment behind the scenes, so the user could enter either 3 or tentative, for example. We'd have confirmed or reference standard (level 1), probable library (2a), probable diagnostic (2b), tentative (3), formula (4) or unknown (5). What do you think? The wording for the exact comment is a bit longer...and my problem is that I sub-categorised level 3 sometimes too. 3a structure, 3b isomer, 3c substance class, 3d best match? But this is not written anywhere. 3a only if smiles are given.


From: emueller [notifications@github.com] Sent: Thursday, 7 January 2016 8:44 PM To: MassBank/RMassBank Cc: Schymanski, Emma Subject: Re: [RMassBank] Processing of tentatives/unknowns (#42)

I still like the idea that one has to define explicitly whether to include tentatives or unknowns.

If we're defining explicitely:

You could do the level numbers thing or you could write "tentative" or "unknown" in the last column of a compound list. This has the advantage that in a level 3 run where SMILES is present BUT tentative a COMMENT could be auto-added that says "Tentative spectrum".

Minor detail: If it's not the level numbers but "known", "tentative" or "unknown" (easier to understand for the user), what should the column name be?

Writing an auto-detect would also be no big deal, really, but then we couldn't do the level 3 autocomment since we wouldn't know whether the SMILES is correct or just tentative.

Whether they are actually processed one after the other (normal, then tentative, then unknown) or not is actually irrelevant (to the end user at least - Michele agreed that it makes sense to process separately).

It definitely makes sense, but RMassBank is not structured to do several runs. Making the 3 runs separate is a solution that would work, but it'd probably take a bit longer and in the end would be just as complicated to program. I'd rather process on a compound by compound basis, since it does that already - I already changed quite a few internal functions to be able to discern between known, tentative and unknown, so I might as well walk the whole nine yards and do that for the rest as well.

I can undo the added "retrieval" parameter to msmsWorkflow, mbWorkflow and msmsRead and everything works internally. Then the only change for the user would be in the compound list.

The recalibration function can be calculated on the "knowns".

Ahhhh, that makes sense. But, tentative with SMILES has to be explicitely stated then. Can't get around that?

� Reply to this email directly or view it on GitHubhttps://github.com/MassBank/RMassBank/issues/42#issuecomment-169786024.

uchem-massbank commented 8 years ago

Column name 'Confidence'?


From: schymane [notifications@github.com] Sent: Thursday, 7 January 2016 10:12 PM To: MassBank/RMassBank Cc: massbank Subject: Re: [RMassBank] Processing of tentatives/unknowns (#42)

I already have the levels in my compound list and also already that level 3 case you mention :). I see two possibilities: you leave the user free to type what they want, then the records are a mess because it's just text. Or we specify exact words and levels and build the real comment behind the scenes, so the user could enter either 3 or tentative, for example. We'd have confirmed or reference standard (level 1), probable library (2a), probable diagnostic (2b), tentative (3), formula (4) or unknown (5). What do you think? The wording for the exact comment is a bit longer...and my problem is that I sub-categorised level 3 sometimes too. 3a structure, 3b isomer, 3c substance class, 3d best match? But this is not written anywhere. 3a only if smiles are given.


From: emueller [notifications@github.com] Sent: Thursday, 7 January 2016 8:44 PM To: MassBank/RMassBank Cc: Schymanski, Emma Subject: Re: [RMassBank] Processing of tentatives/unknowns (#42)

I still like the idea that one has to define explicitly whether to include tentatives or unknowns.

If we're defining explicitely:

You could do the level numbers thing or you could write "tentative" or "unknown" in the last column of a compound list. This has the advantage that in a level 3 run where SMILES is present BUT tentative a COMMENT could be auto-added that says "Tentative spectrum".

Minor detail: If it's not the level numbers but "known", "tentative" or "unknown" (easier to understand for the user), what should the column name be?

Writing an auto-detect would also be no big deal, really, but then we couldn't do the level 3 autocomment since we wouldn't know whether the SMILES is correct or just tentative.

Whether they are actually processed one after the other (normal, then tentative, then unknown) or not is actually irrelevant (to the end user at least - Michele agreed that it makes sense to process separately).

It definitely makes sense, but RMassBank is not structured to do several runs. Making the 3 runs separate is a solution that would work, but it'd probably take a bit longer and in the end would be just as complicated to program. I'd rather process on a compound by compound basis, since it does that already - I already changed quite a few internal functions to be able to discern between known, tentative and unknown, so I might as well walk the whole nine yards and do that for the rest as well.

I can undo the added "retrieval" parameter to msmsWorkflow, mbWorkflow and msmsRead and everything works internally. Then the only change for the user would be in the compound list.

The recalibration function can be calculated on the "knowns".

Ahhhh, that makes sense. But, tentative with SMILES has to be explicitely stated then. Can't get around that?

� Reply to this email directly or view it on GitHubhttps://github.com/MassBank/RMassBank/issues/42#issuecomment-169786024.

— Reply to this email directly or view it on GitHubhttps://github.com/MassBank/RMassBank/issues/42#issuecomment-169806424.

meowcat commented 8 years ago

Do you think I can pull this branch to the main repo already?

schymane commented 8 years ago

@meowcat I suspect it will change but if it makes your life easier for now, I vote yes. @ermueller I just emailed you a "translation table" csv for level, keyword and resulting comment, hope you get it but let me know if your email is still playing up...

schymane commented 8 years ago

w <- msmsRead(w, filetable="E:/UchemSpectra/Processed/AndreaTPs_ErikRMB/AndreaTPs_FileList_2014_10_tentatives.csv", mode=RMB_mode, readMethod = "mzR",retrieval="tentative")

lapply(w@spectra,function(s) s@found) $CYP1.mzML [1] TRUE $CYP1.mzML [1] TRUE ....

Yay :)

ermueller commented 8 years ago

Thanks for the csv! I can definitely work with that.

@meowcat @uchem-massbank Please don't pull yet, I'll rework some stuff. Don't want "nonfinal" things in the main repo. It'll probably be finished next week if I don't hit any major roadblocks. :)

uchem-massbank commented 8 years ago

No probs! I’ve got my first set of tentative records using your workflow (much better calibration than for the knowns, because I had too few “knowns” => this really adds to the argument to do all together) and I’m about to try the first unknown. Pretty happy so far ;)

ermueller commented 8 years ago

Wait, but @schymane said:

We don't want to recalculate the recalibration on tentatives (and we can't recalculate it on unknowns, like you said).

So, now I'm kinda confused at what exactly to do with tentatives in recalibration. You mean level 4 (formula only), correct? I mean, there definitely is the possibility that someone knows the formula for sure, but just not the structure, so why not use it in recal? On the other hand, if the formula turns out to be wrong that could be mighty bad for recal.

Introduce 4a and 4b? 4a "Definitive molecular formula", 4b "Likely molecular formula"?

uchem-massbank commented 8 years ago

My email account is playing tricks on my identity, sorry – half the time it answers as uchem-mb ;) So – the rule with the recalibration is as long as we have a formula (or better still smiles) we can recalibrate. Irrespective of whether the structure is tentative or not. So, if it’s not Level 5, let’s use it in the recalibration. We really need this. No to the multiple level 4s – level 4 is defined as an unequivocal molecular formula, if you are not sure what the formula is, it’s Level 5.

schymane commented 8 years ago

@ermueller RE: your email:

w <- msmsRead(w,mode="pH",files=w@files,readMethod="mzR",cpdids=c(1,2,3),retrieval="tentative") w <- msmsWorkflow(w, mode="pH", steps=2:8,archivename="TEST", retrieval="tentative") mb <- newMbWorkspace(w) mb <- mbWorkflow(mb, 1:2, retrieval="tentative") mb <- loadInfolist(mb,"./infolist.csv") mb <- mbWorkflow(mb, 3:8, retrieval="tentative")

This seems to actually work fine without retrieval=”tentative” after the msmsRead stage. Which was nice ;)

schymane commented 8 years ago

@ermueller @meowcat I solved the error in the unknown file so you can ignore the email from yesterday - in this case retrieval="unknown" is essential at all steps, unlike retrieval="tentative" which I only needed for msmsRead. So now I have the records of the unknown too ;) and the recalibration code in the vignette works. Look forward to the next version!

meowcat commented 8 years ago

From looking at the code today when merging, it seemed to me that actually, the way the "standard" / "tentative" / "unknown" arguments are used is a bit redundant with the actual specifications in the compoundlist - the argument would not actually be needed when processing, since the entry in the compoundlist already gives it away whether a compound is a known, tentative or unknown identification... Just a thought...

uchem-massbank commented 8 years ago

I had assumed that this would be read from the compound list in the future version rather than defined explicitly, this would also make much more sense to me. But I guess Erik knows best what he is doing :-)

ermueller commented 8 years ago

Yes, I would remove these arguments from the functions. :) It wouldn't make any sense to doubly specify that stuff.

ermueller commented 8 years ago

You all got a mail. It's finished! - I hope.

I'm sorry, @meowcat, I just realized you weren't a recipient, because I forgot to add you. I just hit "reply all" and wrote in there on the big mail back-and-forth from yesterday. The new stuff is in my repo as well.

schymane commented 8 years ago

Great :-) Can’t wait to try it out. Just one question:

ermueller commented 8 years ago

Empty files. I can change that, too?

schymane commented 8 years ago

Yes, would be much better to have no files at all and no entry in list.tsv…

ermueller commented 8 years ago

Did that, it's in my repo now with no molfiles at all and no entry in list.tsv

meowcat commented 8 years ago

Hi Erik,

really cool. I am just running a few tests (not everything perfect yet but I want to examine the user error component first :) )

One suggestion, what about changing findMz:

findMz <- function(cpdID, mode="pH", ppm=10, deltaMz=0)
{
    retrieval <- findLevel(cpdID, TRUE)
    # In case of unknown: m/z is in table

etc? so you could take out the retrieval argument from many more places.

I merged everything into master here already.

meowcat commented 8 years ago

wrt using neutral or charged masses in the compoundlist mz column, this is a bit of a question of definition. Technically, findMz should return the mass according to the "mode" set ("pH", "mH" etc) so using the neutral mass in the compoundlist would be more consistent. However, I don't know what @schymane thinks about this, since "in our heads" we typically think of the m/z and not of the uncharged mass; also there is no guarantee for level 5 (which means no certain formula) that m/z - massH + electron is actually the neutral M. Opinions?

meowcat commented 8 years ago

Also, findFormula() could do findLevel by itself without needing the argument.

schymane commented 8 years ago

Yes, good point. The Level 5 mass is indeed the charged one (“the exact mass of interest”) and therefore would not need the mode correction – which is counter-intuitive to the mode setting I agree. But if we change to the neutral mass, users will have to “guess” the adduct state to input a neutral mass that will mean that RMassBank can calculate back to the exact mass they were interested in in the first place and this could get messy… I’d see this as more of a risk. I’d vote to stick to the level 5 being the charged mass (“exact mass of interest”) and it should be clearly stated somewhere that level 5 masses are NOT adjusted according to the mode settings?

From: meowcat [mailto:notifications@github.com] Sent: Thursday, 21 January 2016 1:46 PM To: MassBank/RMassBank RMassBank@noreply.github.com Cc: Schymanski, Emma Emma.Schymanski@eawag.ch Subject: Re: [RMassBank] Processing of tentatives/unknowns (#42)

wrt using neutral or charged masses in the compoundlist mz column, this is a bit of a question of definition. Technically, findMz should return the mass according to the "mode" set ("pH", "mH" etc) so using the neutral mass in the compoundlist would be more consistent. However, I don't know what @schymanehttps://github.com/schymane thinks about this, since "in our heads" we typically think of the m/z and not of the uncharged mass; also there is no guarantee for level 5 (which means no certain formula) that m/z - massH + electron is actually the neutral M. Opinions?

— Reply to this email directly or view it on GitHubhttps://github.com/MassBank/RMassBank/issues/42#issuecomment-173559701.

meowcat commented 8 years ago

@everyone: I branched off all the issues with installation that are unrelated to this question so this stays a bit more focused... #149

meowcat commented 8 years ago

after a bit of discussion with @schymane in the office:

on one point you are right about the mass of interest often being the m/z, on the other hand this depends on the workflow (are you using a nontarget list, are you recording spectra of tentatively identified compounds etc.) For example, I have an MS/MS method that records 9 positive and 9 negative MS2 for every compound in the file. Here it would be much easier to process both with mode "pH" then "mH" starting from the neutral mass.

The first solution we were talking about would be to add one more column to the compound list that specifies what is in the mz column, or a parameter to loadList which specifies how to interpret the mz column. However I have another idea now - can we just check what the column is called? so if the column is called m/z or mz it is a charged mass, and if the column is called "mass" it is the neutral mass. Would this be confusing?

schymane commented 8 years ago

What about mz and neutral_mass – then there is no room for doubt?

meowcat commented 8 years ago

fixed a bug in mbworkflow step 6 (in this repo)

ermueller commented 8 years ago

Yeah, mz or "m/z" and neutral_mass sound great.

Currently I'm calculating the mass with findMass by doing this:

if(mode == "pH") {
            mass <- 1.00784
            mode.charge <- 1
        } else if(mode == "pNa") {
            mass <- 22.989769
            mode.charge <- 1
        } else if(mode == "pM") {
            mass <- 0
            mode.charge <- 1
        } else if(mode == "mM") {
            mass <- 0
            mode.charge <- -1
        } else if(mode == "mH") {
            mass <- -1.00784
            mode.charge <- -1
        } else if(mode == "mFA") {
            mass <- 59.0440
            mode.charge <- -1
        } else if(mode == "pNH4") {
            mass <- 18.03846
            mode.charge <- 1
        }

and then calling

findMz(cpdID_or_smiles, mode=mode, retrieval=retrieval)$mzCenter - mass + mode.charge * .emass

or is that bad?

Also @meowcat yes, I will remove the retrieval parameter for these functions, I dunno what I was thinking. Didn't see the forest for the trees there.

meowcat commented 8 years ago

Currently I'm calculating the mass with findMass by doing this:

One could just try inside findMz:

    if(retrieval == "unknown"){
        mz <- .listEnvEnv$listEnv$compoundList[which(.listEnvEnv$listEnv$compoundList$ID == cpdID),.listEnvEnv$listEnv$mzCol]
        if(neutralblabla)
        {
                shift <- findMz.formula("", mode)
                mz <- mz + shift$mzCenter
        }
        # etc
        }

and add a check for empty string in findMz.formula, hope I'm not forgetting anything here...

ermueller commented 8 years ago

That doesn't work for "mM"(Error), or "mH"(Just returns 1.0000X when it should return -1.0000X)

ermueller commented 8 years ago

Oh and pM either. But I could just handle these cases in an "if".

meowcat commented 8 years ago

Hm but this means my function is working incorrectly. I will fix this :)

meowcat commented 8 years ago

Try the fix in master branch d05725f

meowcat commented 8 years ago

Digging up this old one. I am using tentative processing routinely. But we still don't have the "neutral_mass" option, right? Or is it in a branch I don't have? I would need it urgently...

schymane commented 8 years ago

@meowcat is now building in the option to treat the mass as either neutral or charged … we just talked about it

meowcat commented 8 years ago

As @schymane said, I am fixing this (currently in the S4power branch) with an option in the settings file, which is also a parameter to the findMz function:

unknownMass: charged # (or absence of the option): behaviour as up to now (the mass is m/z)
unknownMass: neutral # the mass is treated as neutral mass and +H-e or -H+e are applied accordingly.
meowcat commented 4 years ago

@schymane @tsufz Vignette for processing of unknowns?

schymane commented 4 years ago

Sounds like a good idea ... do we not have it in the nonstandard vignette?