Processing of tentatives/unknowns

sneumann commented 10 years ago

Issue by schymane from Monday Oct 28, 2013 at 13:42 GMT Originally opened as https://github.com/sneumann/RMassBank/issues/34

This relies on other discussions to set the tagging. TBC.

schymane commented 9 years ago

Talk about this after BioC release

schymane commented 9 years ago

Erik and I also discussed this today. A summary:

Michele and I both feel that MetFrag in RMassBank is out of scope, certainly for the moment. More fundamental things need to be addressed first to enable RMassBank to even create tentative/unknown spectra suitable for upload to MassBank. Assuming the user knows why they want the spectra and how they identified them, we need the right comment fields and the correct way to represent the results. We also need to decide what to do without structures, and how to skip the recalibration and formula annotation when no formula is available (but still do both if we have a formula but no structure).

Case 1: we have a structure but it's not certain => this is only a comment in the MassBank record

Case 2: we aren't exactly sure about the structure but we have a formula => can still perform most aspects of the workflow

Case 3: we have an exact mass with a cool spectrum that we want to share with the world, but we can't identify it

Strategy: RMassBank works off a compound list where the user gives SMILES (best case) or, for tentatives/unknowns, either a formula or a mass along with a comment field. RMassBank takes this information and prepares the records as best it can (note, we probably need to deal with mixed lists...).

Once we have the records there is absolutely nothing stopping anyone doing MetFrag on them. We actually have almost all the bits and pieces ready, but this is a different connection/scope and in my opinion the job of another package and not RMassBank at the moment.

ermueller commented 8 years ago

So, this is theoretically finished, but it needs cleanup and stuff like that. Docs are already in there, but this is all subject to change, since I already typed up a list for myself on where to clean up some code.

You'll all get a mail today with everything, I'll just test this a bit more and then type up some instructions.

Also, add a vignette for tentatives/unknowns? I feel that we have to. But I'd make that a separate issue.

EDIT: Just to clarify, I'll push the "unclean" version for now - the cleanup stuff would take quite some time.

schymane commented 8 years ago

Great! I agree that we need a vignette and that it should be a separate issue. Look forward to testing it ;)

ermueller commented 8 years ago

I hope you all read this, since my uni mailserver half-crashed 5 minutes ago:

I forgot the attachment in my mail, but you can install the package from git from my master branch. https://github.com/ermueller/RMassBank

schymane commented 8 years ago

Minor feedback on point 3 of your email, @ermueller:

> a) If you want to process unknowns you need an additional column "m/z" with your target m/z.

Michele and I would be very happy if you took "mz" instead of "m/z" to avoid ugly conversion issues downstream.

ermueller commented 8 years ago

No problem, I'll rework that. Can both be valid? :)
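Accepting both spellings could be as simple as normalising the header when the compound list is read. A minimal sketch in Python (RMassBank itself is R; `MZ_ALIASES`, `normalise_mz_header` and `read_compound_list` are hypothetical names for illustration, not RMassBank functions):

```python
import csv
import io

# Hypothetical alias set: both spellings map to the canonical "mz".
MZ_ALIASES = {"mz", "m/z"}


def normalise_mz_header(fieldnames):
    """Return the header with any accepted m/z spelling renamed to "mz"."""
    normalised = []
    for name in fieldnames:
        key = name.strip().lower()
        normalised.append("mz" if key in MZ_ALIASES else name.strip())
    # Refuse lists that carry both spellings, since the intent is ambiguous.
    if normalised.count("mz") > 1:
        raise ValueError('compound list has both "mz" and "m/z" columns')
    return normalised


def read_compound_list(text):
    """Parse CSV text into a list of dicts keyed by the normalised header."""
    reader = csv.reader(io.StringIO(text))
    header = normalise_mz_header(next(reader))
    return [dict(zip(header, row)) for row in reader]
```

Downstream code would then only ever see "mz", whichever spelling the user typed.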

Also, I'll probably autodetect what you can do with a given compound list and print a message when the list is loaded. Something like "This compound list can be used for the following retrieval methods: standard, tentative"

Also, I'll save that somewhere internally and make workflow retrieval methods unavailable if the loaded compound list is not appropriate. Sound good?
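The autodetection idea could be sketched roughly as follows, in Python for illustration. The column requirements in `REQUIRED_COLUMNS` are an assumption (SMILES for standard, formula for tentative, mz for unknown), and all function names are hypothetical:

```python
# Hypothetical mapping: which columns each retrieval method needs.
REQUIRED_COLUMNS = {
    "standard": {"SMILES"},
    "tentative": {"Formula"},
    "unknown": {"mz"},
}


def available_methods(columns):
    """List the retrieval methods whose required columns are all present."""
    present = set(columns)
    return [m for m, needed in REQUIRED_COLUMNS.items() if needed <= present]


def describe(columns):
    """Build the message shown to the user when the list is loaded."""
    methods = available_methods(columns)
    if not methods:
        return "This compound list cannot be used with any retrieval method."
    return ("This compound list can be used for the following retrieval "
            "methods: " + ", ".join(methods))


def check_method(method, columns):
    """Refuse a retrieval method that the loaded list does not support."""
    if method not in available_methods(columns):
        raise ValueError(f"retrieval method {method!r} is not available "
                         "for the loaded compound list")
```

Here `check_method` plays the role of the internal gate: a workflow call would run it before doing any work.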

schymane commented 8 years ago

I'm just trying to wrap my head around your way of building it, because I envisaged doing it all together. Two conceptually challenging bits:

1) We'll have to transfer the recalibration from the standard workflow to the tentatives and unknowns.

2) We have the problem that we now have several runs for one dataset, and this is a teeny tiny problem at the record generation stage because the list file is overwritten, see #145.

I'm out of time for now but can get you more feedback tonight or tomorrow.

ermueller commented 8 years ago

1) Transfer recalibration for unknowns (without annotation)? How would you calculate (and adjust) dppm for peaks if there are no formulas? Recalibration should work for tentatives.

2) Several runs for one dataset? I'm not sure I understand. Different in what respect? I can def program a multiprocessing in for pretty much every option, as long as the resulting records would have a different accession.

EDIT: Oh yeah, now I get it, you want the "level" column added and used for every compound differently, right? I wanted to do that, too - just wanted a more general solution first. I'm incredibly sorry that it turned out to be a hassle for you. :( I'll try my best to implement that asap.

schymane commented 8 years ago

Yeah so I was envisaging that you read out from the compound list what the case is and just do what you can (smiles = normal workflow, no smiles but formula = tentative, no smiles + no formula = unknown. No smiles, no formula, no mz = error). Because I have these all mixed together in my one compound list and each one has a unique ID, so there are no accession clashes. Whether they are actually processed one after the other (normal, then tentative, then unknown) or not is actually irrelevant (to the end user at least - Michele agreed that it makes sense to process separately).

The recalibration function can be calculated on the "knowns". This can then be used to adjust the masses in the subsequent tentatives and/or unknowns - this must be possible because even the masses of the fail peaks can be adjusted with the calibration. We don't want to recalculate the recalibration on tentatives (and we can't recalculate it on unknowns, like you said).

But, I still like the idea that one has to define explicitly whether to include tentatives or unknowns. The default workflow should certainly stay with the knowns (so the normal user gets an error if smiles are missing, for example).

From the user point of view, it's much easier if it's all bunched together in one compound list, one file list (the files are organised by machine runs and e.g. for us are mixed knowns, tentatives and unknowns) and I get one infolist and failpeak list in the end to check, not 3. Of course for the unknown we'd have to have no fail peaks... right? Because they'd all fail. Does this make sense?
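The per-compound rule described above, together with the explicit opt-in, could look roughly like this. It is a Python sketch for illustration only: the column names "SMILES", "Formula", "mz" and "ID" and the functions `classify` and `plan` are assumptions, not RMassBank code.

```python
def classify(row):
    """Decide how one compound-list row can be processed."""
    def filled(key):
        value = row.get(key)
        return value is not None and str(value).strip() != ""

    if filled("SMILES"):
        return "standard"      # normal workflow
    if filled("Formula"):
        return "tentative"     # no structure, but a formula
    if filled("mz"):
        return "unknown"       # only a target m/z
    raise ValueError(f"compound {row.get('ID')!r}: no SMILES, formula or mz")


def plan(rows, include=("standard",)):
    """Group row IDs by class; fail on classes the user did not opt into."""
    groups = {"standard": [], "tentative": [], "unknown": []}
    for row in rows:
        kind = classify(row)
        if kind not in include:
            raise ValueError(f"compound {row.get('ID')!r} is {kind!r}, "
                             f"but only {list(include)} were requested")
        groups[kind].append(row.get("ID"))
    return groups
```

With the default `include`, a mixed list raises an error at the first row without SMILES, which keeps the standard workflow strict. Passing `include=("standard", "tentative", "unknown")` accepts everything in one pass. A row whose SMILES is only tentative would still be classified as "standard" here, since the rule looks only at which fields are filled.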

ermueller commented 8 years ago

> I still like the idea that one has to define explicitly whether to include tentatives or unknowns.

If we're defining explicitly:

You could do the level numbers thing or you could write "tentative" or "unknown" in the last column of a compound list. This has the advantage that in a level 3 run where SMILES is present BUT tentative a COMMENT could be auto-added that says "Tentative spectrum".

Minor detail: If it's not the level numbers but "known", "tentative" or "unknown" (easier to understand for the user), what should the column name be?

Writing an auto-detect would also be no big deal, really, but then we couldn't do the level 3 autocomment since we wouldn't know whether the SMILES is correct or just tentative.

> Whether they are actually processed one after the other (normal, then tentative, then unknown) or not is actually irrelevant (to the end user at least - Michele agreed that it makes sense to process separately).

It definitely makes sense, but RMassBank is not structured to do several runs. Making the 3 runs separate is a solution that would work, but it'd probably take a bit longer and in the end would be just as complicated to program. I'd rather process on a compound by compound basis, since it does that already - I already changed quite a few internal functions to be able to discern between known, tentative and unknown, so I might as well go the whole nine yards and do that for the rest as well.

I can undo the added "retrieval" parameter to msmsWorkflow, mbWorkflow and msmsRead and everything works internally. Then the only change for the user would be in the compound list.

> The recalibration function can be calculated on the "knowns".

Ahhhh, that makes sense. But tentative with SMILES has to be explicitly stated then. Can't get around that?
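To illustrate the idea of fitting the recalibration on the knowns and only applying it to tentatives and unknowns, here is a minimal Python sketch. A straight-line fit of ppm error against m/z stands in for whatever recalibration model is actually used; that choice, and the names `fit_ppm_line` and `recalibrate`, are assumptions for illustration.

```python
def fit_ppm_line(measured, theoretical):
    """Least-squares line of ppm error vs. measured m/z, fitted on knowns."""
    xs = list(measured)
    ys = [(m - t) / t * 1e6 for m, t in zip(measured, theoretical)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct m/z values to fit")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def recalibrate(mz, model):
    """Correct a measured m/z with a model fitted elsewhere (no refit)."""
    slope, intercept = model
    predicted_ppm = slope * mz + intercept
    # measured = true * (1 + ppm * 1e-6), so divide the predicted error out.
    return mz / (1 + predicted_ppm * 1e-6)
```

The point of the split is that `fit_ppm_line` needs theoretical masses, which only annotated peaks of knowns provide, while `recalibrate` needs nothing but the measured m/z, so it can be applied to peaks of unknowns as well.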

tsufz commented 8 years ago

I also like the idea of tagging the entries with the known, tentative and unknown tags. Might it be possible to use both? Levels and tags? The levels are more advanced, but the tags are a 'quick and dirty' thing. Don't overload the users with requirements.

I agree with Erik to process in parallel if possible. I don't know what the performance differences between separate and parallel processing are. But from the user's point of view it is more reasonable to follow only one loop in the frontend. In the backend a stacked processing might be reasonable.

schymane commented 8 years ago

I already have the levels in my compound list and also already that level 3 case you mention :). I see two possibilities: either you leave the user free to type what they want, and then the records are a mess because it's just text, or we specify exact words and levels and build the real comment behind the scenes, so the user could enter either 3 or tentative, for example.

We'd have confirmed or reference standard (level 1), probable library (2a), probable diagnostic (2b), tentative (3), formula (4) or unknown (5). What do you think?

The wording for the exact comment is a bit longer... and my problem is that I sometimes sub-categorised level 3 too: 3a structure, 3b isomer, 3c substance class, 3d best match? But this is not written anywhere. 3a only if smiles are given.
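Accepting either the level or the word and mapping both to one canonical pair could be sketched like this in Python. The level/label pairs are taken from the list above; the names `LEVELS`, `SYNONYMS` and `normalise_confidence` are hypothetical, and the wording of the final record comment is deliberately left out because it has not been decided here.

```python
# Level -> label, as listed in the comment above.
LEVELS = {
    "1": "confirmed",
    "2a": "probable library",
    "2b": "probable diagnostic",
    "3": "tentative",
    "4": "formula",
    "5": "unknown",
}

# Alternative wording mentioned for level 1.
SYNONYMS = {"reference standard": "1"}


def normalise_confidence(entry):
    """Map a user entry (level or word) to a canonical (level, label) pair."""
    key = str(entry).strip().lower()
    if key in LEVELS:
        return key, LEVELS[key]
    by_label = {label: level for level, label in LEVELS.items()}
    by_label.update(SYNONYMS)
    if key in by_label:
        level = by_label[key]
        return level, LEVELS[level]
    raise ValueError(f"unrecognised confidence entry: {entry!r}")
```

Anything outside the fixed vocabulary is rejected, which avoids the free-text mess while still letting the user type `3` or `tentative` interchangeably.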


uchem-massbank commented 8 years ago

Column name 'Confidence'?
