cdk / cdk

The Chemistry Development Kit
https://cdk.github.io/
GNU Lesser General Public License v2.1
477 stars 155 forks source link

Isotope pattern with composition #493

Closed SteffenHeu closed 5 years ago

SteffenHeu commented 5 years ago

Hi, I implemented an isotope pattern calculator into MZmine 2 that also contains an isotope composition, which was used to assign the isotope composition to peaks in mass spectrometry data.

[Edit: For example, if an isotope pattern was calculated, the IsotopePattern object only contains mass and intensity values. The extended functionality provided an additional String that contained "35Cl/37Cl/" which lets you access the elements and their mass numbers in a specific isotopic peak to assign them to acquired ms-spectra which was useful for more complex molecules or tandem ms data. This String was processed to yield something like "^12C6 ^35Cl ^37Cl" (See PR1 for more)]

@tomas-pluskal recommended to port this to the CDK to remove duplicate code. Would you want this functionality in the CDK? I guess, I would need to do some changes to the IsotopePattern and IsotopePatternGenerator classes. Since I'm just a master student I don't have that much time during the semester, so I would only do this, if you think this would be beneficial.

This is what I'm referring to. These are the original PRs in case you want to have a look first. (PR1, PR2)

Best regards, Steffen

tomas-pluskal commented 5 years ago

@SteffenHeu Perhaps you should explain more clearly what you mean by "isotope composition".

johnmay commented 5 years ago

Do you have an example usage snippet - couldn't find the unit test with a quick look. I can see this would be useful but the CDK is already quite bloated and so I'd like to keep public APIs as tight as possible.

SteffenHeu commented 5 years ago

I created a small snippet here, which will give you this output:

Okt 20, 2018 2:17:25 PM net.sf.mzmine.modules.tools.isotopepatternpreview.Test main INFORMATION: ^35Cl1 ^37Cl1 + ^35Cl2 ^13C2 Okt 20, 2018 2:17:25 PM net.sf.mzmine.modules.tools.isotopepatternpreview.Test main INFORMATION: ---------------- mass: 141.93715678009056 intensity: 1.0 ^35Cl2 mass: 142.94051162009055 intensity: 0.06489436975639341 ^35Cl2 ^13C1 mass: 143.93423310552126 intensity: 0.6416702223808065 ^35Cl1 ^37Cl1 + ^35Cl2 ^13C2 mass: 144.93756741269797 intensity: 0.041552219016425376 ^35Cl1 ^37Cl1 ^13C1 + ^35Cl2 ^13C3 mass: 145.93136144015367 intensity: 0.10349603387521733 ^37Cl2 + ^35Cl1 ^37Cl1 ^13C2 + ^35Cl2 ^13C4

The 12C isotope will not be displayed by default to make the description more compact.

This picture shows a good example of how the code could be used: preview It's a simple isotope pattern calculation that will also display the composition in the tool tips since they are also stored in the isotope pattern class.

Edit: This is where the isotope pattern is calculated and plotted for the preview window.

tomas-pluskal commented 5 years ago

I think it makes sense for the IsotopePattern class to store the information about each isotope in the pattern. For example, each isotope could be represented as an IAtomContainer that contains the corresponding IIsotope instances.

Converting this composition to a String can be done by a separate utility class.

johnmay commented 5 years ago

Looks fine, as far as I can tell it's a single class right?

Converting this composition to a String can be done by a separate utility class.

Or just use the MF? Looks better as HTML but here's an example

IChemObjectBuilder bldr = SilentChemObjectBuilder.getInstance();
IMolecularFormula mf = bldr.newInstance(IMolecularFormula.class);
mf.addIsotope(new Atom("C"), 6);
mf.addIsotope(new Atom("35Cl"), 2);
assertThat(MolecularFormulaManipulator.getString(mf, false, true), is("C6[35]Cl2"));
IMolecularFormula mf2 = bldr.newInstance(IMolecularFormula.class);
mf2.addIsotope(new Atom("C"), 5);
mf2.addIsotope(new Atom("13C"), 1);
mf2.addIsotope(new Atom("35Cl"), 2);
assertThat(MolecularFormulaManipulator.getString(mf2, false, true), is("C5[13]C[35]Cl2"));
IMolecularFormula mf3 = bldr.newInstance(IMolecularFormula.class);
mf3.addIsotope(new Atom("C"), 5);
mf3.addIsotope(new Atom("13C"), 1);
mf3.addIsotope(new Atom("35Cl"), 1);
mf3.addIsotope(new Atom("37Cl"), 1);
assertThat(MolecularFormulaManipulator.getString(mf3, false, true), is("C5[13]C[35]Cl[37]Cl"));
SteffenHeu commented 5 years ago

Looks fine, as far as I can tell it's a single class right?

If you are referring to the ExtendedIsotopePattern yes. It does all the calculations on it's own because the IsotopePatternGenerator did not yield the information I needed, but from what I can tell the way of calculation is pretty similiar.

Or just use the MF? Looks better as HTML but here's an example

This looks like a good idea. But you would have to create one MF for every isotopic peak, so i think the IsotopePattern class would use a lot more memory than it does at the moment. Does it maybe make sense to have a additional class that extends the current IsotopePattern class, so programms that use it at the moment are not affected by the change?

johnmay commented 5 years ago

would use a lot more memory than it does at the moment

You could reuse the objects to avoid the mallocs, but sure

Does it maybe make sense to have a additional class that extends the current IsotopePattern class, so programms that use it at the moment are not affected by the change?

Well any patches should not break existing downstream code (except for major version increments).

You'll be much more familiar with these parts of the code than me but you could look at your class as simply a different isotope generator algorithm right? For you example:

mass: 141.93715678009056    intensity: 1.0  ^35Cl2
mass: 142.94051162009055    intensity: 0.06489436975639341  ^35Cl2 ^13C1
mass: 143.93423310552126    intensity: 0.6416702223808065   ^35Cl1 ^37Cl1 + ^35Cl2 ^13C2
mass: 144.93756741269797    intensity: 0.041552219016425376 ^35Cl1 ^37Cl1 ^13C1 + ^35Cl2 ^13C3
mass: 145.93136144015367    intensity: 0.10349603387521733  ^37Cl2 + ^35Cl1 ^37Cl1 ^13C2 + ^35Cl2 

Could be represented as isotope containers - in the existing IsotopePattern.

johnmay commented 5 years ago

Okay rereading original comment now:

The extended functionality provided an additional String that contained "35Cl/37Cl/" which lets you access the elements and their mass numbers in a specific isotopic peak to assign them to acquired ms-spectra which was useful for more complex molecules or tandem ms data.

Okay so I think this is just a bug as the MF should be provided by the isotope container but it's not:

IMolecularFormula molFor = MolecularFormulaManipulator.getMajorIsotopeMolecularFormula("C6Cl2", builder);
IsotopePatternGenerator isotopeGe = new IsotopePatternGenerator(0.01);
IsotopePattern isos = isotopeGe.getIsotopes(molFor);
for (IsotopeContainer container : isos.getIsotopes()) {
    System.out.printf("mass=%.4f, intensity=%.2f, %s\n", container.getMass(), container.getIntensity(), container.getFormula());
}

gets:

mass=141.9377, intensity=1.00, null
mass=142.9411, intensity=0.06, null
mass=143.9348, intensity=0.64, null
mass=144.9381, intensity=0.04, null
mass=145.9318, intensity=0.10, null

So as far as I can tell, essentially what is needed is some extra options for the generator, and the MF to not be null:

IMolecularFormula molFor = MolecularFormulaManipulator.getMajorIsotopeMolecularFormula("C6Cl2", builder);
IsotopePatternGenerator isotopeGe = new IsotopePatternGenerator(0.01);
//        isotopeGe.setMinIntensity(0.03); // new constraint?
//        isotopeGe.setMzMerge(0.02); // new  constraint?
IsotopePattern isos = isotopeGe.getIsotopes(molFor);
for (IsotopeContainer container : isos.getIsotopes()) {
    // container.getFormula() new: should not be null
    System.out.printf("mass=%.4f, intensity=%.2f, %s\n", container.getMass(), container.getIntensity(), container.getFormula());
}
tomas-pluskal commented 5 years ago

I agree there is no need to create a new class.

I think it would be useful for each IsotopeContainer to also have an information about its composition in terms of IIsotope instances and their counts (besides the molecular formula). Like @johnmay said, the IIsotope instances can be reused, so it won't take much extra memory.

johnmay commented 5 years ago

So I've had a play around with the existing generator and made some changes, and optimisations (#494). A lot of the cdk-formula code needs rewriting/optimising but when I see obvious things it's easy to fix.

I did make the ability to store formulas optional, as it made one test (C10000) take a second or so (optimising the MF data structure would fix that but don't fell like doing that now). One thing that wasn't clear was this:

mass: 143.93423310552126    intensity: 0.6416702223808065   ^35Cl1 ^37Cl1 + ^35Cl2 ^13C2

Do I read that has two different compositions for that peak?

However I get the following different masses for those compositions:

143.93475526999998 = 12C635Cl37Cl 142.94106019999998 = 12C513C35Cl2

tomas-pluskal commented 5 years ago

I think the second composition should have two 13C atoms:

12C413C235Cl2 = 143.9438666

But that is still almost 10 mDa off.

johnmay commented 5 years ago

But that is still almost 10 mDa off.

Ah I think the merge parameter was set 0.02, I used 0.00005 which was the current default.

tomas-pluskal commented 5 years ago

I think the TOLERANCE for merging should definitely be a user-specified parameter, same as the minimal abundance.

johnmay commented 5 years ago

Yep I think that makes senses as depends on what resolution you have.

tomas-pluskal commented 5 years ago

The code for merging should also update the mass of each container, e.g. as a weighted average of the masses of all isotopes in the container?

johnmay commented 5 years ago

Okay I updated the patch as suggested - also again made further speed improvements. There is actually a much faster algorithm when I started thinking about it but is OK for now.

tomas-pluskal commented 5 years ago

Thanks, John! Do I understand correctly that setting the merge treshhold to 0 will make it generate all isotopes without merging? (I think it is important to have the merging as an optional operation)

tomas-pluskal commented 5 years ago

@SteffenHeu can you check if the updated isotope pattern generator covers all your needs?

SteffenHeu commented 5 years ago

I've had a quick look which seemed good, but i'll have to check that with some more time, should be able to do that during the weekend. Did you not have to make any changes to the IsotopePattern class? I just see changes to the IsotopeContainer, MFManipulator, IPGenerator and the test class. Can i access the molecular formula of the pattern via a method in IsotopePattern?

johnmay commented 5 years ago

Yes no changes needed, only API additions was IsotopeContainer needed to hold multiple formulas.

johnmay commented 5 years ago

Can i access the molecular formula of the pattern via a method in IsotopePattern?

Yes, but you always could via the IsotopeContainers the issue is in the generator it wasn't sent.

tomas-pluskal commented 5 years ago

@SteffenHeu John is planning to make a new CDK release this weekend, so it would be great if could check if the current code fulfills all your needs asap. Otherwise we'll have to wait for the next CDK release, which might be a few months down the road.

SteffenHeu commented 5 years ago

I just tested some stuff and it looks good to me. Thanks John!