MassBank / MassBank-data

Official repository of open data MassBank records

Some records have duplicated fragmentation mode information in 2022.12 release #207

Closed by jorainer 1 year ago

jorainer commented 1 year ago

I stumbled across some inconsistencies in the MassBank AC_MASS_SPECTROMETRY table: there are 21 spectra (records) that have a duplicated FRAGMENTATION_MODE SUBTAG:

Example:

MariaDB [MassBank]> select * from AC_MASS_SPECTROMETRY where RECORD = 'MSBNK-IPB_Halle-PB010101';
+------+--------------------------+-----------------------+------------+
| ID   | RECORD                   | SUBTAG                | VALUE      |
+------+--------------------------+-----------------------+------------+
| 4820 | MSBNK-IPB_Halle-PB010101 | IONIZATION            | ESI        |
| 4821 | MSBNK-IPB_Halle-PB010101 | CAPILLARY_TEMPERATURE | 190 C      |
| 4822 | MSBNK-IPB_Halle-PB010101 | CAPILLARY_VOLTAGE     | 5000V      |
| 4823 | MSBNK-IPB_Halle-PB010101 | COLLISION_ENERGY      | 31.6361675 |
| 4824 | MSBNK-IPB_Halle-PB010101 | FRAGMENTATION_MODE    | CID        |
| 4825 | MSBNK-IPB_Halle-PB010101 | FUNNEL1_RF            | 200Vpp     |
| 4826 | MSBNK-IPB_Halle-PB010101 | FUNNEL2_RF            | 300VPP     |
| 4827 | MSBNK-IPB_Halle-PB010101 | HEXAPOLE              | RF 100VPP  |
| 4828 | MSBNK-IPB_Halle-PB010101 | COLLISION_CELL_RF     | 400VPP     |
| 4829 | MSBNK-IPB_Halle-PB010101 | TRANSFER_TIME         | 70µs       |
| 4830 | MSBNK-IPB_Halle-PB010101 | PREPULSE_STORAGE_TIME | 5µs        |
| 4831 | MSBNK-IPB_Halle-PB010101 | FRAGMENTATION_MODE    | CID        |
+------+--------------------------+-----------------------+------------+
12 rows in set (0.001 sec)

The FRAGMENTATION_MODE "CID" is listed twice for this spectrum.

This happens for 21 records in total:

 [1] "MSBNK-IPB_Halle-PB010101" "MSBNK-IPB_Halle-PB010301"
 [3] "MSBNK-IPB_Halle-PB010401" "MSBNK-IPB_Halle-PB010501"
 [5] "MSBNK-IPB_Halle-PB010601" "MSBNK-IPB_Halle-PB010701"
 [7] "MSBNK-IPB_Halle-PB010801" "MSBNK-IPB_Halle-PB010901"
 [9] "MSBNK-IPB_Halle-PB011001" "MSBNK-IPB_Halle-PB011101"
[11] "MSBNK-IPB_Halle-PB011201" "MSBNK-IPB_Halle-PB011301"
[13] "MSBNK-IPB_Halle-PB011401" "MSBNK-IPB_Halle-PB011501"
[15] "MSBNK-IPB_Halle-PB011601" "MSBNK-IPB_Halle-PB011701"
[17] "MSBNK-IPB_Halle-PB011801" "MSBNK-IPB_Halle-PB011901"
[19] "MSBNK-IPB_Halle-PB012001" "MSBNK-IPB_Halle-PB012101"
[21] "MSBNK-IPB_Halle-PB012201"
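
The check behind this list can be sketched as a grouped query. This is a minimal sketch, not the script used in this issue: it runs against an in-memory SQLite stand-in rather than the MariaDB instance, `demo_connection` and the `DEMO-RECORD-0001` row are made up for illustration, and it assumes that no SUBTAG should occur more than once per record.

```python
import sqlite3

# Find every (RECORD, SUBTAG) pair that appears more than once.
DUPLICATE_QUERY = """
    SELECT RECORD, SUBTAG, COUNT(*) AS n
    FROM AC_MASS_SPECTROMETRY
    GROUP BY RECORD, SUBTAG
    HAVING COUNT(*) > 1
    ORDER BY RECORD, SUBTAG
"""


def find_duplicated_subtags(conn):
    """Return (record, subtag, count) tuples for subtags repeated within a record."""
    return conn.execute(DUPLICATE_QUERY).fetchall()


def demo_connection():
    """In-memory stand-in for the MassBank table (hypothetical, abridged data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE AC_MASS_SPECTROMETRY "
        "(ID INTEGER PRIMARY KEY, RECORD TEXT, SUBTAG TEXT, VALUE TEXT)"
    )
    conn.executemany(
        "INSERT INTO AC_MASS_SPECTROMETRY VALUES (?, ?, ?, ?)",
        [
            (4820, "MSBNK-IPB_Halle-PB010101", "IONIZATION", "ESI"),
            (4824, "MSBNK-IPB_Halle-PB010101", "FRAGMENTATION_MODE", "CID"),
            (4831, "MSBNK-IPB_Halle-PB010101", "FRAGMENTATION_MODE", "CID"),
            # Made-up control record with no duplication.
            (9001, "DEMO-RECORD-0001", "FRAGMENTATION_MODE", "HCD"),
        ],
    )
    return conn


if __name__ == "__main__":
    for record, subtag, n in find_duplicated_subtags(demo_connection()):
        print(record, subtag, n)
```

The same SELECT statement should also be usable directly in the MariaDB client, since it relies only on standard GROUP BY and HAVING.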

It would be nice if that could be fixed in the 2022.12 release, as this causes errors in my scripts that query the MassBank database (where I expect only a single fragmentation mode per spectrum).

sneumann commented 1 year ago

Thanks for spotting. That is from an ancient Perl script extracting spectral data from the ACD SpecManager database. The duplication might be a result of merging FRAGMENTATION_MODE and FRAGMENTATION_METHOD. Yours, Steffen

jorainer commented 1 year ago

Are you planning to fix that and provide an updated 2022.12 release? I am asking so that I know whether I should make an intermediate fix or wait for the official fix.
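
One possible shape for such an intermediate fix on a local copy of the database is sketched below. It is an assumption, not something posted in this thread: `drop_exact_duplicates` is a hypothetical helper, the statement is written for SQLite, and it may need adapting for MariaDB. It removes only rows that repeat an earlier (RECORD, SUBTAG, VALUE) combination and keeps the row with the lowest ID.

```python
import sqlite3


def drop_exact_duplicates(conn):
    """Delete rows repeating an earlier (RECORD, SUBTAG, VALUE); keep the lowest ID.

    Returns the number of rows that were deleted.
    """
    cur = conn.execute(
        """
        DELETE FROM AC_MASS_SPECTROMETRY
        WHERE ID NOT IN (
            SELECT MIN(ID)
            FROM AC_MASS_SPECTROMETRY
            GROUP BY RECORD, SUBTAG, VALUE
        )
        """
    )
    conn.commit()
    return cur.rowcount
```

Because VALUE is part of the grouping, a record that lists two different fragmentation modes would be left untouched and would still need a manual decision.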

meier-rene commented 1 year ago

I will not fix existing releases, but rather release a fixed version with a new version number. If you can easily fix that for your dataset right now, then please do it. In addition to fixing this particular problem, I would also like to implement an automatic test that identifies similar problems in existing data and future contributions.
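
An automatic test of this kind could look roughly like the sketch below. It is not the validator that was implemented for MassBank: `duplicated_subtags` is a hypothetical helper, the sample text is abridged from the record shown earlier in this issue, and it assumes that record files carry one `AC$MASS_SPECTROMETRY: SUBTAG VALUE` line per entry and that no subtag may repeat.

```python
from collections import Counter

PREFIX = "AC$MASS_SPECTROMETRY:"

# Abridged sample, assumed to mirror the record file layout.
SAMPLE_RECORD = """\
ACCESSION: MSBNK-IPB_Halle-PB010101
AC$MASS_SPECTROMETRY: IONIZATION ESI
AC$MASS_SPECTROMETRY: COLLISION_ENERGY 31.6361675
AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE CID
AC$MASS_SPECTROMETRY: FRAGMENTATION_MODE CID
"""


def duplicated_subtags(record_text):
    """Return the sorted subtags listed more than once under AC$MASS_SPECTROMETRY."""
    counts = Counter()
    for line in record_text.splitlines():
        if line.startswith(PREFIX):
            # The first whitespace-separated field after the tag is the subtag.
            fields = line[len(PREFIX):].split(None, 1)
            if fields:
                counts[fields[0]] += 1
    return sorted(tag for tag, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(duplicated_subtags(SAMPLE_RECORD))
```

Run over every record file in a CI job, a non-empty result for any file would fail the build before a release is cut.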

jorainer commented 1 year ago

Thanks @meier-rene for the update - do you already know when you will release the next version?

meier-rene commented 1 year ago

ASAP, but first I would like to release the software stack. I would guess that I might be done fixing the data by the end of the week.

meier-rene commented 1 year ago

The issue is fixed with 43e98fbf14 in dev. The issue you reported was the only one of that kind; I could not find any other duplicates. I want to wait for an answer from a contributor before I make a new data release. The release will be very soon.

jorainer commented 1 year ago

Perfect! Thanks! And yes, I also checked all records and these were the only ones with duplicates.

meier-rene commented 1 year ago

The data is ready to be released, but I can't do it. The merge to the main branch requires a successful report from the CI pipeline. Unfortunately, the Maven repo from which we pull the SPLASH library is down, and there is nothing I can do to fix that. I would like to wait a little bit before I make major changes to the build infrastructure...

jorainer commented 1 year ago

All good - just post here (or even better close the issue) once the data is released so I get notified automatically.

meier-rene commented 1 year ago

Solved with the 2022.12.1 release. Thanks for reporting!