MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Semicolons in CH$NAME #297

Closed meowcat closed 3 years ago

meowcat commented 3 years ago

Hi,

The MassBank record specification does not say that semicolons are forbidden in CH$NAME. When a name contains one, the semicolon ends up in the record title and the record still passes validation, but compounds are then grouped incorrectly in the record index (and potentially elsewhere?). I noticed this when building a personal MassBank from the Tsugawa version of LipidBlast, which uses compound names such as

ST 27:1;O;Hex;FA 14:0


meier-rene commented 3 years ago

Thank you for your report. A fix will be included in the next release. The problem is that the semicolon is a legal character in chemical names, and we also use the semicolon as the separator in the record title. I have now made the separation of the title fields a little more precise by splitting on "; " (a semicolon followed by a space). To make it bulletproof, I would have to reject all chemical names that contain a semicolon followed by a space. I guess that would be OK in principle; I just don't know how to code that at the moment. I will leave this issue open for a while until I have finally fixed this.
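One way such a rejection could be coded is sketched below. This is only an illustration under the assumption that title fields are joined with "; "; the class and method names (`ChNameCheck`, `isSafeForTitle`) are hypothetical and not taken from the MassBank Validator.

```java
// Hypothetical check, NOT the actual MassBank Validator code.
final class ChNameCheck {

    // Assumption: the record title joins its fields with "; ".
    // A name that itself contains "; " could not be told apart from a field boundary.
    static boolean isSafeForTitle(String chName) {
        return !chName.contains("; ");
    }

    public static void main(String[] args) {
        // Semicolons without a following space are accepted.
        System.out.println(isSafeForTitle("ST 27:1;O;Hex;FA 14:0"));
        // A semicolon followed by a space is rejected.
        System.out.println(isSafeForTitle("foo; bar"));
    }
}
```

With this check, names such as the LipidBlast one stay valid, and only names that would collide with the title separator are refused.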

meowcat commented 3 years ago

Thanks! Now looking at the code, I see where the problem comes from. Would it be much overhead to use a regex instead? This should work because the first capture group is greedy, so only the last two semicolons act as separators (I don't know the Java syntax off the top of my head):

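library(tidyverse)  # loads stringr, which provides str_match()
# Each (.*) is greedy, so the first group absorbs the semicolons inside the name
regex <- "(.*);(.*);(.*)"
name <- "ST 27:1;O;Hex;FA 14:0; LC-ESI-QTOF; MS2"
# Returns the full match followed by the three captured fields
str_match(name, regex)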
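For comparison, a rough Java equivalent of the same idea might look like the sketch below. It assumes a title with exactly three fields separated by "; ", as in the example above; real titles may carry more fields. The class and method names (`TitleSplitter`, `splitTitle`) are hypothetical and do not come from MassBank-web.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of the MassBank-web code base.
final class TitleSplitter {

    // The first group is greedy: it takes as much as it can, so only the
    // last two "; " separators are treated as field boundaries.
    private static final Pattern TITLE = Pattern.compile("^(.*); (.*); (.*)$");

    // Splits a three-field title into name and the two trailing fields.
    static String[] splitTitle(String title) {
        Matcher m = TITLE.matcher(title);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a three-field title: " + title);
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = splitTitle("ST 27:1;O;Hex;FA 14:0; LC-ESI-QTOF; MS2");
        for (String part : parts) {
            System.out.println(part);
        }
    }
}
```

Unlike the R pattern, this variant includes the space after each separator, so the captured fields come back without leading whitespace.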

meier-rene commented 3 years ago

This issue is solved. Records that could cause problems in the web app will no longer pass the Validator. The visualization is fixed with my commit.