MassBank / MassBank-data

Official repository of open data MassBank records

Duplicated records #53

Open Treutler opened 5 years ago

Treutler commented 5 years ago

I just stumbled over two records that seem to be duplicates. The metadata and the spectrum are exactly the same:

https://massbank.eu/MassBank/RecordDisplay.jsp?id=TY000228&dsn=Univ_Toyama
https://massbank.eu/MassBank/RecordDisplay.jsp?id=TY000237&dsn=Univ_Toyama

It may be worth searching MassBank globally for such cases. I guess we will have to contact the contributors in any case.

How should we tackle this? I suggest introducing a "DEPRECATED" tag for records that are duplicated (this issue), noisy (e.g. #51), or otherwise erroneous (#9).

schymane commented 5 years ago

Yes to a DEPRECATED tag. I think this will help us keep the record IDs live while communicating, beyond COMMENT, that there is an issue. If we hide this in COMMENT tags, the information will get lost, since several records already have several COMMENTs.

We should do a global check for duplicates. I found some UF cases that are likely duplicates too: the Butylparaben UF4158 records and UF4234 records? I did not do a 1:1 match, but they were flagged by Oberacher and have identical "scores" in his results. Maybe you can check by SPLASH?
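
A global check by SPLASH could look roughly like the sketch below. It assumes each record is a plain-text file containing an `ACCESSION:` line and a `PK$SPLASH:` line, and that the records sit as `.txt` files under a local checkout of the repository. The function names and the sample records are made up for illustration; this is not an existing MassBank tool. An identical SPLASH only says the spectra match, so every group it reports is a candidate that still needs a manual look at the metadata.

```python
from collections import defaultdict
from pathlib import Path


def parse_record(text):
    """Return (accession, splash) found in one record; either may be None."""
    accession = splash = None
    for line in text.splitlines():
        if line.startswith("ACCESSION:"):
            accession = line.split(":", 1)[1].strip()
        elif line.startswith("PK$SPLASH:"):
            splash = line.split(":", 1)[1].strip()
    return accession, splash


def find_duplicate_groups(record_texts):
    """Group accessions by SPLASH and keep only groups with more than one record."""
    by_splash = defaultdict(list)
    for text in record_texts:
        accession, splash = parse_record(text)
        if accession and splash:  # skip records missing either field
            by_splash[splash].append(accession)
    return {s: sorted(ids) for s, ids in by_splash.items() if len(ids) > 1}


def load_records(root):
    """Yield the text of every .txt file below root (assumed repository layout)."""
    for path in sorted(Path(root).rglob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="replace")


# Made-up records with placeholder SPLASH values, only to show the grouping.
SAMPLE_RECORDS = [
    "ACCESSION: XX000001\nPK$SPLASH: splash10-placeholder-a\n//\n",
    "ACCESSION: XX000002\nPK$SPLASH: splash10-placeholder-a\n//\n",
    "ACCESSION: XX000003\nPK$SPLASH: splash10-placeholder-b\n//\n",
]

if __name__ == "__main__":
    for splash, accessions in find_duplicate_groups(SAMPLE_RECORDS).items():
        print(splash, ", ".join(accessions))
```

To run it over a real checkout, `find_duplicate_groups(load_records("MassBank-data"))` would be the call, with the directory name adjusted to wherever the repository was cloned.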

I am going to post some validation suggestions on the Validator issue shortly.