MassBank / MassBank-data

Official repository of open data MassBank records
68 stars 55 forks source link

Request merge of BfG Spectra to MassBank dev #245

Closed ksjewell closed 8 months ago

ksjewell commented 8 months ago

These are all current spectra from the BfG in the current library. The spectra were processed with RMassBank (without mass recalibration, since this did not increase accuracy) and then loaded to an internal SQLite DB. Here they are further curated. They are then exported from the internal SQLite DB using the Spectra package and MsBackendMassBank.

Naming: The assension number is made up of the date of export and the ID from the internal SQLite DB (to avoid any duplications). CSL stands for collective spectral library, the name of the internal SQLite DB. But in this commit I only included the BfG spectra.

I also changed the list of contributors to include the BfG.

sneumann commented 8 months ago

Hi Kevin, thanks for the PR, we'll have a look at it and come back to you. Yours, Steffen

meowcat commented 8 months ago

Hi Kevin, 0) this is not official feedback since I am not René or Steffen 1) thanks for this awesome block of data 2) on a first glance, I see COMMENT: CONFINDENCE (instead of COMMENT: CONFIDENCE). Now this is not a strictly regulated subtag but it would be nice to have it lined up with existing data (and it's even in the record format) 3) There's an extra space after AUTHORS 4) Is it possible to use CC licenses instead of whatever "dl-de" is? That's what's currently required on massbank.eu, afaik. https://github.com/MassBank/MassBank-web/blob/dev/Documentation/MassBankRecordFormat.md#2.1.5 @meier-rene opinion?

ksjewell commented 8 months ago

Thanks for the corrections! I corrected the script on points 1 to 3 and I will update my branch.

Regarding point 4 I will wait for René's opinion. It is okay to change it, but if it doesn't matter we would prefer dl-de/by-2-0.

schymane commented 8 months ago

Is this it? https://www.govdata.de/dl-de/by-2-0 If we go with this, is it possible to hyperlink to the license, since not everyone may be familiar with it?

ksjewell commented 8 months ago

Yes, that's the one. Sure, how do I turn it into a hyperlink? Should it look like this?: LICENSE: (dl-de/by-2-0)[www.govdata.de/dl-de/by-2-0]

meier-rene commented 8 months ago

Hi, just a little comment on the license issue: As already pointed out, we encourage people to use a open license from the CC. We were not aware of the dl-de/by-2-0. This one looks also very liberal and is of course ok. You don't need to do anything about the link to the license. Now that I know about it, I will add that link to the web app and it will add that link to the html representation of your records. Please just stick with "LICENSE: dl-de/by-2-0". All you need to do, @ksjewell, is to solve the other issues mentioned in comment 3 and push it. I will than happily merge your data.

And now a little bit more general comment about licensing and acquiring a copyright to data. Our law experts in the NFDI4Chem tell us, that it is not possible to hold the copyright on measurement data at all. Only if its arranged or evaluated a copyright can be hold on the process. So I expect that all data an MassBank can not be subject to a copyright of anyone. Nevertheless we still have this LICENSE field in our data format. Its there because it has been for a long time, the inventors of the format were propably not aware of the fact that data is not copyrightable and we don't push to remove that. It also forces contributors to think about the fact, that their data might get compiled and reused in a different place. We think the "reuse" is also a purpose of this collection of data. But if contributor provide their data with a open license with the properties " ShareAlike" and " Attribution" and a consumer uses the data and does not follow this rules it will be probably impossible to enforce that.

So as a bottom line: The LICENSE tag might be superfluous and is a bit misleading but there was no initiative to remove that. It has no legal consequences for any consumer of the data and can only be considered as a wish.

meowcat commented 8 months ago

@meier-rene Cool, I'm happy that the licensing will not be in the way of this!

I think then you'll have to fix this in https://github.com/MassBank/MassBank-web/blob/dev/MassBank-Project/MassBank-lib/src/main/resources/recordformat/license.ini since the licenses are checked during record validation: https://github.com/MassBank/MassBank-web/blob/18d97bd21632e0c5c521ec4b19f7099200007807/MassBank-Project/MassBank-lib/src/main/java/massbank/RecordParserDefinition.java#L423-L451

meier-rene commented 8 months ago

I took the time to look a bit closer to the data. I was not aware of the size of the contribution. Thank you for your effort!

But: All records miss the SMILES, but you provide the InChi for nearly all records. So the information is there. Nevertheless we require the SMILES in the record.

My question is now: Do you want to fix your pipeline, because you want to contribute more data in the future? Or do you just want to get the data deposited as fast as possible and never want to touch it again? :wink: If you want a working pipeline you need to make sure that the SMILES are also in the record file. Otherwise you can ask me to try to add that line from the given InChI.

schymane commented 8 months ago

Adding SMILES from InChI is non-ideal, since it makes some tautomer assumptions that are not always true to the original SMILES. If they are not added, this will break our MassBank / PubChem integration...

schymane commented 8 months ago

But I see SMILES in BFG000001?

ksjewell commented 8 months ago

Thanks for all the help! I'll update the branch now. @meier-rene, I have the SMILES in the original database, so I will just copy them from there (these were the original values used in RMassbank).