bmrb-io / nmr-star-dictionary

The official NMR-STAR dictionary in use by the BMRB.
3 stars 3 forks source link

Dictionary: Data item starting with Data should be quoted (possibly) #65

Open epeisach opened 3 years ago

epeisach commented 3 years ago

I am not sure if STAR has this requirement - but cif does. loop,, data, save etc. are case insensitive.

_Category.Sf_framecode Data_set

I believe Dataset needs to be quoted as Data is interpreted as the start of a new data block.

pynmrstar checks for case insensitive for the start of the file

if not self.token.lower().startswith("data_"):

so it appears that case sensitivity is there. It is just the used of RESERVED_TOKENS in which it is case sensitive.

So - I am not sure what the NMR* DDL allows for here - but quoting would be safest.

dmaziuk commented 3 years ago

The .dic file was generated from Excel spreadsheet by some VB/VBA scripts that are not part of our "standard" toolchain. It's lacking many things we have in NMR-STAR outputters, looks like one of them is quoting reserved words. (You're finding the others.)

As for case-sensitivity, tag names map to DB table.column names and some of them are reserved SQL words, so we have to quote them anyway because of that. That has case-sensitivity as the side-effect: they come out of the database in the same camel case they went in. We've discussed that at one point and I even have a script to check the tags against SQL reserved words, but we ended up leaving things as is. (Part of the problem is dep. on whose SQL server you check against, you may end up having to rename a lot of tags.) As a result, tag names are not case-sensitive in principle, but IRL they are.

But data, save, loop, stop are not case-sensitive in our parsers and should be printed out in lowercase by all outputters (that I know of).

jonwedell commented 3 years ago

I am not sure if STAR has this requirement - but cif does. loop,, data, save etc. are case insensitive. so it appears that case sensitivity is there. It is just the used of RESERVED_KEYWORDS in which it is case sensitive.

A recent update to PyNMR-STAR ensured that those keywords are treated case-insensitively - in most places. Looks like you've found the one place I missed. (The RESERVED_KEYWORDS definition is only used when checking for reserved keywords in places that data would expect to be found - actual keywords are accepted regardless of case, as you noticed for the data block parsing.)

I believe Dataset needs to be quoted as Data is interpreted as the start of a new data block.

Indeed, and Py-NMRSTAR does properly quote such values, regardless of case. (And since we're using it in the refactoring of the dictionary code, this issue should be resolved automatically.) So like the other bug, this should be fixed soon, and I'll close the ticket when it is.

>>> a['test'] =  'Data_set'
>>> a['test2'] = 'dataa_set'
>>> print(a)
save_test
   my_test.test   'Data_set'
   my_test.test2  dataa_set

save_
jonwedell commented 3 years ago

PyNMR-STAR is updated, but I don't plan to do a new release immediately just for this minor issue unless it's causing you issues, @epeisach .

Thanks again for your great bug reports.