bmrb-io / PyNMRSTAR

A Python module for reading, writing, and manipulating NMR-STAR files.
MIT License

read rows in loops as a dictionary #111

Closed: varioustoxins closed 8 months ago

varioustoxins commented 2 years ago

I have this function, which makes working with rows in loops much easier. Would it be possible to add it as a method on Loop, or am I missing the right idiom for dealing with loops?

from pynmrstar import Loop

def _loop_row_dict_iter(loop: Loop):
    # Yield each row as a dict keyed by lower-cased tag name
    for row in loop:
        yield {tag.lower(): value for tag, value in zip(loop.tags, row)}

To be clear, I was doing things like

loop = frame.loops[0]
tag_index_x = loop.tag_index('x')
for row in loop:
    x = row[tag_index_x]

which seems clunky.
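The two styles can be compared side by side without a file. This is a minimal sketch that assumes a stand-in `FakeLoop` class (hypothetical, not part of PyNMRSTAR) mimicking only what the snippets above rely on: a `tags` list and iteration over rows.

```python
from typing import Dict, Iterator, List


class FakeLoop:
    """Hypothetical stand-in for a loop: a list of tags plus rows of string values."""

    def __init__(self, tags: List[str], data: List[List[str]]):
        self.tags = tags
        self.data = data

    def __iter__(self) -> Iterator[List[str]]:
        return iter(self.data)


def loop_row_dict_iter(loop) -> Iterator[Dict[str, str]]:
    """Yield each row as a dict keyed by lower-cased tag name."""
    for row in loop:
        yield {tag.lower(): value for tag, value in zip(loop.tags, row)}


loop = FakeLoop(["Ordinal", "Family_name"], [["1", "Cornilescu"], ["3", "Hadley"]])

# Index-based access: look the column up once, then index every row.
name_index = loop.tags.index("Family_name")
names_by_index = [row[name_index] for row in loop]

# Dict-based access: no index bookkeeping in the calling code.
names_by_key = [row["family_name"] for row in loop_row_dict_iter(loop)]
```

Both lists hold the same values; the difference is only in how much column bookkeeping the calling code has to carry.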

dmaziuk commented 2 years ago

If you're just reading a file looking for a specific tag, SAS has a separate parser for this: one that returns tag-value pairs.

I think if you wanted to Do It Right(tm) you'd add a Row class to the loop -- like sqlite3 does it: optional via row_factory. Jon?
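For readers unfamiliar with the sqlite3 pattern mentioned above, here is a short sketch using only the Python standard library. The table name and columns are made up for illustration; the point is that name-based row access is opt-in through `row_factory`, while the default stays a plain tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE author (ordinal INTEGER, family_name TEXT)")
conn.executemany(
    "INSERT INTO author VALUES (?, ?)",
    [(1, "Cornilescu"), (3, "Hadley")],
)

# Default behaviour: rows come back as plain tuples.
plain_rows = conn.execute("SELECT * FROM author ORDER BY ordinal").fetchall()

# Opt in to name-based access by setting a row factory on the connection.
conn.row_factory = sqlite3.Row
named_rows = conn.execute("SELECT * FROM author ORDER BY ordinal").fetchall()
family_names = [row["family_name"] for row in named_rows]
conn.close()
```

A Row class on Loop along these lines would keep existing list-based code working while letting callers choose dict-like rows.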

varioustoxins commented 2 years ago

"If you're just reading a file looking for specific tag, SAS has a separate parser for this" - not quite sure I follow the SAS reference

jonwedell commented 2 years ago

Here is something which exists already and I believe meets your needs:

>>> a = pynmrstar.Entry.from_database(15000)
>>> l = a[0][0]
>>> l.get_tag(['ordinal', 'family_name'], dict_result=True)
[{'Ordinal': '1', 'Family_name': 'Cornilescu'}, {'Ordinal': '2', 'Family_name': 'Cornilescu'}, {'Ordinal': '3', 'Family_name': 'Hadley'}, {'Ordinal': '4', 'Family_name': 'Gellman'}, {'Ordinal': '5', 'Family_name': 'Markley'}]

The only difference is that you need to specify which tags you want. The presumption here is that you don't need the dictionary to contain values your code doesn't know how to handle, since presumably they would just be ignored anyway. It also raises an exception if you ask for a tag that isn't present, which makes it clear right away where an issue is, versus getting a generator of dictionaries and only realizing later that a particular key is missing when a KeyError is raised.

If you really do need all the tags in the dictionary though, you could do the following:

l.get_tag(l.tags, dict_result=True)

Though that is admittedly a little clunky looking, it will work fine.
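The fail-fast behaviour described above can be illustrated without PyNMRSTAR. The sketch below uses a hypothetical helper, `select_tags`, which is not the library's implementation. The exception type PyNMRSTAR actually raises is not shown in this thread, so `ValueError` here is an assumption.

```python
from typing import Dict, List


def select_tags(tags: List[str], rows: List[List[str]], wanted: List[str]) -> List[Dict[str, str]]:
    """Return one dict per row containing only the wanted tags.

    Tag names are matched case-insensitively; a missing tag raises
    ValueError before any row is processed.
    """
    positions = {tag.lower(): index for index, tag in enumerate(tags)}
    missing = [name for name in wanted if name.lower() not in positions]
    if missing:
        raise ValueError(f"Tags not present in loop: {missing}")
    # Keys use the capitalization found in the tag list, as in the output above.
    columns = [(tags[positions[name.lower()]], positions[name.lower()]) for name in wanted]
    return [{tag: row[index] for tag, index in columns} for row in rows]


tags = ["Ordinal", "Name", "Value", "Description"]
rows = [
    ["1", "first_thing", "1.2", "something very important"],
    ["2", "second_thing", "1.99", "ignore this"],
]

selected = select_tags(tags, rows, ["ordinal", "name"])

# Asking for a tag that is not in the loop fails immediately.
try:
    select_tags(tags, rows, ["ordinal", "verified"])
except ValueError as error:
    missing_message = str(error)
```

The error surfaces at the call site rather than later, deep inside whatever code consumes the rows.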

dmaziuk commented 2 years ago

> "If you're just reading a file looking for specific tag, SAS has a separate parser for this" - not quite sure I follow the sas reference

https://github.com/bmrb-io/SAS -- there is a python3 branch that passes basic tests.

jonwedell commented 2 years ago

> Here is something which exists already and I believe meets your needs:

Though looking at this further, it really should return the tag names with the same capitalization you use to query them. Otherwise if your specified capitalization doesn't match the file, you'll run into an annoying discrepancy. I'll look into updating this code.

Does this meet your needs? If not, if you describe exactly what sort of operation you're performing on the loop as you iterate through the rows, I may be able to provide an idiomatic way to do it.

jonwedell commented 8 months ago

One other thing I realized I didn't mention on this issue before is that Loop.get_tag() can take None as the list of tags which means "all tags", so you can get a list of dictionaries for the loop tags via

Loop.get_tag(dict_result=True)

varioustoxins commented 8 months ago

My version is a bit friendlier for me, as I have the basic conversions I need built in (str->int, str->float) and can do a row at a time rather than slurp the whole lot (though who cares these days, SOOOO much memory). I had forgotten I had sent my jiffy in before, and #124 uses it...
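A row-at-a-time iterator with built-in conversions might look like the sketch below. It is not the code from #124: the function name `typed_rows` and the `converters` mapping are hypothetical, and mapping the NMR-STAR null markers `.` and `?` to `None` is an assumption about the desired behaviour.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List


def typed_rows(
    tags: List[str],
    rows: Iterable[List[str]],
    converters: Dict[str, Callable[[str], Any]],
) -> Iterator[Dict[str, Any]]:
    """Lazily yield one dict per row, converting the columns listed in converters."""
    for row in rows:
        record: Dict[str, Any] = {}
        for tag, value in zip(tags, row):
            key = tag.lower()
            if value in (".", "?"):
                # Null / unknown markers become None instead of being converted.
                record[key] = None
            elif key in converters:
                record[key] = converters[key](value)
            else:
                record[key] = value
        yield record


tags = ["Ordinal", "Name", "Value"]
rows = iter([["1", "first_thing", "1.2"], ["2", "second_thing", "."]])

records = typed_rows(tags, rows, {"ordinal": int, "value": float})
first = next(records)   # only the first row has been consumed at this point
second = next(records)
```

Because `typed_rows` is a generator, nothing is read from `rows` until a record is requested, which is the row-at-a-time behaviour described above.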

NB: one other question. If I want to build my own schema for validation, how do I do that, given that PyNMRSTAR doesn't read mmCIF dictionaries? Is there another tool I need? (Happy to open a separate issue.)

regards Gary

dmaziuk commented 8 months ago

On 3/6/24 11:34, varioustoxins wrote:

> NB: one other question. If I want to build my own schema for validation, how do I do that, given that PyNMRSTAR doesn't read mmCIF dictionaries? Is there another tool I need? (Happy to open a separate issue.)

https://github.com/bmrb-io/SAS has both mmcif and "ddl" (for pdbx .dic file) parsers. You'll likely need to hack the ddl one to work with your file, but it is relatively straightforward.

Dimitri

jonwedell commented 7 months ago

@varioustoxins - To be honest, while I had written the code to support different versions of the BMRB schema, I hadn't put much work into generic schema handling as it wasn't relevant. To wit, PyNMR-STAR loads a CSV used internally to generate the BMRB DDL rather than the BMRB DDL.

I just made a new release to improve the support of other schemas. There is still a caveat that you'll need to write it in CSV format rather than DDL, but I have attached an example here showing how straightforward it would be to convert your schema into CSV to use with PyNMR-STAR.

With 3.3.4:

Example loop file:

loop_
 _Test.Ordinal
 _Test.Name
 _Test.Value
 _Test.Description

 1 first_thing 1.2 'something very important'
 2 second_thing 1.99 'ignore this'
 3 way_too_long_of_name 3 'cannot be this long'
stop_

Example dictionary:

Dictionary sequence,Tag,Data Type,BMRB data type,Loopflag,Nullable,public,SFCategory,ADIT category view type
TBL_BEGIN,,,,,,,,v.1
10,_Test.Ordinal,INTEGER,int,Y,,Y,_Test,
20,_Test.Name,VARCHAR(12),code,Y,NOT NULL,Y,_Test,
30,_Test.Value,FLOAT,float,Y,NOT NULL,Y,_Test,
40,_Test.Description,TEXT,text,Y,,Y,_Test,
50,_Test.Verified,CHAR(3),yes_no,Y,NOT NULL,Y,_Test,
60,_Test.Internal,TEXT,line,Y,,I,_Test,
TBL_END,,,,,,,,

>>> import pynmrstar
>>> s = pynmrstar.Schema('schema.csv')
>>> l = pynmrstar.Loop.from_file('example.test', schema=s, convert_data_types=True)
>>> l.data
[[1, 'first_thing', Decimal('1.2'), 'something very important'], [2, 'second_thing', Decimal('1.99'), 'ignore this'], [3, 'way_too_long_of_name', Decimal('3'), 'cannot be this long']]
>>> print(s)
BMRB schema from: 'schema.csv' version 'v.1'

  Tag_Prefix  Tag   Type  Null_Allowed SF_Category

_Test                         
  Ordinal     INTEGER     True   _Test
  Name        VARCHAR(12) False  _Test
  Value       FLOAT       False  _Test
  Description TEXT        True   _Test
  Verified    CHAR(3)     False  _Test
  Internal    TEXT        True   _Test
>>> l.validate(schema=s)
["Length of '20' is too long for 'VARCHAR(12)': '_Test.Name':'way_too_long_of_name'."]

The example shows that you can validate using the specified dictionary and that, when you combine it with the convert_data_types=True argument while parsing an Entry/Saveframe/Loop, the data types are also converted automatically according to the specified schema. Type conversion has been present for a long time, but 3.3.4 lets you use a custom schema when parsing, which wasn't previously supported.

Attachments: schema.csv (https://github.com/bmrb-io/PyNMRSTAR/files/14593398/schema.csv), example.txt (https://github.com/bmrb-io/PyNMRSTAR/files/14593399/example.txt)
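Converting an existing schema into this CSV layout can be scripted. The following sketch writes the same column layout as the example dictionary using Python's `csv` module. The input structure (a list of dicts with keys such as `tag` and `data_type`) and the function name are hypothetical, and setting the `public` column to `Y` for every tag is an assumption.

```python
import csv
import io

HEADER = [
    "Dictionary sequence", "Tag", "Data Type", "BMRB data type", "Loopflag",
    "Nullable", "public", "SFCategory", "ADIT category view type",
]


def build_schema_csv(tag_definitions, version="v.1"):
    """Render tag definitions as schema CSV text in the layout shown above."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    # Second row: TBL_BEGIN marker with the version in the last column.
    writer.writerow(["TBL_BEGIN"] + [""] * 7 + [version])
    for position, definition in enumerate(tag_definitions, start=1):
        writer.writerow([
            position * 10,                      # gaps of 10 leave room for later insertions
            definition["tag"],
            definition["data_type"],
            definition["bmrb_type"],
            "Y" if definition["in_loop"] else "N",
            "" if definition["nullable"] else "NOT NULL",
            "Y",                                # public flag (assumed Y for every tag)
            definition["category"],
            "",                                 # ADIT category view type left empty
        ])
    writer.writerow(["TBL_END"] + [""] * 8)
    return buffer.getvalue()


schema_text = build_schema_csv([
    {"tag": "_Test.Ordinal", "data_type": "INTEGER", "bmrb_type": "int",
     "in_loop": True, "nullable": True, "category": "_Test"},
    {"tag": "_Test.Name", "data_type": "VARCHAR(12)", "bmrb_type": "code",
     "in_loop": True, "nullable": False, "category": "_Test"},
])
```

Writing `schema_text` to a file would give something that can be passed to `pynmrstar.Schema(...)` as in the session above, provided the data types used are ones the library supports.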

varioustoxins commented 6 months ago

Hi Jon

Thank you so much for the reply and sorry about the slow reply / comments

Comments below and one more question

Where can I get support on using the BMRB web API? I had some questions…

For example, can I list all shift lists that contain CA C N CB* shifts without also downloading all the data?

Regards Gary

Dr Gary S Thompson
NMR Facility Manager, CCPN CoI & Working Group Member
Wellcome Trust Biomolecular NMR Facility
School of Biosciences, Division of Natural Sciences
University of Kent, Canterbury, Kent, England, CT2 7NZ

☎: 01227 82 7117 ✉️: @.*** ORCID: orcid.org/0000-0001-9399-7636

On 13 Mar 2024, at 20:06, Jon Wedell @.***> wrote:

> @varioustoxins - To be honest, while I had written the code to support different versions of the BMRB schema, I hadn't put much work into generic schema handling as it wasn't relevant. To wit, PyNMR-STAR loads a CSV used internally to generate the BMRB DDL rather than the BMRB DDL.

;-)

> I just made a new release to improve the support of other schemas. There is still a caveat that you'll need to write it in CSV format rather than DDL, but I have attached an example here showing how straightforward it would be to convert your schema into CSV to use with PyNMR-STAR.

Thank you!

OK, lots of questions here about the example dictionary!

  1. I presume the saveframe type is in SFCategory, so this is _Test in _Test.
  2. I presume sequence increases monotonically and the gaps of 10 are to allow updates? So does the sequence number matter across versions?
  3. What are the definitions for the data types and where are they defined? I presume this is a subset of SQL by the look of the thing, and the BMRB data type is a type alias.
  4. I presume loop flag is N for the metadata for a frame and at this point tag should be empty, but I am just guessing!
  5. I presume NULLABLE is non-mandatory but indicates if the value can be empty, i.e. ‘.’ is allowed, but again I am guessing.
  6. Not sure about public!
  7. I presume I can ignore the ADIT category? Or do I need to set it to v.1 for compatibility?

jonwedell commented 6 months ago

> 1. I presume the saveframe type is in SFCategory so this is _Test in _Test

Yes.

> 2. I presume sequence increases monotonically and the gaps of 10 are to allow updates? So does the sequence number matter across versions.

Exactly. The number is completely arbitrary; it just must increase from row to row. It can change from version to version without consequence.

> 3. What are the definitions for datatypes and where are they defined I presume this is a subset of SQL by the look of the thing and the BMRB datatype is a type alias

The supported Data Types: https://github.com/bmrb-io/PyNMRSTAR/blob/c84160cf024aeabeafa77394a8d629b620341d2d/pynmrstar/schema.py#L111

The BMRB Data Type: https://github.com/bmrb-io/PyNMRSTAR/blob/v3/pynmrstar/reference_files/data_types.csv

> 4. I presume loop flag is N for the metadata for a frame and at this point tag should be empty, but I am just guessing!

Y for tags that are part of a loop, N for tags that are part of a saveframe.

> 5. I presume NULLABLE is non mandatory but indicates if the value can be empty ie ‘.’ is allowed but again I am guessing

Indeed.

> 6. Not sure about public!

This is mainly internal: there are non-public tags that are stripped when NMR-STAR files are released. For your use case, it probably makes sense to set every tag to Y.

> 7. I presume I can ignore the ADIT category? Or do I need to set it to v.1 for compatibility

The rows can have null in the ADIT category column, but you should keep the v.1 (or whatever you want) in the second row; this is required.
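The structural rules from these answers can be checked before handing a schema CSV to PyNMRSTAR. The sketch below is a hypothetical pre-flight check, not part of the library, and the thread does not say whether PyNMRSTAR itself enforces each rule. It tests three things: the second row is TBL_BEGIN with a version in its last column, the sequence number increases from row to row, and Loopflag is Y or N.

```python
import csv
import io
from typing import List


def check_schema_csv(text: str) -> List[str]:
    """Return a list of problems found in a schema CSV (empty if none)."""
    rows = list(csv.reader(io.StringIO(text)))
    problems: List[str] = []
    # The second row carries the version in its last column.
    if len(rows) < 2 or not rows[1] or rows[1][0] != "TBL_BEGIN" or not rows[1][-1]:
        problems.append("second row must be TBL_BEGIN with a version in the last column")
    previous = None
    for row in rows[2:]:
        if not row:
            continue  # skip blank lines
        if row[0] == "TBL_END":
            break
        sequence = int(row[0])
        # The value itself is arbitrary, but it must increase from row to row.
        if previous is not None and sequence <= previous:
            problems.append(f"sequence {sequence} does not increase after {previous}")
        previous = sequence
        # Loopflag: Y for loop tags, N for saveframe tags.
        if row[4] not in ("Y", "N"):
            problems.append(f"{row[1]}: Loopflag must be Y or N")
    return problems


example = (
    "Dictionary sequence,Tag,Data Type,BMRB data type,Loopflag,Nullable,public,SFCategory,ADIT category view type\n"
    "TBL_BEGIN,,,,,,,,v.1\n"
    "10,_Test.Ordinal,INTEGER,int,Y,,Y,_Test,\n"
    "20,_Test.Name,VARCHAR(12),code,Y,NOT NULL,Y,_Test,\n"
    "TBL_END,,,,,,,,\n"
)
problems = check_schema_csv(example)
```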

For your question about accessing chemical shift data, please send me an e-mail and I'll be happy to follow up with more information. I'd prefer to keep GitHub issues for bugs/feature requests.

Cheers, Jon