cgoliver / rnaglib

Datasets and analysis tools for RNA 3D and 2.5D structures.
https://rnaglib.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

including non-standard residues in fr3d2graph #7

Closed ghliu521 closed 3 days ago

ghliu521 commented 1 month ago

Including non-standard residues in the backbone of RNA #6

cgoliver commented 1 month ago

Thanks very much for this @ghliu521 !

Could you post a snippet for testing the function, with the expected output?

Could you also give some details on the new argument XNALinking?

cgoliver commented 1 month ago

Thanks for this @ghliu521 !

Could you post a snippet with an example usage and expected output?

Also could you provide details on the new argument XNALinking?

ghliu521 commented 1 month ago

@cgoliver Sorry for the late reply; I have been rather busy recently. The XNA_Linking argument contains all the residues that I collect from one mmCIF file, including standard and non-standard monomers (nucleotides). Each one is either RNA linking or DNA linking, following the annotations used in fr3d-python.

I am still working on converting non-standard residues to standard ones when an mmCIF file is processed in the fr3d_2_graphs method. I hope to submit a new commit later.

ghliu521 commented 1 month ago

Hi @cgoliver, I have finished the conversion part and tested the fr3d_2_graphs method with the example file 1evv.cif:

from rnaglib.prepare_data import fr3d_to_graph

g = fr3d_to_graph("./rnaglib/data/1evv.cif") 

print(g.nodes(data=True))

The expected output: [('1evv.A.2', {'nt': 'C', 'xyz_P': [27.309999465942383, 5.218999862670898, 54.94300079345703]}), ('1evv.A.1', {'nt': 'G', 'xyz_P': [22.399999618530273, 4.864999771118164, 49.77799987792969]}), ('1evv.A.3', {'nt': 'G', 'xyz_P': [32.3390007019043, 4.709000110626221, 54.737998962402344]}), ('1evv.A.4', {'nt': 'G', 'xyz_P': [37.4109992980957, 3.0999999046325684, 51.72999954223633]}), ('1evv.A.5', {'nt': 'A', 'xyz_P': [40.880001068115234, 1.1269999742507935, 47.064998626708984]}), ('1evv.A.6', {'nt': 'U', 'xyz_P': [41.667999267578125, 0.6299999952316284, 41.60300064086914]}), ('1evv.A.7', {'nt': 'U', 'xyz_P': [40.01499938964844, 2.384999990463257, 36.176998138427734]}), ('1evv.A.8', {'nt': 'U', 'xyz_P': [36.57699966430664, 4.374000072479248, 31.006000518798828]}), ('1evv.A.9', {'nt': 'A', 'xyz_P': [37.41299819946289, 5.688000202178955, 26.253999710083008]}), ('1evv.A.10', {'nt': 'G', 'xyz_P': [34.18199920654297, 6.440999984741211, 21.229000091552734]}), ('1evv.A.11', {'nt': 'C', 'xyz_P': [32.05500030517578, 1.4620000123977661, 23.9060001373291]}), ('1evv.A.12', {'nt': 'U', 'xyz_P': [33.62099838256836, -3.236999988555908, 26.913000106811523]}), ('1evv.A.13', {'nt': 'C', 'xyz_P': [38.17100143432617, -4.864999771118164, 30.31100082397461]}), ('1evv.A.14', {'nt': 'A', 'xyz_P': [44.20199966430664, -2.2860000133514404, 32.428001403808594]}), ('1evv.A.15', {'nt': 'G', 'xyz_P': [49.26499938964844, 0.4020000100135803, 33.474998474121094]}), ('1evv.A.16', {'nt': 'U', 'xyz_P': [51.1510009765625, 5.354000091552734, 35.14400100708008]}), ('1evv.A.17', {'nt': 'U', 'xyz_P': [57.685001373291016, 7.339000225067139, 35.29199981689453]}), ('1evv.A.18', {'nt': 'G', 'xyz_P': [54.83300018310547, 12.25100040435791, 38.327999114990234]}), ('1evv.A.19', {'nt': 'G', 'xyz_P': [54.40700149536133, 14.854000091552734, 32.82899856567383]}), ('1evv.A.20', {'nt': 'G', 'xyz_P': [51.70500183105469, 17.384000778198242, 27.05299949645996]}), ('1evv.A.21', {'nt': 'A', 'xyz_P': 
[49.707000732421875, 16.444000244140625, 22.808000564575195]}), ('1evv.A.22', {'nt': 'G', 'xyz_P': [48.67499923706055, 10.680999755859375, 20.78700065612793]}), ('1evv.A.23', {'nt': 'A', 'xyz_P': [48.25, 4.244999885559082, 18.85700035095215]}), ('1evv.A.24', {'nt': 'G', 'xyz_P': [48.09299850463867, -1.1720000505447388, 17.277000427246094]}), ('1evv.A.25', {'nt': 'C', 'xyz_P': [45.14500045776367, -5.419000148773193, 15.579999923706055]}), ('1evv.A.26', {'nt': 'G', 'xyz_P': [39.678001403808594, -6.498000144958496, 13.46500015258789]}), ('1evv.A.27', {'nt': 'C', 'xyz_P': [35.3849983215332, -4.3429999351501465, 10.76099967956543]}), ('1evv.A.28', {'nt': 'C', 'xyz_P': [34.51300048828125, -0.6380000114440918, 6.425000190734863]}), ('1evv.A.29', {'nt': 'A', 'xyz_P': [36.83000183105469, 1.694000005722046, 1.7419999837875366]}), ('1evv.A.30', {'nt': 'G', 'xyz_P': [41.66600036621094, 2.3469998836517334, -2.0230000019073486]}), ('1evv.A.31', {'nt': 'A', 'xyz_P': [46.500999450683594, 0.777999997138977, -4.408999919891357]}), ('1evv.A.32', {'nt': 'C', 'xyz_P': [50.06100082397461, -3.171999931335449, -5.831999778747559]}), ('1evv.A.33', {'nt': 'U', 'xyz_P': [49.24300003051758, -8.814000129699707, -7.105000019073486]}), ('1evv.A.34', {'nt': 'G', 'xyz_P': [45.92900085449219, -11.723999977111816, -9.935999870300293]}), ('1evv.A.35', {'nt': 'A', 'xyz_P': [41.44200134277344, -8.067000389099121, -9.142999649047852]}), ('1evv.A.36', {'nt': 'A', 'xyz_P': [38.82099914550781, -7.609000205993652, -4.585000038146973]}), ('1evv.A.37', {'nt': 'G', 'xyz_P': [37.928001403808594, -8.086000442504883, 1.4240000247955322]}), ('1evv.A.38', {'nt': 'A', 'xyz_P': [40.00299835205078, -10.645999908447266, 6.697999954223633]}), ('1evv.A.39', {'nt': 'U', 'xyz_P': [44.86899948120117, -10.86299991607666, 9.282999992370605]}), ('1evv.A.40', {'nt': 'C', 'xyz_P': [50.10900115966797, -7.948999881744385, 10.151000022888184]}), ('1evv.A.41', {'nt': 'U', 'xyz_P': [52.98400115966797, -2.7179999351501465, 
9.538000106811523]}), ('1evv.A.42', {'nt': 'G', 'xyz_P': [52.301998138427734, 2.5230000019073486, 9.652000427246094]}), ('1evv.A.43', {'nt': 'G', 'xyz_P': [48.93299865722656, 7.697999954223633, 10.29800033569336]}), ('1evv.A.44', {'nt': 'A', 'xyz_P': [44.564998626708984, 11.277999877929688, 11.269000053405762]}), ('1evv.A.45', {'nt': 'G', 'xyz_P': [39.209999084472656, 12.86400032043457, 13.737000465393066]}), ('1evv.A.46', {'nt': 'G', 'xyz_P': [36.58700180053711, 12.184000015258789, 18.27199935913086]}), ('1evv.A.47', {'nt': 'U', 'xyz_P': [34.24800109863281, 12.630000114440918, 23.777000427246094]}), ('1evv.A.48', {'nt': 'C', 'xyz_P': [37.957000732421875, 14.699000358581543, 28.3799991607666]}), ('1evv.A.49', {'nt': 'C', 'xyz_P': [38.202999114990234, 9.496000289916992, 33.51100158691406]}), ('1evv.A.50', {'nt': 'U', 'xyz_P': [35.06800079345703, 13.907999992370605, 32.784000396728516]}), ('1evv.A.51', {'nt': 'G', 'xyz_P': [33.58100128173828, 19.033000946044922, 34.34299850463867]}), ('1evv.A.52', {'nt': 'U', 'xyz_P': [34.474998474121094, 23.722999572753906, 37.18199920654297]}), ('1evv.A.53', {'nt': 'G', 'xyz_P': [38.09299850463867, 26.700000762939453, 40.76900100708008]}), ('1evv.A.54', {'nt': 'U', 'xyz_P': [43.41699981689453, 27.26799964904785, 43.895999908447266]}), ('1evv.A.55', {'nt': 'U', 'xyz_P': [48.665000915527344, 27.288000106811523, 44.04499816894531]}), ('1evv.A.56', {'nt': 'C', 'xyz_P': [52.641998291015625, 28.167999267578125, 40.48899841308594]}), ('1evv.A.57', {'nt': 'G', 'xyz_P': [49.132999420166016, 27.39699935913086, 36.395999908447266]}), ('1evv.A.58', {'nt': 'A', 'xyz_P': [46.52899932861328, 22.816999435424805, 33.9640007019043]}), ('1evv.A.59', {'nt': 'U', 'xyz_P': [45.262001037597656, 16.645000457763672, 31.763999938964844]}), ('1evv.A.60', {'nt': 'C', 'xyz_P': [44.03099822998047, 14.491999626159668, 36.814998626708984]}), ('1evv.A.61', {'nt': 'C', 'xyz_P': [47.071998596191406, 10.5649995803833, 41.332000732421875]}), ('1evv.A.62', {'nt': 'A', 
'xyz_P': [45.50299835205078, 12.430999755859375, 46.25899887084961]}), ('1evv.A.63', {'nt': 'C', 'xyz_P': [41.483001708984375, 13.95199966430664, 49.82400131225586]}), ('1evv.A.64', {'nt': 'A', 'xyz_P': [35.66299819946289, 14.458999633789062, 50.85100173950195]}), ('1evv.A.65', {'nt': 'G', 'xyz_P': [30.697999954223633, 13.312000274658203, 49.316001892089844]}), ('1evv.A.66', {'nt': 'A', 'xyz_P': [26.774999618530273, 10.810999870300293, 45.45500183105469]}), ('1evv.A.67', {'nt': 'A', 'xyz_P': [24.660999298095703, 7.250999927520752, 41.51300048828125]}), ('1evv.A.68', {'nt': 'U', 'xyz_P': [24.909000396728516, 2.0460000038146973, 38.847999572753906]}), ('1evv.A.69', {'nt': 'U', 'xyz_P': [26.643999099731445, -3.322999954223633, 38.61899948120117]}), ('1evv.A.70', {'nt': 'C', 'xyz_P': [28.798999786376953, -7.763000011444092, 41.58000183105469]}), ('1evv.A.71', {'nt': 'G', 'xyz_P': [29.988000869750977, -10.722999572753906, 46.59299850463867]}), ('1evv.A.72', {'nt': 'C', 'xyz_P': [29.770999908447266, -10.899999618530273, 52.994998931884766]}), ('1evv.A.73', {'nt': 'A', 'xyz_P': [27.159000396728516, -9.340999603271484, 57.689998626708984]}), ('1evv.A.74', {'nt': 'C', 'xyz_P': [22.25200080871582, -7.395999908447266, 59.90800094604492]}), ('1evv.A.75', {'nt': 'C', 'xyz_P': [16.673999786376953, -6.479000091552734, 59.479000091552734]}), ('1evv.A.76', {'nt': 'A', 'xyz_P': [12.437999725341797, -9.119000434875488, 57.81999969482422]})]

The older output missed the modified residues, such as 1evv.A.10, 1evv.A.26, 1evv.A.40, 1evv.A.46, 1evv.A.49, 1evv.A.54, 1evv.A.55, and 1evv.A.58.
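A minimal sketch of how the presence of the modified residues could be checked. It works on a plain list of (node ID, attributes) pairs of the kind printed by g.nodes(data=True), so it does not need rnaglib installed. The helper name missing_nodes is hypothetical, and the node list is a shortened stand-in for the real output, not the full graph.

```python
def missing_nodes(nodes, expected_ids):
    """Return the expected node IDs that do not appear in `nodes`.

    `nodes` is a list of (node_id, attributes) pairs, the same shape
    as list(g.nodes(data=True)) for a networkx graph.
    """
    present = {node_id for node_id, _ in nodes}
    return [node_id for node_id in expected_ids if node_id not in present]


# Shortened stand-in for the graph nodes (coordinates omitted).
nodes = [
    ("1evv.A.9", {"nt": "A"}),
    ("1evv.A.10", {"nt": "G"}),
]

# Two of the modified residues listed in this comment.
modified = ["1evv.A.10", "1evv.A.26"]

print(missing_nodes(nodes, modified))  # prints ['1evv.A.26']
```

With the real graph, passing list(g.nodes(data=True)) and the eight IDs listed above should give an empty list if all modified residues are included.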

ghliu521 commented 1 month ago

In addition, I have provided the files and methods for converting non-standard residues to standard ones. Because the file components.cif exceeds 100 MB, my first push failed, so I had to push twice.

cgoliver commented 3 days ago

Hi @ghliu521 ! Sorry for the delay. Finally getting around to merging your pull request. It is working smoothly on my end. Big thank you again for this, it looks great!

Just had a question regarding the final encoding of the nucleotides to make sure I understand.

It seems that in the get_residue_list() function of fr3d_2_graphs.py the ID of the non-standard nucleotides is saved as nt[2:] so something like 'OMG' would be saved as just 'G'. Am I understanding this correctly? Do you have somewhere else where the original modification is stored so we don't lose that information? If not, I'll just create a new key called something like 'nt_full'.
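To make the question concrete, here is a minimal sketch of the encoding as described above, together with the proposed extra key. This is not the actual code of get_residue_list(): the function name encode_residue is hypothetical, and the rule that one-letter IDs are kept unchanged is an assumption made for the example.

```python
def encode_residue(code):
    """Build node attributes from a residue's component ID.

    Assumption: standard residues already carry a one-letter ID and are
    kept as-is; longer IDs are reduced with code[2:], as in the thread.
    """
    one_letter = code if len(code) == 1 else code[2:]
    # 'nt_full' keeps the original ID so the modification is not lost.
    return {"nt": one_letter, "nt_full": code}


print(encode_residue("OMG"))  # prints {'nt': 'G', 'nt_full': 'OMG'}
print(encode_residue("A"))    # prints {'nt': 'A', 'nt_full': 'A'}
```

Keeping both keys lets downstream code use the one-letter 'nt' while still being able to recover the modification from 'nt_full'.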

cgoliver commented 3 days ago

OK, understood it now. I added an extra key to the dictionary that preserves the original three-letter ID. Will merge.

ghliu521 commented 3 days ago

nt_full

Thanks for merging my PR. It's reasonable to keep the three-letter code of the nucleotides in the built graph.