Closed jamesdalg closed 1 year ago
The sequence is known for all chains. Is there some way to fix this with the fasta file (which gives the amino acid sequence)?
>5FLM_2|Chain B|DNA-DIRECTED RNA POLYMERASE II SUBUNIT RPB2|BOS TAURUS (9913)
MYDADEDMQYDEDDDEITPDLWQEACWIVISSYFDEKGLVRQQLDSFDEFIQMSVQRIVEDAPPIDLQAEAQHASGEVEEPPRYLLKFEQIYLSKPTHWERDGAPSPMMPNEARLRNLTYSAPLYVDITKTVIKEGEEQLQTQHQKTFIGKIPIMLRSTYCLLNGLTDRDLCELNECPLDPGGYFIINGSEKVLIAQEKMATNTVYVFAKKDSKYAYTGECRSCLENSSRPTSTIWVSMLARGGQGAKKSAIGQRIVATLPYIKQEVPIIIVFRALGFVSDRDILEHIIYDFEDPEMMEMVKPSLDEAFVIQEQNVALNFIGSRGAKPGVTKEKRIKYAKEVLQKEMLPHVGVSDFCETKKAYFLGYMVHRLLLAALGRRELDDRDHYGNKRLDLAGPLLAFLFRGMFKNLLKEVRIYAQKFIDRGKDFNLELAIKTRIISDGLKYSLATGNWGDQKKAHQARAGVSQVLNRLTFASTLSHLRRLNSPIGRDGKLAKPRQLHNTLWGMVCPAETPEGHAVGLVKNLALMAYISVGSQPSPILEFLEEWSMENLEEISPAAIADATKIFVNGCWVGIHKDPEQLMNTLRKLRRQMDIIVSEVSMIRDIREREIRIYTDAGRICRPLLIVEKQKLLLKKRHIDQLKEREYNNYSWQDLVASGVVEYIDTLEEETVMLAMTPDDLQEKEVAYCSTYTHCEIHPSMILGVCASIIPFPDHNQSPRNTYQSAMGKQAMGVYITNFHVRMDTLAHVLYYPQKPLVTTRSMEYLRFRELPAGINSIVAIASYTGYNQEDSVIMNRSAVDRGFFRSVFYRSYKEQESKKGFDQEEVFEKPTRETCQGMRHAIYDKLDDDGLIAPGVRVSGDDVIIGKTVTLPENEDELEGTNRRYTKRDCSTFLRTSETGIVDQVMVTLNQEGYKFCKIRVRSVRIPQIGDKFASRHGQKGTCGIQYRQEDMPFTCEGITPDIIINPHAIPSRMTIGHLIECLQGKVSANKGEIGDATPFNDAVNVQKISNLLSDYGYHLRGNEVLYNGFTGRKITSQIFIGPTYYQRLKHMVDDKIHSRARGPIQILNRQPMEGRSRDGGLRFGEMERDCQIAHGAAQFLRERLFEASDPYQVHVCNLCGIMAIANTRTHTYECRGCRNKTQISLVRMPYACKLLFQELMSMSIAPRMMSV
>5FLM_14|Chain N[auth P]|RNA, DNA-RNA ELONGATION SCAFFOLD|SYNTHETIC CONSTRUCT (32630)
UAUAUGCAUAAAGACCAGGC
>5FLM_1|Chain A|DNA-DIRECTED RNA POLYMERASE|BOS TAURUS (9913)
MHGGGPPSGDSACPLRTIKRVQFGVLSPDELKRMSVTEGGIKYPETTEGGRPKLGGLMDPRQGVIERTGRCQTCAGNMTECPGHFGHIELAKPVFHVGFLVKTMKVLRCVCFFCSKLLVDSNNPKIKDILAKSKGQPKKRLTHVYDLCKGKNICEGGEEMDNKFGVEQPEGDEDLTKEKGHGGCGRYQPRIRRSGLELYAEWKHVNEDSQEKKILLSPERVHEIFKRISDEECFVLGMEPRYARPEWMIVTVLPVPPLSVRPAVVMQGSARNQDDLTHKLADIVKINNQLRRNEQNGAAAHVIAEDVKLLQFHVATMVDNELPGLPRAMQKSGRPLKSLKQRLKGKEGRVRGNLMGKRVDFSARTVITPDPNLSIDQVGVPRSIAANMTFAEIVTPFNIDRLQELVRRGNSQYPGAKYIIRDNGDRIDLRFHPKPSDLHLQTGYKVERHMCDGDIVIFNRQPTLHKMSMMGHRVRILPWSTFRLNLSVTTPYNADFDGDEMNLHLPQSLETRAEIQELAMVPRMIVTPQSNRPVMGIVQDTLTAVRKFTKRDVFLERGEVMNLLMFLSTWDGKVPQPAILKPRPLWTGKQIFSLIIPGHINCIRTHSTHPDDEDSGPYKHISPGDTKVVVENGELIMGILCKKSLGTSAGSLVHISYLEMGHDITRLFYSNIQTVINNWLLIEGHTIGIGDSIADSKTYQDIQNTIKKAKQDVIEVIEKAHNNELEPTPGNTLRQTFENQVNRILNDARDKTGSSAQKSLSEYNNFKSMVVSGAKGSKINISQVIAVVGQQNVEGKRIPFGFKHRTLPHFIKDDYGPESRGFVENSYLAGLTPTEFFFHAMGGREGLIDTAVKTAETGYIQRRLIKSMESVMVKYDATVRNSINQVVQLRYGEDGLAGESVEFQNLATLKPSNKAFEKKFRFDYTNERALRRTLQEDLVKDVLSNAHIQNELEREFERMREDREVLRVIFPTGDSKVVLPCNLLRMIWNAQKIFHINPRLPSDLHPIKVVEGVKELSKKLVIVNGDDPLSRQAQENATLLFNIHLRSTLCSRRMAEEFRLSGEAFDWLLGEIESKFNQAIAHPGEMVGALAAQSLGEPATQMTLNTFHYAGVSAKNVTLGVPRLKELINISKKPKTPSLTVFLLGQSARDAERAKDILCRLEHTTLRKVTANTAIYYDPNPQSTVVAEDQEWVNVYYEMPDFDVARISPWLLRVELDRKHMTDRKLTMEQIAEKINAGFGDDLNCIFNDDNAEKLVLRIRIMNSDENKMQEEEEVVDKMDDDVFLRCIESNMLTDMTLQGIEQISKVYMHLPQTDNKKKIIITEDGEFKALQEWILETDGVSLMRVLSEKDVDPVRTTSNDIVEIFTVLGIEAVRKALERELYHVISFDGSYVNYRHLALLCDTMTCRGHLMAITRHGVNRQDTGPLMKCSFEETVDVLMEAAAHGESDPMKGVSENIMLGQLAPAGTGCFDLLLDAEKCKYGMEIPTNIPGLGAAGPTGMFFGSAPSPMGGISPAMTPWNQGATPAYGAWSPSVGSGMTPGAAGFSPSAASDASGFSPGYSPAWSPTPGSPGSPGPSSPYIPSPGGAMSPSYSPTSPAYEPRSPGGYTPQSPSYSPTSPSYSPTSPSYSPTSPNYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPSYSPTSPNYSPTSPNYTPTSPSYSPTSPSYSPTSPNYTPTSPNYSPTSPSYSPTSPSYSPTSPSYSPSSPRYTPQSPTYTPSSPSYSPSSPSYSPTSPKYTPTSPSYSPSSPEYTPTSPKYSPTSPKYSPTSPKYSPTSPTYSPTTPKYSPTSPTYSPTSPVYTPTSPKYSPTSPTYSPTSPKYSPTSPTYSPTSPKGSTYSPTSPGYSPTSPTYSLTSPAISPDDSDDEN
>5FLM_10|Chain J|DNA-DIRECTED RNA POLYMERASES I, II, AND III SUBUNIT RPABC5|BOS TAURUS (9913)
MIIPVRCFTCGKIVGNKWEAYLGLLQAEYTEGDALDALGLKRYCCRRMLLAHVDLIEKLLNYAPLEK
>5FLM_11|Chain K|DNA-DIRECTED RNA POLYMERASE II SUBUNIT RPB11|BOS TAURUS (9913)
MNAPPAFESFLLFEGEKKITINKDTKVPNACLFTINKEDHTLGNIIKSQLLKDPQVLFAGYKVPHPLEHKIIIRVQTTPDYSPQEAFTNAITDLISELSLLEERFRVAIKDKQEGIE
>5FLM_12|Chain L|DNA-DIRECTED RNA POLYMERASES I, II, AND III SUBUNIT RPABC4|BOS TAURUS (9913)
MDTQKDVQPPKQQPMIYICGECHTENEIKSRDPIRCRECGYRIMYKKRTKRLVVFDAR
>5FLM_13|Chain M[auth N]|DNA, DNA-RNA ELONGATION SCAFFOLD|SYNTHETIC CONSTRUCT (32630)
GGCAGTACTAGTAAACTAGTATTGAAAGTACTTGAGCTT
>5FLM_15|Chain O[auth T]|DNA, DNA-RNA ELONGATION SCAFFOLD|SYNTHETIC CONSTRUCT (32630)
AAGCTCAAGTACTTAAGCCTGGTCATTACTAGTACTGCC
>5FLM_3|Chain C|DNA-DIRECTED RNA POLYMERASE II SUBUNIT RPB3|BOS TAURUS (9913)
MPYANQPTVRITELTDENVKFIIENTDLAVANSIRRVFIAEVPIIAIDWVQIDANSSVLHDEFIAHRLGLIPLTSDDIVDKLQYSRDCTCEEFCPECSVEFTLDVRCNEDQTRHVTSRDLISNSPRVIPVTSRNRDNDPNDYVEQDDILIVKLRKGQELRLRAYAKKGFGKEHAKWNPTAGVAFEYDPDNALRHTVYPKPEEWPKSEYSELDEDESQAPYDPNGKPERFYYNVESCGSLRPETIVLSALSGLKKKLSDLQTQLSHEIQSDVLTIN
>5FLM_4|Chain D|DNA-DIRECTED RNA POLYMERASE II SUBUNIT RPB4|BOS TAURUS (9913)
MAAGGSDPRSGDVEEDASQLIFPKEFETAETLLNSEVHMLLEHRKQQNESAEDEQELSEVFMKTLNYTARFSRFKNRETIASVRSLLLQKKLHKFELACLANLCPETAEESKALIPSLEGRFEDEELQQILDDIQTKRSFQY
>5FLM_5|Chain E|DNA-DIRECTED RNA POLYMERASES I, II, AND III SUBUNIT RPABC1|BOS TAURUS (9913)
MDDEEETYRLWKIRKTIMQLCHDRGYLVTQDGLDQTLEEFKAQFGGKPSEGRPRRTDLTVLVAHNDDPTDQMFVFFPEEPKVGIKTIKVYCQRMQEENITRALIVVQQGMTPSAKQSLVDMAPKYILEQFLQQELLINITEHELVPEHVVMTKEEVTELLARYKLRENQLPRIQAGDPVARYFGIKRGQVVKIIRPSETAGRYITYRLVQ
>5FLM_6|Chain F|DNA-DIRECTED RNA POLYMERASES I, II, AND III SUBUNIT RPABC2|BOS TAURUS (9913)
MSDNEDNFDGDDFDDVEEDEGLDDLENAEEEGQENVEILPSGERPQANQKRITTPYMTKYERARVLGTRALQIAMCAPVMVELEGETDPLLIAMKELKARKIPIIIRRYLPDGSYEDWGVDELIITD
>5FLM_7|Chain G|DNA-DIRECTED RNA POLYMERASE II SUBUNIT RPB7|BOS TAURUS (9913)
MFYHISLEHEILLHPRYFGPNLLNTVKQKLFTEVEGTCTGKYGFVIAVTTIDNIGAGVIQPGRGFVLYPVKYKAIVFRPFKGEVVDAVVTQVNKVGLFTEIGPMSCFISRHSIPSEMEFDPNSNPPCYKTMDEDIVIQQDDEIRLKIVGTRVDKNDIFAIGSLMDDYLGLVS
>5FLM_8|Chain H|DNA-DIRECTED RNA POLYMERASES I, II, AND III SUBUNIT RPABC3|BOS TAURUS (9913)
MAGILFEDIFDVKDIDPEGKKFDRVSRLHCESESFKMDLILDVNIQIYPVDLGDKFRLVIASTLYEDGTLDDGEYNPTDDRPSRADQFEYVMYGKVYRIEGDETSTEAATRLSAYVSYGGLLMRLQGDANNLHGFEVDSRVYLLMKKLAF
>5FLM_9|Chain I|DNA-DIRECTED RNA POLYMERASE II SUBUNIT RPB9|BOS TAURUS (9913)
MEPDGTYEPGIVGIRFCQECNNMLYPKEDKENRILLYACRNCDYQQEADNSCIYVNKITHEVDELTQIIADVSQDPTLPRTEDHPCQKCGHKEAVFFQSHSARAEDAMRLYYVCTAPHCGHRWTE
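As a rough illustration of what the FASTA file can and cannot provide, the sketch below parses RCSB-style headers like the ones above into per-entity records and lists the consecutive residue pairs a linear polymer would have. The function names (parse_rcsb_fasta, consecutive_pairs) are hypothetical and not part of MDAnalysis. It also assumes every residue in the sequence is actually present in the model, which is often not the case for experimental structures, so the pairs would still need to be matched against the residues found in the PDB.

```python
def parse_rcsb_fasta(text):
    """Parse RCSB-style FASTA text into a list of record dicts.

    Headers are assumed to look like '>ID|Chain X|DESCRIPTION|ORGANISM'.
    """
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split("|")
            fields += [""] * (4 - len(fields))  # pad short headers
            entity, chains, description, organism = fields[:4]
            # Drop the leading 'Chain ' / 'Chains ' label if present.
            for prefix in ("Chains ", "Chain "):
                if chains.startswith(prefix):
                    chains = chains[len(prefix):]
                    break
            current = {
                "entity": entity,
                "chains": chains,
                "description": description,
                "organism": organism,
                "sequence": "",
            }
            records.append(current)
        elif current is not None:
            # Sequence lines may be wrapped; concatenate them.
            current["sequence"] += line
    return records


def consecutive_pairs(n_residues):
    """Residue-index pairs (i, i+1) for a linear polymer of n_residues."""
    return [(i, i + 1) for i in range(n_residues - 1)]
```

The sequence alone gives residue order, not atom-level bonds, so this is at best a starting point for residue-level (coarse-grained) connectivity.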
Hi @jamesdalg, I suspect it's only reading the CONECT records, which it looks like you've got for your sulphurs. If you do mda.Universe(..., guess_bonds=True), it will guess bonds based on coordinates in addition to the CONECT records, which might give you what you expect.
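To check how many bonds a PDB file actually declares, one can count the CONECT records directly. The sketch below is plain Python, independent of MDAnalysis, and the function name conect_bonds is hypothetical. It assumes the standard fixed-width CONECT layout (a source serial in columns 7-11, followed by up to four bonded serials in 5-character fields). In PDB files these records are generally used for connectivity that standard residue definitions do not imply (disulfides, ligands and the like), which would be consistent with seeing only a few dozen bonds.

```python
def conect_bonds(pdb_text):
    """Return the sorted unique bonds (serial_a, serial_b) from CONECT records."""
    bonds = set()
    for line in pdb_text.splitlines():
        if not line.startswith("CONECT"):
            continue
        # Five 5-character fields: source atom, then up to four partners.
        fields = [line[i:i + 5].strip() for i in range(6, 31, 5)]
        serials = [int(f) for f in fields if f]
        if len(serials) < 2:
            continue
        source = serials[0]
        for partner in serials[1:]:
            # Store each bond once, regardless of direction.
            bonds.add((min(source, partner), max(source, partner)))
    return sorted(bonds)
```

Calling conect_bonds(open("5flm.pdb").read()) (file name assumed) and comparing the length of the result with the number of atoms gives a quick sense of how much connectivity the file itself carries.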
Just to add here, the current bond guessing is a little bit dumb (it looks at things like the distance between atoms). One could indeed improve this by using the expected bonds from known residues. There is work in progress to do this, which should hopefully make it into a release of MDAnalysis soon (see: https://github.com/MDAnalysis/mdanalysis/pull/3866).
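For readers unfamiliar with distance-based guessing, here is a minimal pure-Python sketch of the general idea: two atoms are treated as bonded when their separation is below a scaled sum of their radii. The radii table, the 0.55 scale factor and the 0.1 lower bound are illustrative assumptions, and the function name guess_bonds_by_distance is hypothetical. MDAnalysis's own implementation and defaults may differ.

```python
from itertools import combinations
from math import dist

# Approximate van der Waals radii in angstroms (illustrative values only).
RADII = {"H": 1.2, "C": 1.7, "N": 1.55, "O": 1.52, "S": 1.8}


def guess_bonds_by_distance(atoms, scale=0.55, lower_bound=0.1):
    """Guess bonds from geometry alone.

    atoms: list of (element, (x, y, z)) tuples.
    Returns sorted index pairs (i, j) whose distance d satisfies
    lower_bound < d < scale * (r_i + r_j).
    """
    bonds = []
    # All-pairs loop: O(N^2), fine for a demo but slow for ~30k atoms.
    for i, j in combinations(range(len(atoms)), 2):
        (elem_i, pos_i), (elem_j, pos_j) = atoms[i], atoms[j]
        d = dist(pos_i, pos_j)
        if lower_bound < d < scale * (RADII[elem_i] + RADII[elem_j]):
            bonds.append((i, j))
    return bonds
```

Because the test only looks at distances, any two atoms that happen to sit close together are joined, including atoms from different molecules.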
I am closing this issue as resolved for now. Please do re-open it if you think this needs further discussion or there are other underlying issues not currently addressed.
Thanks! This was insightful and helpful.
@IAlibay If you get this fixed, let me know. Right now, it draws connections between the DNA and the tail of the polymerase. If you know of a decent workaround, that would be helpful to know as well. Is there some format that has the topology correct from the get-go that I could convert the PDB to? What do you do in your simulations? I've tried XPDB, GSD (which has the topology correct, but drops the residues), FHIAIMS, and several others. Is there a conversion tool you'd recommend for converting it to something that gives me the correct topology and residues in MDAnalysis?
@jamesdalg - am I correct in understanding that you are getting these weird bonds after guessing them?
One quick option might be to guess on atomgroups for parts of the system that you know are contiguous, i.e.

    protein = u.select_atoms('protein')
    protein.guess_bonds()
    dna = u.select_atoms('resname DA DC DG DI DT DU')
    dna.guess_bonds()

I think that should avoid guessing bonds between the separate polymers.
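The effect of guessing per group can be shown with a small pure-Python sketch that is independent of MDAnalysis. Atoms carry a group label, and only pairs within the same group are tested against a distance cutoff. The function name guess_within_groups and the single fixed cutoff are assumptions made to keep the example short.

```python
from itertools import combinations
from math import dist


def guess_within_groups(atoms, cutoff):
    """Guess bonds by distance, but only between atoms sharing a group label.

    atoms: list of (group_label, (x, y, z)) tuples.
    Returns sorted index pairs (i, j) closer than cutoff within one group.
    """
    members_by_group = {}
    for index, (label, _) in enumerate(atoms):
        members_by_group.setdefault(label, []).append(index)

    bonds = []
    for members in members_by_group.values():
        # Pairs are only formed inside one group, never across groups.
        for i, j in combinations(members, 2):
            if dist(atoms[i][1], atoms[j][1]) < cutoff:
                bonds.append((i, j))
    return sorted(bonds)
```

In a system where a DNA atom lies within the cutoff of a protein atom, a global guess would join them, while the grouped version leaves them unbonded.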
Alternatively re: bond containing files, you'd have to use "fully informative" molecular dynamics topologies (e.g. Gromacs TPR, and AMBER PARM7). What is your current entry point / simulation method for your files?
@IAlibay I think I'll try the guess_bonds on individual segments. That's the most sensible way of going about it. Here's my approach: I'm using MDAnalysis to pull in a PDB from https://www.rcsb.org/structure/5flm. From there, I'm attempting to coarse-grain it, taking the average x, y, z coordinates of each residue's atoms and plugging them into the positions in a universe object. Then I take the existing, unique bonds between residues (using pandas with merges on a table of atoms and a table of bonds), add them to a new universe with one particle per residue, and output the file to LAMMPS. I could easily try it in GROMACS or CHARMM, which is what's nice about MDAnalysis: once all the pieces are put in correctly, you can export it anywhere. I think I'll try to convert the PDB to GROMACS first and then import it from there. I'll try AMBER too.
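The coarse-graining step described above can be sketched without pandas or MDAnalysis. Each residue becomes one bead at the mean position of its atoms, and atom-level bonds collapse into unique residue-level bonds. This is a minimal sketch of that idea under assumed input shapes, not the poster's actual code, and the function name coarse_grain is hypothetical.

```python
from collections import defaultdict


def coarse_grain(atoms, bonds):
    """Collapse atoms into one bead per residue.

    atoms: list of (residue_key, (x, y, z)); residue_key is any hashable,
           e.g. (chain_id, resid).
    bonds: list of (i, j) atom-index pairs.
    Returns (beads, bead_bonds): beads is a list of (residue_key, centroid)
    in order of first appearance; bead_bonds is a sorted list of unique
    bead-index pairs linking different residues.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0])
    for residue, (x, y, z) in atoms:
        acc = sums[residue]
        acc[0] += x
        acc[1] += y
        acc[2] += z
        acc[3] += 1

    beads = []
    bead_index = {}
    for residue, (sx, sy, sz, count) in sums.items():
        bead_index[residue] = len(beads)
        beads.append((residue, (sx / count, sy / count, sz / count)))

    bead_bonds = set()
    for i, j in bonds:
        a = bead_index[atoms[i][0]]
        b = bead_index[atoms[j][0]]
        if a != b:  # drop bonds internal to a residue
            bead_bonds.add((min(a, b), max(a, b)))
    return beads, sorted(bead_bonds)
```

The result is only as good as the atom-level bond list, so any spurious protein-DNA bond from the guessing step would carry over as a spurious bead-bead bond.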
Great suggestions! Thanks!
Expected behavior
All of the bonds in the molecule are imported into the Universe created from the PDB file. At the moment, I'm importing the human polymerase, but I'm getting very few of the bonds to actually come through. I would imagine that, given that it is made of proteins, there should be more than 35 or 36 bonds in the entirety of a molecule with 32892 or 32712 atoms. It's a bit odd. Is this typical?
Actual behavior
PDB files for 7OL0 and 5FLM import with fewer than 40 bonds in total.
Code to reproduce the behavior
Current version of MDAnalysis
Which version are you using? (run python -c "import MDAnalysis as mda; print(mda.__version__)"): 2.4.2
Which version of Python (python -V)?
(base) jd@oban:/code/python/reprex-3-29-2023$ python -V
Python 3.10.8
Which operating system? Ubuntu 22.04.2 LTS:
(base) jd@oban:/code/python/reprex-3-29-2023$ lsb_release -a
LSB Version: core-11.1.0ubuntu4-noarch:security-11.1.0ubuntu4-noarch
Distributor ID: Ubuntu
Description: Ubuntu 22.04.2 LTS
Release: 22.04
Codename: jammy