SWISS-MODEL / covid-19-Annotations-on-Structures

Mapping sequence data onto structures for the Covid-19 Biohackathon April 2020
https://github.com/virtual-biohackathons/covid-19-bh20/wiki/Annotations-on-Structures
MIT License

[WIP] Scripts for generating structure-derived annotations #29

Closed Ninjani closed 4 years ago

Ninjani commented 4 years ago

Script to generate structure and elastic network model-derived annotations from PDB files

gtauriello commented 4 years ago

@all-contributors please add @Ninjani for code

allcontributors[bot] commented 4 years ago

@gtauriello

I've put up a pull request to add @Ninjani! :tada:

gtauriello commented 4 years ago

@all-contributors please add @akdel for content

allcontributors[bot] commented 4 years ago

@gtauriello

I've put up a pull request to add @akdel! :tada:

gtauriello commented 4 years ago

Thanks for the code so far. For the annotation mapping: do you plan to have a list of PDB IDs with offsets for those mappings? Also, would it be possible to keep a (possibly rounded) representation of the numeric value in the annotation text? This would improve browsing through the results...

I will keep this PR open so you can just easily keep pushing to your fork and we can comment on changes here...

Ninjani commented 4 years ago

Yes, we were planning to have lists of PDB IDs mapped to the corresponding sequence, as well as pre-defined groups of PDB IDs that make sense to compare (e.g. proteins from SARS-CoV-2 with different ligands, or proteins from SARS-CoV-2 vs. SARS-CoV); we'll do this tonight. Good idea about including the numbers in the text, we'll add that. Thanks!

akdel commented 4 years ago

The code is quite usable now. Here's a list describing the current functionality:

  1. Extract experimental PDBs associated with a given UniProt ID and protein name. For instance, this lets you obtain all the PDB IDs related to the 3C-like proteinase in P0DTD1 (Replicase polyprotein 1a); see example_usage.py. You can also retrieve PDB IDs based on available annotations on the UniProt sequence. The code maps the PDB residue index to the reference UniProt index (using muscle to align; maybe there's a better way?).

  2. Using above functionality, it's possible to retrieve a relevant set of PDBs and create an ensemble of structures. The ensemble can be used to compute:

    • RMSD of each structure to the reference structure
    • RMSD of each residue across all structures
    • fluctuations of each residue according to a PCA across all structures
  3. For a single PDB ID, we predict (according to elastic network models)

    • residue fluctuations
    • perturbation response
    • mechanical stiffness
    • hinge sites
  4. Points 2 and 3 can be written out as structure annotation and visualized in the beta SWISS-MODEL annotation website (see example_usage.py and the annotations folder for examples).
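The ensemble measures in point 2 reduce to simple array arithmetic once the superposed CA coordinates are stacked. A minimal NumPy sketch (the function name and array layout here are illustrative, not the PR's actual code):

```python
import numpy as np

def ensemble_rmsds(coords):
    """coords: (n_structures, n_residues, 3) CA coordinates, already
    superposed, with structure 0 as the reference.
    Returns (RMSD of each structure to the reference,
             RMSD of each residue across all structures)."""
    diff = coords - coords[0]
    sq = (diff ** 2).sum(axis=2)           # squared deviation per residue
    per_structure = np.sqrt(sq.mean(axis=1))
    per_residue = np.sqrt(sq.mean(axis=0))
    return per_structure, per_residue
```

The per-residue fluctuations from the PCA would sit on top of the same stacked array.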

gtauriello commented 4 years ago

For the code: could you add a little README to the folder to describe the functionality (and also otherwise check if anything else is missing according to the contribution guidelines)? It looks like great work. Thanks.

Also if there are useful reusable components in here for other PRs, we could move those to the utils folder and do a separate PR to merge them into master?

On SIFTS mappings: if they are correct, you shouldn't have to run any alignments with muscle but can use the residue numbering directly, using the difference between "start" and "unp_start" as the offset. So basically a continuous stretch in the PDB seqres (which is what the residue numbering in mmCIF files is linked to) should match a continuous stretch in the UniProt sequence. Unfortunately, the data in the "best_structures" SIFTS API (which you use) doesn't handle the case where a continuous PDB stretch maps to a discontinuous stretch in UniProt. As far as I can see for cov structures, this only happens in 6lxt, which we have blacklisted for our cov-page (also for various other reasons). The per-pdb SIFTS API does handle such cases (see here for 6lxt and our display in SMR, which uses little 'T' symbols to mark jumps in sequence in the alignment).
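For the simple continuous case, the offset arithmetic amounts to the following sketch (`pdb_to_unp` is a hypothetical helper; the field names are those of the SIFTS "best_structures" API):

```python
def pdb_to_unp(res_num, segment):
    """Map a seqres-linked residue number to a UniProt position,
    assuming the segment maps a continuous stretch (the simple case;
    discontinuous mappings need the per-residue SIFTS data)."""
    if not segment["start"] <= res_num <= segment["end"]:
        raise ValueError(f"residue {res_num} outside mapped segment")
    offset = segment["unp_start"] - segment["start"]
    return res_num + offset
```

With the 6yb7 chain A numbers quoted further down in this thread (start=1, unp_start=3264), residue 1 maps to UniProt position 3264.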

In terms of PDB structures on our cov-page vs SIFTS I noticed the following:

Ninjani commented 4 years ago

We added a README. I think the parse_pdbe.py would fit better in utils as it can be used to obtain a dictionary mapping from a PDB ID to the UniProt reference sequence. Will make another pull request for this.

About the mapping, we did start out by trying to directly use the given offsets but it didn't work for some proteins. e.g. for 6yb7 chain A, the information from PDBe is:

{'end': 306,
 'chain_id': 'A',
 'pdb_id': '6yb7',
 'start': 1,
 'unp_end': 3569,
 'coverage': 0.043,
 'unp_start': 3264,
 'resolution': 1.25,
 'experimental_method': 'X-ray diffraction',
 'tax_id': 2697049}
pd.parseCIF("6yb7", chain="A").select(f"resnum 1:306").select("calpha").getSequence()

'SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTF'

Compared to the reference (discrepancies in bold):

mapper.uniprot_sequence[3264 - 1: 3569]

'SGFRKMAFPSGKVEGCMVQVTCGTTTLNGLWLDDVVYCPRHVICTSEDMLNPNYEDLLIRKSNHNFLVQAGNVQLRVIGHSMQNCVLKLKVDTANPKTPKYKFVRIQPGQTFSVLACYNGSPSGVYQCAMRPNFTIKGSFLNGSCGSVGFNIDYDCVSFCYMHHMELPTGVHAGTDLEGNFYGPFVDRQTAQAAGTDTTITVNVLAWLYAAVINGDRWFLNRFTTTLNDFNLVAMKYNYEPLTQDHVDILGPLSAQTGIAVLDMCASLKELLQNGMNGRTILGSALLEDEFTPFDVVRQCSGVTFQ'

So we decided to play it safe and make an alignment after selecting by the given offsets. We could also remove this and throw an error in these cases of course, what do you think?
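For illustration, a naive positional comparison shows why an alignment (or a per-residue mapping) was needed here: the single inserted residue shifts every later position, so comparing character by character reports a long spurious tail of mismatches. A minimal sketch, not code from the PR:

```python
def positional_mismatches(seq_a, seq_b):
    """Indices where two sequences disagree when compared position by
    position. A single insertion/deletion shifts everything after it,
    producing a spurious tail of mismatches -- the failure mode an
    alignment avoids."""
    return [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# The extra Q in the reference ("...VRIQPGQTF..." vs "...VRIQPGTF...")
# makes every position from the insertion onward look mismatched:
print(positional_mismatches("VRIQPGTFS", "VRIQPGQTF"))  # [6, 7, 8]
```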

gtauriello commented 4 years ago

Ok so analyzing pd.parseCIF("6yb7", chain="A").select(f"resnum 1:306").select("calpha").getSequence() I see two issues:

  1. It looks like ProDy does Python-style slicing, so '1:306' doesn't include the end residue at pos. 306. As a result your sequence is too short...
  2. The sequence of a PDB entry (what I referred to as 'seqres' above) doesn't have to correspond to the list of residues for which you have coordinates. Very commonly you will have gaps for parts which couldn't be resolved experimentally, and re-aligning this information is bound to introduce unnecessary errors. As an example, here is the PDB entry 6vyb as displayed in our template library. The "Seqres" in the alignment is the protein sequence, and below it you see the residues which have any atom coordinates in the respective chains. Lots of gaps. We actively avoid using any type of alignment there since frankly there are ambiguous cases which would break the alignment. Instead we use the residue numbers to map each residue to its spot within the seqres.
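The off-by-one in point 1 is the usual half-open range behavior; if the selection string is interpreted Python-style, '1:306' covers residues 1..305:

```python
# If "resnum 1:306" behaves like Python's half-open range(1, 306),
# residue 306 is excluded and the chain comes out one residue short.
half_open = list(range(1, 306))        # what "1:306" selected
inclusive = list(range(1, 306 + 1))    # what "1 to 306" should select

assert len(half_open) == 305   # one residue short
assert len(inclusive) == 306
```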

So in general I don't think it's a good idea to map residue indices to UniProt sequence positions; it's better to map residue numbers to UniProt sequence positions.

And to be honest there are way more special cases to deal with when generalizing the reading of SIFTS mappings. I am doing some simplifications here to make this easier, but the only clean data for the mapping comes from parsing the xml files from SIFTS (see links here). That's what we use for the mappings in our repository. In the xml each residue is listed separately, so you can generate a clean map of residue number to UniProt sequence position.
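A sketch of what extracting that per-residue map could look like with the stdlib parser. The fragment below is illustrative, modeled on the SIFTS residue/crossRefDb layout rather than copied from a real file, so the element and attribute names should be checked against an actual download (real files also carry an XML namespace):

```python
import xml.etree.ElementTree as ET

# Illustrative fragment modeled on the SIFTS per-residue layout;
# real files use a namespace and many more attributes.
SAMPLE = """
<entry>
  <entity type="protein">
    <segment>
      <listResidue>
        <residue dbResNum="1">
          <crossRefDb dbSource="PDB" dbResNum="1"/>
          <crossRefDb dbSource="UniProt" dbResNum="3264"/>
        </residue>
        <residue dbResNum="2">
          <crossRefDb dbSource="PDB" dbResNum="2"/>
          <crossRefDb dbSource="UniProt" dbResNum="3265"/>
        </residue>
      </listResidue>
    </segment>
  </entity>
</entry>
"""

def residue_to_uniprot(xml_text):
    """Build {seqres residue number -> UniProt position} from the xml."""
    mapping = {}
    root = ET.fromstring(xml_text)
    for residue in root.iter("residue"):
        unp = None
        for ref in residue.findall("crossRefDb"):
            if ref.get("dbSource") == "UniProt":
                unp = int(ref.get("dbResNum"))
        if unp is not None:
            mapping[int(residue.get("dbResNum"))] = unp
    return mapping
```

Residues without a UniProt crossRefDb (unobserved or unmapped) simply drop out of the map, which is what makes the xml robust to gaps.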

Ninjani commented 4 years ago
  1. Yes, true: this needed the "to" syntax rather than ":".
  2. Yes, we realized that sometimes residues are missing, but even using the resnum and taking these into account led to mapping errors for some PDB IDs. Will look into the xml as you suggest, thanks!
gtauriello commented 4 years ago

For point 2: the PDB seqres can have mismatches compared to the UniProt sequences. That's ok and can happen for various reasons during generation of the structures. These are annotated as "conflict" in the xml and you can also spot them on PDBe-KB. This also means that at those positions you probably shouldn't map any annotations where sidechains are of relevance. So it's good that you spot them, but it's expected to happen... As an example, 6vyb has conflicts at UniProt positions 986 & 987 (PDB resnums 1005 & 1006): see the links to PDBe-KB and SMR.

Ninjani commented 4 years ago

Okay, I finally figured out that the root of all the issues was that the PDBe numbering is sometimes different from the PDB/CIF resnum numbering, e.g.

PDB     CHAIN   SP_PRIMARY      RES_BEG RES_END PDB_BEG PDB_END SP_BEG  SP_END
6w01    A       P0DTD1  25      370     1       346     6452    6797

where the PDBe start is 25 but the PDB resnum start is 1.

Fixed this now by using the uniprot_segments_observed.tsv.gz file on the SIFTS page, which maps from PDBe to PDB to UniProt numbering. Also made another pull request (#38) to add this PDBe mapping script to utils.
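Using the column names shown in the row above, loading uniprot_segments_observed.tsv into per-chain segments can be sketched as follows (the real file is gzipped and may contain comment lines; that handling, and keeping the author-provided PDB_BEG/PDB_END as strings, are assumptions):

```python
import csv

def read_segments(tsv_text):
    """Parse uniprot_segments_observed rows into
    {(pdb_id, chain): [segment dicts]}, keyed by the header names above."""
    lines = [l for l in tsv_text.splitlines() if l and not l.startswith("#")]
    segments = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        key = (row["PDB"], row["CHAIN"])
        segments.setdefault(key, []).append({
            "uniprot": row["SP_PRIMARY"],
            # seqres-linked residue numbers
            "res_beg": int(row["RES_BEG"]), "res_end": int(row["RES_END"]),
            # author-provided numbers; kept as strings since they may
            # carry insertion codes or be absent
            "pdb_beg": row["PDB_BEG"], "pdb_end": row["PDB_END"],
            "unp_beg": int(row["SP_BEG"]), "unp_end": int(row["SP_END"]),
        })
    return segments
```

With the 6w01 row above, this yields res_beg=25 (PDBe/seqres) alongside pdb_beg="1" (author resnum), making the numbering discrepancy explicit.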

gtauriello commented 4 years ago

@Ninjani Using the PDB resnum is probably a bad idea. PDB entries in mmCIF format have two residue numbers. The historical "PDB resnum" is the "author provided" residue number, which was used in PDB format. These are fairly arbitrary since they are "author provided". In mmCIF you have both the "author provided" numbers (labeled "auth_seq_id") and the new, seqres-linked ones (labeled "label_seq_id"). The latter is preferred since it is consistent across the data stored in PDB entries (it is also what RES_BEG and RES_END refer to in the SIFTS files). Often enough the two are reasonably consistent (or just shifted by an offset), but there is no guarantee whatsoever that you have any type of continuous numbering in those "PDB resnums".

I hope that ProDy also reads the seqres-linked residue numbers, as it should for proper parsing of mmCIF files. If it does, please don't use the PDB resnum but use the seqres-linked residue numbers instead.

And I would still argue that it's best to parse the xml file (which, if really really necessary, also includes the PDB resnum).

mlgill commented 4 years ago

@gtauriello @Ninjani

Using the master branch from Ninjani's fork, I have incorporated the solvent accessibility code into the single structure annotation code. It creates a text annotations file for upload to SWISS MODEL as well as a single CSV of all annotation values. A CSV file is not currently created for the ensemble annotations. I was uncertain about testing the upload, but happy to try if it won't cause problems.

I've also added a conda env file and accessory scripts, although the env file could probably use some cleanup.

My code is in the branch mlgill/08_surface_access_interface. I will be offline for much of the remainder of the day but happy to fix any issues later this weekend.

gtauriello commented 4 years ago

@mlgill No worries about uploading the annotations. Seems to work fine (used your Relative_Solvent_Accessibility.txt): https://beta.swissmodel.expasy.org/repository/covid_annotation_project/WLWmmq

I have some questions on what I am looking at though (might need more text in the README):

  1. What protein(s) was/were used to compute the surface accessibility? It seems to be a structure of nsp7, for which two exp. structures exist (PDB: 6m71, 7btf) as heteromeric complexes.
  2. Given that nsp7 can form oligomeric complexes (see also our cov-page on this) it's important for surface accessibility to look at the single chain as well as the different complexes.
mlgill commented 4 years ago

@gtauriello Good catch -- I'd forgotten that DSSP will use all atoms in the PDB file to calculate solvent accessibility, regardless of which ones are selected in ProDy. I've updated the code so that it will create a new PDB file for DSSP if a subset of atoms are selected. Thus, the results returned will correspond to solvent accessibility of the selected atoms, assuming ONLY the selected atoms are present. I have also updated the readme.

Another alternative would be to calculate two sets of solvent accessibilities (one with all atoms and one with only the selections). However, I felt that was an unnecessary level of complexity since it's possible to achieve the same result by rerunning the calculation without a chain selection.

gtauriello commented 4 years ago

@mlgill Could you elaborate (also in the README) on what the "selected atoms" are? I suppose with "selected atoms" you mean that you select all atoms in a single chain which would be the "solvent accessibility of a single chain in isolation"?

I want to stress that in the case of solvent accessibility (and possibly other annotations too), for proteins which are known to form oligomeric complexes (as is the case for nsp7) it is very important to consider both the protein chain in isolation and the protein complex. Proteins are rarely in isolation in cells. So if we focus too much on chains in isolation we risk providing results which are of limited use...

Also for the README (for all annotations) it is important to describe what we find in the files in the annotations-subfolder. Currently for instance it's hard to understand what PDB files were used to compute those annotations. Basically it should be possible (at least for someone like me) to skip the code, look at the available annotations and know what I am looking at only using the information in the README...

mlgill commented 4 years ago

@gtauriello Thanks for the feedback. I will work on the updates when I am able to. In the meantime, could you please elaborate on the following:

Also I still don't see a comment for the Relative_Solvent_Accessibility.txt file stating which PDB file was used to get those annotations. If I skip the code and look at the available annotations, I should be able to know what I am looking at only using the information in the README...

Happy to add this. I'd note that this is also missing from the output files @Ninjani created (e.g. here) -- I was following his format since I don't have a lot of experience with SWISS-MODEL. Accordingly, I'm going to assume that the output files from the methods he implemented should also be updated.

Is there documentation somewhere on the comment format for the annotations file? I did not see anything here.

I want to stress that in the case of solvent accessibility (and possibly other annotations too), for proteins which are known to form oligomeric complexes (as is the case for nsp7) it is very important to consider both the protein chain in isolation and the protein complex. Proteins are rarely in isolation in cells. So if we focus too much on chains in isolation we risk providing results which are meaningless for practical purposes...

I understand the concern regarding monomers vs complexes. However, this is an issue where there is unlikely to be a one-size-fits-all solution, at least not without adding a lot of complexity to the code and thus reducing usability by the target audience.

As I mentioned above, I see two potential straightforward solutions to the issue of monomer / oligomers:

Are either of these solutions acceptable? If not, could you please elaborate on what it is you'd like to see? As it currently stands, I am left guessing at what it is you're looking for.

gtauriello commented 4 years ago

First of all: sorry for the original version of my comment sounding a bit overly harsh. I noticed after I clicked on "comment" and edited it... ;-)

Also I generalized my comment on annotations as I completely agree that this is missing for all annotations. I would use the README to explain what can be found in those files. We didn't foresee that the annotations would have comments and hence I don't think it is possible to add the comments there. But I think it's perfectly fine to have the info in a README next to those files...

If the current code can easily be run without a chain selection to compute the surface accessibility for the whole complex, that's fine I think. Thanks for elaborating on it. It would be good to describe this in the README (e.g. "we have precomputed results for the chain in isolation, but if you run the code like ... you get the results for the protein complex"). And sorry that I don't have much time to look at the code and what it does exactly; I am mostly judging the visible output and documentation.

mlgill commented 4 years ago

@gtauriello No problem at all -- these are the challenges associated with hackathons, especially distributed ones. :D

Your additional comments help clarify what you're looking for -- thank you! I'll make the fixes as I understand them later today and let you know when they're finished.

Ninjani commented 4 years ago

@mlgill @gtauriello Just wanted to add that one option we used for including the relevant PDB name is to add it to the title while posting the annotation to the SWISS-MODEL website (see here). This shows up when clicking the generated link for the annotation but is not shown on the page with the structure. Will add a README too as suggested.

mlgill commented 4 years ago

@Ninjani That's a good point. I want to add code to create an information output file -- is this what you meant by adding a "README", since we already have one in our str_annotation directory?

Also, I made some updates to the README in our directory here, including adding the names of the output files for each of the features. However, I'm not clear on which file contains the following: RMSD of each structure to the reference structure (the first PDB ID in list). Could you double check these file names and let me know what's missing?

Finally, I've been pulling your branch into mine (https://github.com/mlgill/covid-19-Annotations-on-Structures/blob/mlgill/08_surface_access_interface), but I have additional code that I've committed. If we use this MR, I'll either need to push my code to your branch (if you'll give me access) or I'll need to open a separate one. Let me know which you'd prefer.

gtauriello commented 4 years ago

Personally I would prefer if you merge the two forks and have both changes in this PR. So that would either require access for @mlgill in the master branch of @Ninjani fork or a PR of the mlgill-branch into the Ninjani-fork. Probably the former is easier...

Thanks for the README-updates. I think it's going in the right direction of clarifying the content of those folders...

mlgill commented 4 years ago

Agreed, I think it makes sense to have everything here so it's with the discussion. @Ninjani Let me know your preference. I'll pick this up again tonight when I've gotten a few other things done.

Thanks for the discussion, all. Have really enjoyed this and learned a lot.

Ninjani commented 4 years ago

@Ninjani That's a good point. I want to add code to create an information output file -- is this what you meant by adding a "README", since we already have one in our str_annotation directory?

Yes this is what I meant, a file that just maps each output annotation filename to whatever additional information (PDB ID, other parameters etc.) was used to create it.

Also, I made some updates to the README in our directory here, including adding the names of the output files for each of the features. However, I'm not clear on which file contains the following: RMSD of each structure to the reference structure (the first PDB ID in list). Could you double check these file names and let me know what's missing?

Ah you're right, I just realized I never wrote this information out to an annotation file because I was unsure how to annotate this. It applies to the entire reference structure as each RMSD value is calculated across the full structure but compared to each other structure in the ensemble - was not sure whether to make one file per structure in the ensemble or just a list of (PDB ID, RMSD) pairs attached to (all the residues in) the reference structure. Thoughts?

Finally, I've been pulling your branch into mine (https://github.com/mlgill/covid-19-Annotations-on-Structures/blob/mlgill/08_surface_access_interface), but I have additional code that I've committed. If we use this MR, I'll either need to push my code to your branch (if you'll give me access) or I'll need to open a separate one. Let me know which you'd prefer.

I've given you access to my fork now! I agree a merge seems best.

Thanks a lot for everything, this has been great! I'll try to finish up the few remaining tasks associated with this tonight or tomorrow.

mlgill commented 4 years ago

@Ninjani Thanks for the updates and for giving me access to your fork. I'll push my code there when I return to this tonight.

For the missing RMSD data, perhaps it's best to diverge from the format imposed by SWISS-MODEL since this is unlikely to be an annotation. A CSV could be created with the upper triangle of a matrix to store the pairwise RMSD values (assuming I understood correctly what you want)? Thoughts?
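A sketch of that CSV output (the function and column names are made up for illustration): iterate the upper triangle and write one row per structure pair.

```python
import csv
import numpy as np

def write_pairwise_rmsd(coords, pdb_ids, path):
    """coords: (n, n_res, 3) superposed ensemble; writes the upper
    triangle of the pairwise RMSD matrix as pdb_a,pdb_b,rmsd rows."""
    n = len(pdb_ids)
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["pdb_a", "pdb_b", "rmsd"])
        for i in range(n):
            for j in range(i + 1, n):
                rmsd = np.sqrt(((coords[i] - coords[j]) ** 2)
                               .sum(axis=1).mean())
                writer.writerow([pdb_ids[i], pdb_ids[j], f"{rmsd:.3f}"])
```

A long-format id_a/id_b/rmsd listing avoids the ambiguity of a triangular matrix layout and stays trivial to reload.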

Ninjani commented 4 years ago

@mlgill Yes that makes a lot of sense, I'll do that. Thanks!

gtauriello commented 4 years ago

@mlgill the .vscode folder looks like local settings for you. Would it be ok for you to add it to .gitignore and remove the folder from the git repository?

mlgill commented 4 years ago

@Ninjani @gtauriello Just a heads up that I am trying to fix Ninjani's master branch so that all my changes are squashed into a single commit at the end. The way that I handled the push previously did not do this and I don't like the results for the sake of reproducibility. I won't delete any branches and am testing on other branches first.

mlgill commented 4 years ago

@Ninjani Using the latest version of your master, which is currently in the branch master_NINJANI due to code cleanup, I'm encountering the following error when I run the ensemble demo.

Unfortunately, this is one of the reasons I started the branch clean-up -- I thought that perhaps I'd introduced the issue somewhere and didn't catch it. Any thoughts on what might be causing this? Thanks

Exception has occurred: KeyError
('6y2g', 'B')
  File "/home/mgill/code/covid-19-NINJANI/str_derived_annotations/parse_pdbe.py", line 132, in map_to_pdb
    pdbe_mapping = self.all_residue_mapping[(pdb_id, chain_id)]
  File "/home/mgill/code/covid-19-NINJANI/str_derived_annotations/example_usage.py", line 49, in example_ensemble
    residue_mapping, _ = reference_mapper.map_to_pdb(pdb_info_list[0])
  File "/home/mgill/code/covid-19-NINJANI/str_derived_annotations/example_usage.py", line 99, in <module>
    example_ensemble()
mlgill commented 4 years ago

@Ninjani One more issue that should probably be addressed. I think it would be good to add code that at least checks for the existence of uniprot_segments_observed.tsv and either downloads it automatically or tells users where to download it and put it in the code. What do you think?
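A possible shape for such a check (the download URL is an assumption based on the SIFTS downloads area and should be verified):

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

# Assumed download location -- verify against the SIFTS downloads page.
SIFTS_URL = ("https://ftp.ebi.ac.uk/pub/databases/msd/sifts/"
             "flatfiles/tsv/uniprot_segments_observed.tsv.gz")

def ensure_segments_file(path="uniprot_segments_observed.tsv"):
    """Return the path to the mapping file, downloading and
    decompressing it on first use."""
    tsv = Path(path)
    if tsv.exists():
        return tsv
    gz = Path(str(tsv) + ".gz")
    urllib.request.urlretrieve(SIFTS_URL, gz)  # needs network access
    with gzip.open(gz, "rb") as src, open(tsv, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return tsv
```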

mlgill commented 4 years ago

@gtauriello @Ninjani

I believe I've finished fixing the branch issues. The following is an inventory of branches:

Ninjani, would you like to handle the switch over? Or grant me the ability to change/delete master?

Ninjani commented 4 years ago

One more issue that should probably be addressed. I think it would be good to add code that at least checks for the existence of uniprot_segments_observed.tsv and either downloads it automatically or tells users where to download it and put it in the code. What do you think?

@mlgill So this part still needs to be rewritten to use the XML files from PDBe instead of this file (see this comment thread) but I haven't yet found time to do this. I'll try again to find some time today, and once it's done we can go ahead with pull request #38 which moves the parse_pdbe script to utils. Then we'll have to refactor a couple of lines in the other files to import this script from utils instead of the current location.

I believe I've finished fixing the branch issues. The following is an inventory of branches:

I think I've granted the ability to change master. Let me know if this isn't the case, I'm still not great at figuring out github options.

mlgill commented 4 years ago

Hi @Ninjani

@mlgill So this part still needs to be rewritten to use the XML files from PDBe instead of this file (see this comment thread) but I haven't yet found time to do this. I'll try again to find some time today, and once it's done we can go ahead with pull request #38 which moves the parse_pdbe script to utils. Then we'll have to refactor a couple of lines in the other files to import this script from utils instead of the current location.

Apologies, I missed this discussion from earlier -- I was just noting issues that I'd seen, in addition to the other bug with PDB mapping that I mentioned above.

I think I've granted the ability to change master. Let me know if this isn't the case, I'm still not great at figuring out github options.

Unfortunately, it doesn't look like I can make the necessary changes -- I can't see the Settings panel shown in the first picture here. Have you given me admin privileges, as is described here?

mlgill commented 4 years ago

@gtauriello Hi, following up on your feedback:

I generalized my comment on annotations as I completely agree that this is missing for all annotations. I would use the README to explain what can be found in those files. We didn't foresee that the annotations would have comments and hence I don't think it is possible to add the comments there. But I think it's perfectly fine to have the info in a README next to those files...

If the current code can be easily run without a chain selection and do the surface accessibility for the whole complex that's fine I think. Thanks for elaborating on it. Would be good to describe it in the README (e.g. "we have precomputed results for chain in isolation but if you run the code like ... you get the results for the protein complex"). And sorry if I don't have too much time to look at the code and what it does exactly. I am mostly judging the visible output and documentations.

I have completed the code improvements for the solvent accessibility calculation -- by default, it uses the entire PDB for the calculation, but this can also be changed so that only the chain from the residue mapping is exported. In both cases, only the residues from the residue mapper are exported so that all files are uniform.

I've added quite a bit of explanation to the README as well as a file that describes the calculation e.g. here.

Note that my code is currently in this branch while I wait for admin access to @Ninjani 's repo so I can make this branch master.

Please let me know of any additional feedback -- otherwise I will look through the report draft in-progress to determine what is needed from me.

Ninjani commented 4 years ago

@Ninjani Using the latest version of your master, which is currently in the branch master_NINJANI due to code cleanup, I'm encountering the following error when I run the ensemble demo. Unfortunately, this is one of the reasons I started the branch clean-up -- I thought that perhaps I'd introduced the issue somewhere and didn't catch it. Any thoughts on what might be causing this? Thanks

Ah I missed this comment, I'll take a look! It could be that the TSV file doesn't have some PDB IDs; yet another reason to switch to SEQRES and XML. I'm still trying to figure this out as I don't think ProDy stores the SEQRES identifiers, but haven't been able to spend a lot of time on it.

Unfortunately, it doesn't look like I can make the necessary changes -- I can't see the Settings panel shown in the first picture here. Have you given me admin privileges, as is described here?

@mlgill Hmm, I'm afraid this option is only available to organizational accounts while mine is a personal account. Maybe you could walk me through the steps I'd need to take to make the necessary changes?

gtauriello commented 4 years ago

@mlgill and @Ninjani for the branches: If the plan is to trash Ninjani:master and replace it with Ninjani:master_MICHELLE, we might as well do a new PR. Either way that assumes that there are no relevant changes in master compared to master_MICHELLE. Is that correct?

If you want to keep this PR and replace master, the commands to execute are these:

git checkout master_MICHELLE
git pull --rebase
git checkout master
git reset --hard master_MICHELLE
git push origin master --force

Note that this changes the commit history of master and hence might break any work done by anyone else on those checkouts. So it might be a better option to just close this PR and start a new one from master_MICHELLE. It's usually better anyway to do work in branches when forking, so that you can safely force-reset master to be in sync with the main fork...

mlgill commented 4 years ago

@mlgill and @Ninjani for the branches: If the plan is to trash Ninjani:master and replace it with Ninjani:master_MICHELLE, we might as well do a new PR. Either way that assumes that there are no relevant changes in master compared to master_MICHELLE. Is that correct?

If you want to keep this PR and replace master, the commands to execute are these:

git checkout master_MICHELLE
git pull --rebase
git checkout master
git reset --hard master_MICHELLE
git push origin master --force

Note that this changes the commit history of master and hence might break any work done by anyone else on those checkouts. So it might be a better option to just close this PR and start a new one from master_MICHELLE. It's usually better anyway to do work in branches when forking, so that you can safely force-reset master to be in sync with the main fork...

Thanks for the nudge @gtauriello and for the reply earlier @Ninjani. I have confirmed that master_MICHELLE is up to date via a directory level diff. I also created a backup branch of master as it currently stands as a precaution.

@gtauriello, your command sequence seems to have worked -- thank you! I tried this sequence previously but I think I did not rebase when pulling. It now appears that Ninjani:master is up to date and the commit history is correct. I've also done another directory level diff on the files and they match master_MICHELLE.

Given that this works, do we want to leave this PR open? It does have some important discussion about open items that I believe remain on @Ninjani 's portion of the code:

Does this sound correct to everyone? @Ninjani, maybe you could provide an estimate of when you might be able to address the first item.

If we want to trash this PR, then I'm happy to close it and create a new one with the remaining issues above. Just let me know.

Ninjani commented 4 years ago

@mlgill Great that the merge worked! I'm sorry for the delay on my part with addressing the XML issue, some deadlines caught up with me. I'll finish it up this weekend and double check that the bug you found is resolved. I think it makes sense to keep these issues in this PR with the relevant discussion about them, but up to you!

mlgill commented 4 years ago

@Ninjani Glad that all appears good on your end -- I agree, let's keep things in this PR. Please ping me here if you need anything this weekend. Happy to test that the bug is fixed for the third item and ensure there are no other regressions.

Ninjani commented 4 years ago

@mlgill We wrote the XML parsing code and switched everything to use that - it does seem better right away since it handles missing residues. The changes are in the sift_xml_parser branch. I've tested it with the example_usage and everything runs fine. Also, I believe that the bug you noticed with ('6y2g', 'B') is also fixed - I could obtain mappings for that pdb_id, chain_id combination with the new parse_pdbe.get_pdb_to_uniprot_mapping function. Could you take a look to make sure we didn't break anything else? If everything's good we can move forward with factoring out the parse_pdbe.py file to utils and then merge both pull requests.

mlgill commented 4 years ago

@Ninjani confirmed that everything works for me! My only (optional) suggestion is to remove the cif files after they are downloaded. Up to you.

Let me know if I can help further with code testing or refactoring. Can often make time on the weekend.

Ninjani commented 4 years ago

@mlgill Thanks for checking! Would indeed love some help with refactoring as I want to make sure the commits and pull requests are correctly implemented. Here's what I had in mind:

Does this make sense?

mlgill commented 4 years ago

Hi @Ninjani,

I read through your suggestions. I'm not clear why str_derived_annotations should be completely removed from the branch submitted for #38. Maybe you could explain further?

In the meantime, I've done the following as an intermediate check:

If you're ok with this, I'm happy to merge this branch into master on your repo. Then we can reference it in #38 and here, merge with master here, and close both PRs.

What do you think of this as a solution? Anything I've misunderstood?

Ninjani commented 4 years ago

@mlgill That sounds good! I guess I was thinking in terms of having the utils PR predate this one so that we could import from utils here but on second thought that sounds pretty confusing. Thanks for everything and please go ahead! Let me know if I should do something else, this has been quite exciting and I've learned a lot.

gtauriello commented 4 years ago

Yeah, I agree that it's easier if we proceed as @mlgill suggested and I simply close #38, referencing the changes done here. Thanks to both of you for all your efforts.

mlgill commented 4 years ago

Ok, I've merged refactor_parse_pdbe in master on Ninjani:master. We can now merge that with master on this repo and close this PR. Have also made a note on #38 that we can close it. Let me know if something else is needed to finish this out.

gtauriello commented 4 years ago

@mlgill @Ninjani Just one thing: is example_usage_ORIG.py still needed? Seems like a backup of example_usage.py...

Also: I would preferably "squash and merge" it into the main master (i.e. just one commit for the whole PR). I hope that's ok for you. I can try to put together a summary commit text but if any of you feels inspired to write a short summary, that would be great of course...

mlgill commented 4 years ago

@gtauriello Good catch -- that was indeed a stray backup file that I accidentally merged and forgot to delete. I have fixed it.

@Ninjani Would you like to do the honors of writing a summary paragraph since you created the foundation for our methods?