Fixing mislabeling in protein viewer label

KianBadie commented 2 years ago

Currently, the protein viewer label sometimes displays incorrect values for a highlighted protein.

KianBadie commented 2 years ago

@coiko Thank you very much for the guide you provided on the protein labeling. I think I understand how the system works! I was trying to see the pattern in PV viewer's labeling system and I just wanted to ask some questions to confirm things.

PV labels atoms in a different way, but I think I still see a connection. Some examples of PV atom labels are "A.ARG453.CA" or "B.TYR292.CA". It looks like we determined that the first letter determines the protein number (A-Z being 1-26, a-z being 27-52, 1-9 being 53-61). It then looks like we use the first 3 characters of the middle sequence for the protein name (which seems to not have any problems). Lastly, we use the last 3 digits of the middle sequence for the residue number.

I had a couple of questions from this:

1. Are the residue numbers working as intended? Or are those causing problems as well?
2. In a big pdb like "3j31", does it go past A-Z? I was clicking around trying to find an example of something very large, but it seemed like everything stayed within the A-Z range.
3. Because of the above point, I am having a bit of trouble understanding how something like 1.60/A would look in PV. Is this difference in format expected? Or does it seem like some things are getting lost in translation with PV?
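
To make the mapping described above concrete, here is a minimal Python sketch of how a label such as "A.ARG453.CA" could be split into its parts. It is not PV's API or the project's actual code: the name `parse_pv_label` and the regular expression are assumptions for illustration, and it assumes every label has the form chain.residue.atom with a three-letter residue name followed by the residue number.

```python
import re
import string

# Chain characters in the order described above:
# A-Z -> 1-26, a-z -> 27-52, 1-9 -> 53-61.
CHAIN_ORDER = string.ascii_uppercase + string.ascii_lowercase + "123456789"

# Assumed label shape: <chain>.<3-letter residue name><residue number>.<atom name>
LABEL_PATTERN = re.compile(
    r"^(?P<chain>[^.]+)\.(?P<name>[A-Za-z]{3})(?P<number>-?\d+)\.(?P<atom>[^.]+)$"
)


def parse_pv_label(label):
    """Split a label such as "A.ARG453.CA" into its parts (hypothetical helper)."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognized label format: {label!r}")
    chain = match.group("chain")
    # find() returns -1 for characters outside the 61-symbol ordering.
    position = CHAIN_ORDER.find(chain) if len(chain) == 1 else -1
    return {
        "chain": chain,
        "protein_number": position + 1 if position >= 0 else None,
        "residue_name": match.group("name"),
        "residue_number": int(match.group("number")),
        "atom": match.group("atom"),
    }


print(parse_pv_label("A.ARG453.CA"))
print(parse_pv_label("B.TYR292.CA"))
```

Reading the residue number as "everything after the first three characters" rather than "the last three digits" keeps the sketch working for residue numbers with fewer or more than three digits.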

coiko commented 2 years ago

@KianBadie Ah, interesting. PV must be skipping the rest of the protein identifier in the pdb file and just picking out the letter part. (That explains why I noticed there are sometimes multiple proteins with the same number in a structure.) I can change the identifiers in the pdb files to use letters, but then I guess we're limited to only 26 unique proteins. I'll give it some more thought to see if there might be a way around that. (And yes, I think the residue numbering works great!)

coiko commented 2 years ago

@KianBadie As a follow-up, I just looked at the documentation for PV and it says "For chains loaded from PDB, the chain names are alpha-numeric and no longer than one character." So I think I'll need to modify the pdb files to best use those limited possibilities and give some thought to the best way to handle redundancy. I'll keep you updated on any thoughts I have. Thanks!

KianBadie commented 2 years ago

@coiko Thank you for looking into the documentation! But I see, that is unfortunate. So it looks like we are limited to 61 proteins then. Are there any remedies I could apply on my end in the meantime? Or is everything bottlenecked by the fact that we are limited to such a small number of proteins?

KianBadie commented 2 years ago

@coiko I looked at the documentation you referenced. I'm guessing it is from this. Because of that, I did look more into the chain functionality. It looks like when I moved up the chain of the data representation of the structure, I arrived at an overall structure object. Oddly enough, it did report it had 18 chains for 3j31. From this discussion, it should be a lot more, right? I thought this was interesting because I thought there would be many more chains, but that the name would just be messed up.

Even more interesting is that it still said that the structure had 23088 atoms, which sounds like the big number I had in mind. Is any of this information relevant to what we are discussing? My apologies, I wish I could arrive at that conclusion myself, but I do not know that much about pdbs! I will continue looking to see if there is some way to derive different atom numbering based on what I found.

Edit: I actually don't know if that information was as useful as I thought. It looks like the issue lies in the number of represented chains, right? If that is the case, then it seems that only having 18 chains listed is the issue. For a moment, I was thinking the issue lay at the atom level, and I thought we could make some formula to get the atom number (something like chain number * atom number = actual atom number).
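
One way to cross-check the chain and atom counts outside the viewer is to read them straight from the pdb text. The sketch below relies on the PDB format's fixed columns, where ATOM and HETATM records carry a single-character chain ID in column 22. It is only an inspection aid under that assumption, not a description of how PV parses files, and the names `summarize_pdb` and `make_atom_line` are made up for this example.

```python
def summarize_pdb(pdb_text):
    """Return (chain IDs in first-seen order, atom count) for PDB-format text."""
    chain_ids = []
    atom_count = 0
    for line in pdb_text.splitlines():
        # Only coordinate records carry a chain ID in column 22.
        if line.startswith(("ATOM  ", "HETATM")) and len(line) >= 22:
            atom_count += 1
            chain_id = line[21]  # column 22, zero-based index 21
            if chain_id not in chain_ids:
                chain_ids.append(chain_id)
    return chain_ids, atom_count


def make_atom_line(serial, residue, chain, residue_number):
    """Build the first 26 columns of an ATOM record for a CA atom (demo data only)."""
    return f"ATOM  {serial:>5}  CA  {residue} {chain}{residue_number:>4}"


# Tiny made-up sample: two atoms in chain A and one in chain B.
sample = "\n".join([
    make_atom_line(1, "ARG", "A", 453),
    make_atom_line(2, "TYR", "B", 292),
    make_atom_line(3, "GLY", "A", 454),
])
chains, atoms = summarize_pdb(sample)
print(f"{len(chains)} chains, {atoms} atoms")
```

Running the same function on the text of a real file would show whether the file itself only distinguishes a small number of chain characters, which is what the 18-chain report suggests.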

coiko commented 2 years ago

@KianBadie Sorry for the slow reply! And yes, you're exactly right that all of the atoms are represented, they're just assigned to fewer chains than they're supposed to be (so what are actually different proteins are labeled as the same protein). And yes, that's exactly the documentation I was looking at. The best way forward is probably for me to optimize the chain numbering in the pdb file so PV pulls out as many chains as possible (up to 61). And then we can think about whether we should display the protein numbers in the hover-over ID feature, or just use them for the color-by-protein feature. Since we have a limited palette, people will expect to see colors reused without getting confused that orange over here and orange over there are the same protein, but they could be confused if both show up as "Protein 3". What do you think? (Also let me know if I didn't explain this well!)
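
As a rough sketch of what "optimizing the chain numbering" could look like, the Python below hands out the 61 single-character chain names (A-Z, a-z, 1-9, in the order used earlier in this thread) to distinct proteins and starts reusing them once they run out. The function name `assign_chain_symbols` and the `protein-N` identifiers are hypothetical; the real identifiers would come from the pdb file, and the wrap-around reuse is just one possible way to handle redundancy.

```python
import string

# 61 single-character chain names in the order used earlier in this thread.
CHAIN_SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + "123456789"


def assign_chain_symbols(protein_ids):
    """Map each distinct protein ID to one chain character, reusing when exhausted."""
    assignment = {}
    for protein_id in protein_ids:
        if protein_id not in assignment:
            # Wrap around to the first symbol after all 61 have been used.
            index = len(assignment) % len(CHAIN_SYMBOLS)
            assignment[protein_id] = CHAIN_SYMBOLS[index]
    return assignment


# Hypothetical identifiers; the real ones would come from the pdb file.
proteins = [f"protein-{n}" for n in range(1, 64)]
mapping = assign_chain_symbols(proteins)
print(mapping["protein-1"], mapping["protein-61"], mapping["protein-62"])
```

With more than 61 proteins, two different proteins end up sharing a chain character, which is why a hover label showing a protein number derived from that character could mislead, while reused colors would not.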

KianBadie commented 2 years ago

@coiko No worries! That's interesting and unfortunate what PV is doing. Is it in a way that makes sense? Or does it seem like PV viewer is malfunctioning because it was not designed for bigger PDBs? In addition, if it is not too inconvenient to check, does it look like RCSB's PDB viewer is doing the same thing? Or is theirs labeled correctly?

I think you explained that well. If I understood you correctly, you are saying that we might remove the protein labeling for the large PDBs? And that optimizing the files would mostly impact the coloring scheme? That is definitely an option and would be easy to implement on my side of things since it is just removing functionality for certain PDBs. With this particular issue, I'm not sure what the best course of action to take is since the lasting impact would be more so on the content (what content is displayed for the PDB) as opposed to the functionality/design of the interface. So I don't want to misrepresent what would be best for the target audience. From what you are saying though, it sounds like displaying nothing is better than displaying something that's incorrect. Is that correct?

coiko commented 2 years ago

Thanks @KianBadie! RCSB's PDB viewer (Mol*) is picking out the individual proteins correctly, which is interesting. They label them by type of protein (e.g. "coat protein" or "turret protein") and I think they're showing the chain IDs, but not in a way that's immediately obvious to me (or that would be particularly helpful for our users). So I think numbering the proteins might just be problematic, and we should probably stick to just trying to get the coloring to highlight individual proteins (by optimizing the pdb files), and removing the protein numbers from your labeling function (probably for all proteins, not just the big ones). That way, we still give the general sense of protein number and, exactly as you point out, we're not displaying something that's incorrect.

KianBadie commented 2 years ago

@coiko Sounds good! So would the residue/name still be displayed? Or did you want to remove the label altogether?

coiko commented 2 years ago

Thanks @KianBadie! And yes, we should definitely still display the residue numbers (which are correct!) and names.
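
For illustration, the agreed behavior (keep the residue name and number, drop the protein number) could be sketched in Python as below. This is not the code in the repository: the name `format_residue_label` and the "ARG 453" display format are assumptions, and it assumes PV-style labels of the form chain.residue.atom with a three-letter residue name.

```python
def format_residue_label(pv_label):
    """Turn "A.ARG453.CA" into "ARG 453", omitting the chain-derived protein number."""
    parts = pv_label.split(".")
    if len(parts) != 3:
        raise ValueError(f"Unrecognized label format: {pv_label!r}")
    residue = parts[1]
    # Assumed layout: three-letter residue name followed by the residue number.
    residue_name, residue_number = residue[:3], residue[3:]
    return f"{residue_name} {residue_number}"


print(format_residue_label("A.ARG453.CA"))
print(format_residue_label("B.TYR292.CA"))
```

Because the chain character is never read, the label stays correct even when several proteins share the same chain name.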

KianBadie commented 2 years ago

@coiko Sounds good! I will go ahead and implement that soon!

KianBadie commented 2 years ago

@coiko I just pushed the changes to remove the protein number from the label. It should be available soon.

coiko commented 2 years ago

Wonderful - thanks so much @KianBadie!

caltechlibrary / cell-atlas

Fixing mislabeling in protein viewer label #34