Build a pipeline to map mutations to protein 3D structures

jjgao commented 8 years ago

Background: Protein 3D structures are very useful when studying the effect of mutations in cancer. For example, mutations often cluster in the same region of 3D space. In cBioPortal, we map cancer mutations onto protein structures and visualize them.

Goal:

Build sequence alignments of all human proteins (including all their isoforms) against PDB structures.
Build an automated pipeline to update protein structure data in cBioPortal.

Approach:

Use BLAST (or other sequence alignment tools) to search human proteins against PDB structures.
Save the alignments into a database.
Develop a JSON-based API to expose the alignment data.
Build a pipeline to periodically check PDB for new structures.
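
A minimal sketch of the first three steps, assuming BLAST+ is run with tabular output (`-outfmt 6`, default columns) and the hits are kept in SQLite. The table name `alignment` and the function names are hypothetical, chosen only for illustration; the real portal database and API may look quite different.

```python
import json
import sqlite3

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_blast_tabular(text):
    """Yield one dict per hit line of BLAST tabular output."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        row = dict(zip(BLAST_COLUMNS, line.split("\t")))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        for key in FLOAT_COLUMNS:
            row[key] = float(row[key])
        yield row


def store_alignments(conn, rows):
    """Insert parsed hits into a (hypothetical) alignment table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS alignment ("
        "qseqid TEXT, sseqid TEXT, pident REAL, length INTEGER, "
        "mismatch INTEGER, gapopen INTEGER, qstart INTEGER, qend INTEGER, "
        "sstart INTEGER, send INTEGER, evalue REAL, bitscore REAL)"
    )
    conn.executemany(
        "INSERT INTO alignment VALUES (:qseqid, :sseqid, :pident, :length, "
        ":mismatch, :gapopen, :qstart, :qend, :sstart, :send, :evalue, :bitscore)",
        rows,
    )
    conn.commit()


def alignments_as_json(conn, protein_id):
    """Return all stored hits for one protein as a JSON array, best score first."""
    cur = conn.execute(
        "SELECT * FROM alignment WHERE qseqid = ? ORDER BY bitscore DESC",
        (protein_id,),
    )
    names = [col[0] for col in cur.description]
    return json.dumps([dict(zip(names, record)) for record in cur])
```

A web service could return the string from `alignments_as_json` directly as the response body for a given protein identifier.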

What we currently have:

We have aligned all human proteins in UniProt to PDB, but this relies on another in-house pipeline that is not easy to update, which is why the data is out of date.
We have put the data into the portal database and visualize it on the page, e.g. click 3D structure in this page.

Skills needed: Bioinformatics, Java, JSON

Possible mentors: JJ Gao, Onur Sumer, Benjamin Gross, Angelica Ochoa

dilshankanchana commented 8 years ago

Hi, I am interested in contributing to this project. I'm a 3rd year Computer Engineering undergraduate student from the University of Moratuwa. I would be grateful if you could advise me on how to proceed with this project. Thank you.

apurvgandhwani commented 8 years ago

Hi, I am a 4th year undergrad from the Indian Institute of Technology, Bombay (IIT Bombay). I am interested in contributing to this project. I have been digging into NCBI for some time to get better insight into what is intended for this project. As far as I understand, we want to extract protein data (for now let's call it data) from NCBI (BLAST) and align it against similar structures in the Protein Data Bank. This will be saved in NexusDB. Updating the data will be another aspect. Please correct me if I am wrong somewhere. I would like to get started working on it. Can someone please explain the current process we follow for this task (if there is one)? It would also be great if I could get a reference to any existing work on it (if any).

p-sun commented 8 years ago

Hi! I'm on my second-last semester for a Microbiology and Immunology Liberal and a Computer Science Minor at McGill University. I have written a proposal for idea #17 and sent it to JianJiong Gao on LinkedIn. Could you let me know if there are preferred methods of contact? Thank you very much!

jjgao commented 8 years ago

@dilshankanchana We would require some bioinformatics skills in this project.

jjgao commented 8 years ago

@apurvgandhwani we would like to map all human proteins in Ensembl to PDB and get all the alignments stored and provide the data through a web service. Please try to identify challenges in the task and break it down to sub-tasks with a detailed implementation plan.

jjgao commented 8 years ago

@p-sun please send an email to the discussion group (without including your proposal link). I'll get back to you from there.

apurvgandhwani commented 8 years ago

@jjgao Thanks. That would help. I will soon come back to you with a structured plan and the full task breakdown for your further suggestions.

apurvgandhwani commented 8 years ago

@jjgao As of now it is pretty much clear to me what we need to achieve in this project. I have some general doubts regarding the "why" part of the project, such as:

Who will be the end users of the project?
Why do we need to store aligned data when there are tools (like Clustal Omega, MUSCLE, BLAST etc.) which can be used to align any protein on the go?

I was also trying to find out how we could use NexusDB in this project. I came to know about the ISCA consortium data in NexusDB (recently released), which could be quite useful in bringing some add-ons (not exactly sure about specifics).

jjgao commented 8 years ago

@apurvgandhwani

1) The end users are the cBioPortal users -- biologists who analyze mutation data in cancer patients. 2) For performance reasons, we cannot BLAST a protein against the whole PDB database on the fly when a user is querying a gene on cBioPortal.

Don't worry about NexusDB. I mentioned Genome Nexus because that's the project we are developing to provide annotations for mutations.

I have also updated the description to include what we currently have.

apurvgandhwani commented 8 years ago

@jjgao I started fetching data from BLAST using its API and BioJava. I am getting XML data, which I am parsing and storing. Is it fine if I start BLASTing all the human protein data and storing the alignments in a local database, as this will be the first task for the project anyhow?

jjgao commented 8 years ago

@apurvgandhwani sounds good.

apurvgandhwani commented 8 years ago

@jjgao Cool. Then I am working on completing a SQL database. Thanks

apurvgandhwani commented 8 years ago

@jjgao How should I send you the draft of my proposal for review?

apurvgandhwani commented 8 years ago

@jjgao After thinking through the options for building a pipeline to update the database periodically, I came up with the following two options:
i) Either we BLAST each protein from Ensembl against PDB periodically (say every week, as PDB releases newly added sequences every week) and write logic to fetch the new alignments and append them to our database. This would be a repetitive process with unnecessary iterations.
ii) Or we download the .tz file of new sequences which PDB itself makes available. Fetching the protein sequences from this text file (by running a script) and then aligning these new sequences to Ensembl proteins using Clustal Omega will give us all the new alignments. We can then store these in our alignment database along with the parameters we need.
I will explain everything in detail in the proposal as well. I am more inclined to use the second approach, as it reduces the number of iterations and hence improves performance. What's your take on this?
I will let you know if something better comes up.
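
A rough sketch of the second option's first step, assuming the downloaded sequence file is FASTA-formatted text and that the first token of each header line is the chain identifier. The names `read_fasta`, `new_chains` and `known_ids` are hypothetical; `known_ids` stands for whatever set of chain identifiers is already in the alignment database.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            parts = line[1:].split()
            # Assumption: the identifier is the first whitespace-separated token.
            header, chunks = (parts[0] if parts else ""), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def new_chains(fasta_lines, known_ids):
    """Return {identifier: sequence} for chains not yet in the database."""
    return {
        chain_id: sequence
        for chain_id, sequence in read_fasta(fasta_lines)
        if chain_id not in known_ids
    }
```

Only the sequences returned by `new_chains` would then need to be aligned against the Ensembl proteins. If the file is gzip-compressed, it could be opened with `gzip.open(path, "rt")` and passed in as `fasta_lines`.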

apurvgandhwani commented 8 years ago

@jjgao I have submitted my proposal. Looking forward to your feedback and suggestions.

sheridancbio commented 8 years ago

I have recently joined this channel, as a potential mentor for this project. I am somewhat new to the cBioPortal project team, but I have worked on protein structure projects before including sequence mapping between sequence databases and PDB structure file sequences.

Looking at Ensembl, there are releases every few months (Sep30, Dec8, Mar10). When moving to a new release of a sequence database there are several tricky issues to consider (identifier changes, "merger" of sequences in the database, dropping of invalidated sequences) in addition to the addition of new sequences. Since releases of the sequence database are not too frequent, I think (someone correct me if I am wrong) that we will reset/rebuild the alignments (mappings) of all sequences at the time that we move to a new version of Ensembl.

PDB changes also occur, more frequently and at a finer grain. There will be some new .pdb files added to rcsb.org, but sometimes .pdb structure files are also dropped, and sometimes they are updated. So I think that we would want to periodically identify changes in these three categories and adjust our database of alignments accordingly (dropping alignments to eliminated .pdb structures, finding alignments to added .pdb structures, and updating alignments to updated .pdb structures).
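
The three categories can be illustrated with a small sketch, assuming we keep a snapshot that maps each PDB identifier to some revision marker (for example a last-modified date). The snapshot format and the name `diff_snapshots` are assumptions made for illustration, not an existing RCSB interface.

```python
def diff_snapshots(previous, current):
    """Compare two {pdb_id: revision_marker} snapshots.

    Returns the identifiers that were added, removed, or updated
    (present in both snapshots but with a different revision marker).
    """
    prev_ids, cur_ids = set(previous), set(current)
    return {
        "added": sorted(cur_ids - prev_ids),
        "removed": sorted(prev_ids - cur_ids),
        "updated": sorted(
            pdb_id for pdb_id in prev_ids & cur_ids
            if previous[pdb_id] != current[pdb_id]
        ),
    }
```

Alignments would then be created for `added`, dropped for `removed`, and recomputed for `updated`.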

At a minimum, we would want to create additional alignments to .pdb structures which are newly released. If we let the alignments to deleted .pdb files stay in until the next Ensembl update, then we would want to keep a local copy of the .pdb structure deleted from RCSB for display.

Whether we want to keep a "complete mapping" of every human Ensembl sequence is perhaps up for debate ... we could also consider creating alignments at the time of need. But alignment programs sometimes take time to execute, and we would need to have processing power on-call to run those alignments quickly if we wanted the website to be responsive enough. We also probably would want to begin the alignment process prospectively at the earliest indication (before the user was navigating to a structure view).

These are just my thoughts .. to try to help elaborate some of the issues.

apurvgandhwani commented 8 years ago

@sheridancbio Thanks for pointing out some very important points regarding the database and new sequences. I have been trying to figure out ways to incorporate the new alignments, which we need to update periodically, as I have already started storing alignments in a local database. I have explained them in detail in my proposal, which I have uploaded for review. RCSB releases new protein entries every Saturday, which are available on rcsb.org. Therefore we will have to update the data every week. And as you said, for Ensembl we will also have to keep the data in sync with the latest release.

I will incorporate the things which I have missed in my draft proposal by latest. Kindly review it. Thanks.

sheridancbio commented 8 years ago

Another thought about this project: there will often be cases where the sequence present in the PDB structure file is missing large pieces of the complete protein sequence. The PDB structure may contain one domain from a multi-domain protein. There may also be gaps / missing residues where the protein was not experimentally measured. And there are ambiguities (such as alternative conformations .. see AltLoc description in the ATOM record type) as well which may need to be handled. This is just to make people aware that the mapping / blasting step may involve solving or working around some tricky issues.
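
To make the gap issue concrete, here is a small sketch of turning one gapped pairwise alignment into a position map, assuming the aligner reports two equal-length aligned strings with `-` for gaps plus 1-based start positions. The function name `map_residues` is hypothetical. Note that the mapped positions index into the aligned PDB chain sequence; translating them to the residue numbers used in ATOM records, and handling missing residues or alternative conformations, would still be a separate step.

```python
def map_residues(query_aln, subject_aln, query_start, subject_start):
    """Map protein positions to PDB chain sequence positions.

    query_aln / subject_aln: aligned strings of equal length, '-' marks a gap.
    query_start / subject_start: 1-based position of the first aligned residue.
    Positions aligned to a gap are left out of the mapping.
    """
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")
    mapping = {}
    q_pos, s_pos = query_start, subject_start
    for q_char, s_char in zip(query_aln, subject_aln):
        if q_char != "-" and s_char != "-":
            mapping[q_pos] = s_pos
        if q_char != "-":
            q_pos += 1
        if s_char != "-":
            s_pos += 1
    return mapping


# Made-up alignment: protein position 12 falls opposite a gap in the structure.
example = map_residues("MKT-AY", "MK-LAY", query_start=10, subject_start=1)
print(example)          # {10: 1, 11: 2, 13: 4, 14: 5}
print(example.get(12))  # None
```

A mutation at a protein position that is absent from the map simply cannot be shown on that structure.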

apurvgandhwani commented 8 years ago

@jjgao @sheridancbio Can you please give me feedback on where I went wrong? Honestly, this came as a shock to me. It would be really helpful if you could point out my mistakes.

StefanGIT commented 8 years ago

We do have a tool to visualize UniProt features or user-annotated features on a PDB structure. This works in cases where a mapping (SIFTS) between PDB and UniProt exists. Would this viewer be helpful? http://prosat.h-its.org/

Build a pipeline to map mutations to protein 3D structures #17