adaptyvbio / ProteinFlow

Versatile computational pipeline for processing protein structure data for deep learning applications.
https://adaptyvbio.github.io/ProteinFlow/
BSD 3-Clause "New" or "Revised" License

Use of own sequences for splitting #76

Closed jadolfbr closed 10 months ago

jadolfbr commented 1 year ago

Really nice package! One thing I feel is missing is being able to split based on a set of sequences, for example, sequences that may have some biophysical properties one is trying to predict using ML methods.

If this already exists, I could not find a way to do it.

elkoz commented 1 year ago

Thanks for using it!

You can exclude biounits that contain chains similar to a chain from the PDB by using the --exclude_chains and --exclude_threshold arguments of proteinflow split.

So if you already have a dataset that is split into subsets, you can run proteinflow unsplit --tag {tag}, or if you want to download / generate a new one, run proteinflow download or proteinflow generate with the --skip_splitting flag. And then run e.g. proteinflow split --tag {tag} --exclude_chains 7kgk-A --exclude_threshold 0.7 --ignore_existing to split the dataset again and exclude all biounits that contain chains that are more than 70% similar to 7kgk-A.

There's also the --exclude_clusters option to exclude whole clusters if one of those biounits belongs there and --exclude_based_on_cdr to only exclude particular CDR clusters (for SAbDab datasets).

The excluded files will be moved to the excluded folder at the same level as train, test and valid.

I have actually just pushed the latest version of the package (1.4.0) that deals with those files a bit better and updates the split dictionaries accordingly.
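The re-splitting steps above can be sketched as a small Python helper that only assembles the two command lines and does not run them. The function name `resplit_commands` and the tag `my_dataset` are placeholders for illustration; the subcommands and flags are the ones quoted in this comment, and only a single chain is passed because the thread does not show how several chains would be given.

```python
import shlex


def resplit_commands(tag, chain, threshold=0.7):
    """Build the unsplit and split command lines without running them."""
    unsplit = ["proteinflow", "unsplit", "--tag", tag]
    split = [
        "proteinflow", "split",
        "--tag", tag,
        "--exclude_chains", chain,
        "--exclude_threshold", str(threshold),
        "--ignore_existing",
    ]
    # shlex.join quotes each argument so the result can be pasted into a shell.
    return [shlex.join(unsplit), shlex.join(split)]


# "my_dataset" is a placeholder tag; 7kgk-A and 0.7 come from the example above.
commands = resplit_commands("my_dataset", "7kgk-A")
for command in commands:
    print(command)
```

If the dataset was downloaded or generated with --skip_splitting, the unsplit step would presumably not be needed and only the second command would apply.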

jadolfbr commented 1 year ago

I mean, right now I have 11 sets of different sequences. They are public now, but needn’t be. I’m not sure how what you have shown would let me do this. I have FASTA sequences with no PDB IDs, because some of these don’t have crystal structures, so AF models are used instead. The AF models are also more complete structures (i.e. no missing loops, density, etc.).

If that’s beyond scope, I understand, but it would certainly be useful as an input: just a FASTA file with a set of sequences. Many times this is all we have for training.

On Fri, Jun 23, 2023 at 1:25 PM Liza Kozlova @.***> wrote:

Closed #76 https://github.com/adaptyvbio/ProteinFlow/issues/76 as completed.

danielnzg85 commented 1 year ago

The current workaround is to search the PDB for your sequences and find the closest homolog of each one (hopefully there is something with >90% sequence similarity to at least some of them). You can then pass the homologs' PDB IDs to the --exclude_chains option to get a similar splitting outcome. This is not perfect and might not work in your case, but if you need this ASAP you can implement it this way. We plan to add support for excluding proteins by sequence in upcoming releases.
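As a rough illustration of the homolog lookup described above, the sketch below picks the most similar candidate chain for a query sequence. It is a toy under stated assumptions: the chain IDs and sequences are made up, the candidates are held in a plain dictionary instead of coming from a real PDB search (which would normally use a tool such as BLAST or MMseqs2), and `difflib.SequenceMatcher.ratio()` is only a crude stand-in for alignment-based sequence identity. It is not how ProteinFlow computes similarity.

```python
from difflib import SequenceMatcher


def closest_homolog(query, candidates):
    """Return (chain_id, score) of the candidate most similar to the query."""
    def score(item):
        # ratio() is a crude similarity in [0, 1], not a true alignment identity.
        return SequenceMatcher(None, query, item[1], autojunk=False).ratio()

    best_id, best_seq = max(candidates.items(), key=score)
    return best_id, score((best_id, best_seq))


# Toy data: the chain IDs and sequences below are made up for illustration.
candidates = {
    "cand1-A": "MKTAYIAKQK",
    "cand2-A": "GGGGSGGGGS",
}
best_id, best_score = closest_homolog("MKTAYIAKQR", candidates)
print(best_id, round(best_score, 2))
```

If the best score clears whatever cutoff is judged acceptable, the corresponding chain ID is the one that would be handed to --exclude_chains.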

elkoz commented 1 year ago

I'll reopen the issue so that we keep it in mind.

jadolfbr commented 1 year ago

Thanks for the workaround, keeping it open, and the info about the future!

On Tue, Jun 27, 2023 at 12:44 PM Liza Kozlova @.***> wrote:

Reopened #76 https://github.com/adaptyvbio/ProteinFlow/issues/76.

jadolfbr commented 9 months ago

So, this doesn't fix the issue. This just excludes sequences, whereas the functionality I am asking for is more about using our own sequences/structures, especially from AF, i.e. many of these do not have PDB IDs.

elkoz commented 9 months ago

With the new --exclude_chains_file option (#113) it's possible to exclude custom sequences (just put them in a text file, one line = one amino acid sequence, no PDB ID required). Is there something else we should add here @jadolfbr?

jadolfbr commented 9 months ago

Not everyone wants to use this on PDB structures. That’s the major problem. Some need to use it with AlphaFold structures. Some projects don’t need structures at all…

And there is currently no way to address either of these…

elkoz commented 9 months ago

Alright, so just to make it very clear, right now structures are not required if you want to exclude files based on sequences. Any custom sequence can be added to the text file specified with --exclude_chains_file, no PDB format or structure of any kind is needed. If you want to exclude proteins that have sequences similar to "AAAAVWFAAA" or "DDDDDDRKRKRK", just create a text file (e.g. excluded.txt) that contains those two lines and pass this file as an option when splitting the data (--exclude_chains_file excluded.txt).

However, we do not support generating new datasets from a source other than SAbDab or the PDB, if that is what you mean. That kind of generalisation is harder to do.
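For sequences that start out in FASTA format, a minimal sketch of producing the one-sequence-per-line text file described above could look like the following. The helper names `fasta_to_sequences` and `write_exclude_file` and the record names `design_1` and `design_2` are hypothetical; the two sequences and the excluded.txt file name are the ones used in this comment, and the example writes to a temporary directory only to stay self-contained.

```python
import tempfile
from pathlib import Path


def fasta_to_sequences(fasta_text):
    """Collapse FASTA records into a list of plain amino acid sequences."""
    sequences, current = [], []
    for raw in fasta_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if there was one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def write_exclude_file(path, sequences):
    """Write one sequence per line, with no headers or IDs."""
    Path(path).write_text("\n".join(sequences) + "\n")


# Hypothetical record names; a sequence may be wrapped over several lines.
fasta_example = ">design_1\nAAAAVWF\nAAA\n>design_2\nDDDDDDRKRKRK\n"
sequences = fasta_to_sequences(fasta_example)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "excluded.txt"
    write_exclude_file(out_path, sequences)
    written = out_path.read_text()
print(written, end="")
```

The resulting file would then be passed when splitting, as in --exclude_chains_file excluded.txt.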

jadolfbr commented 9 months ago

Yes, that is what I mean. Not every UniProt sequence has a PDB entry, so being able to provide a custom dataset in FASTA format or something like that would be ideal. In addition, those sequences might be designs, or sets of sequences/variants from experiments…
