Closed jadolfbr closed 10 months ago
Thanks for using it!
You can exclude biounits that contain chains similar to a given PDB chain by using the --exclude_chains and --exclude_threshold arguments of proteinflow split.
So if you already have a dataset that is split into subsets, you can run proteinflow unsplit --tag {tag}, or if you want to download / generate a new one, run proteinflow download or proteinflow generate with the --skip_splitting flag.
And then run e.g. proteinflow split --tag {tag} --exclude_chains 7kgk-A --exclude_threshold 0.7 --ignore_existing to split the dataset again and exclude all biounits that contain chains that are more than 70% similar to 7kgk-A.
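For anyone scripting this, the two steps above could be assembled roughly as follows. This is only a dry-run sketch: the helper name resplit_commands and the tag "my_tag" are made up for illustration, and the only flags used are the ones quoted in this comment. It prints the commands rather than running them.

```python
import shlex


def resplit_commands(tag, chain, threshold, already_split=True):
    """Return the proteinflow commands described above, in order, as argv lists.

    Hypothetical helper: only the flags mentioned in this thread are used.
    """
    commands = []
    if already_split:
        # Merge the existing subsets back together before splitting again.
        commands.append(["proteinflow", "unsplit", "--tag", tag])
    commands.append([
        "proteinflow", "split", "--tag", tag,
        "--exclude_chains", chain,
        "--exclude_threshold", str(threshold),
        "--ignore_existing",
    ])
    return commands


if __name__ == "__main__":
    # "my_tag" is a placeholder; use the tag of your own dataset.
    for argv in resplit_commands("my_tag", "7kgk-A", 0.7):
        # To actually execute a command, pass argv to subprocess.run(argv, check=True).
        print(shlex.join(argv))
```

If the dataset was downloaded or generated with --skip_splitting, call the helper with already_split=False so that the unsplit step is left out.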
There's also the --exclude_clusters option to exclude whole clusters if one of those biounits belongs there and the --exclude_based_on_cdr option to only exclude particular CDR clusters (for SAbDab datasets).
The excluded files will be moved to the excluded folder at the same level as train, test and valid.
I have actually just pushed the latest version of the package (1.4.0) that deals with those files a bit better and updates the split dictionaries accordingly.
I mean, right now I have 11 sets of different sequences. They happen to be public, but they needn't be. I'm not sure how what you have shown would allow me to run it. I have FASTA sequences and no PDB IDs, as some of these don't have crystal structures, so AF models are used (which are also more complete structures, i.e. no missing loops, missing density, etc.).
If that's beyond scope, I understand, but it would certainly be useful as an input: just a FASTA file with a set of sequences. Many times this is what we have for training.
On Fri, Jun 23, 2023 at 1:25 PM Liza Kozlova @.***> wrote:
Closed #76 https://github.com/adaptyvbio/ProteinFlow/issues/76 as completed.
The current workaround is to look up your sequences in the PDB and find the closest homolog of each one (hopefully there is something with >90% sequence similarity to some of your sequences). Then you can use the homologs' PDB IDs with the --exclude_chains option to get a similar splitting outcome. This is not perfect and might not work in your case, but if you need to do this asap you can implement it this way. We plan to add support for excluding proteins by sequence in upcoming releases.
I'll reopen the issue so that we keep it in mind.
Thanks for the workaround, keeping it open, and the info about the future!
On Tue, Jun 27, 2023 at 12:44 PM Liza Kozlova @.***> wrote:
Reopened #76 https://github.com/adaptyvbio/ProteinFlow/issues/76.
So, this doesn't fix the issue. This just excludes sequences; the functionality I'm asking about is more about our own sequences/structures, especially from AF, i.e. many of these do not have PDB IDs.
With the new --exclude_chains_file option (#113) it's possible to exclude custom sequences (just put them in a text file, one line = one amino acid sequence, no PDB ID required). Is there something else we should add here, @jadolfbr?
Not everyone wants to use this on PDB structures. That's the major problem. Some need to use it with AlphaFold structures. Some projects don't need structures at all…
And there is currently no way to address either of these…
Alright, so just to make it very clear, right now structures are not required if you want to exclude files based on sequences. Any custom sequence can be added to the text file specified with --exclude_chains_file; no PDB format or structure of any kind is needed. If you want to exclude proteins that have sequences similar to "AAAAVWFAAA" or "DDDDDDRKRKRK", just create a text file (e.g. excluded.txt) that contains those two lines and pass this file as an option when splitting the data (--exclude_chains_file excluded.txt).
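Since the sequences in question are in FASTA format, a small conversion step may be needed to get the one-sequence-per-line file described above. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and not part of ProteinFlow, and it assumes the file should contain nothing but plain sequences, one per line.

```python
from pathlib import Path


def fasta_to_sequence_lines(fasta_text):
    """Parse FASTA text and return one sequence string per record.

    Header lines (starting with '>') are dropped and wrapped sequence
    lines are joined into a single string.
    """
    sequences, current = [], []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: flush the previous one, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def write_exclusion_file(fasta_path, out_path="excluded.txt"):
    """Write one sequence per line to out_path and return the record count."""
    sequences = fasta_to_sequence_lines(Path(fasta_path).read_text())
    Path(out_path).write_text("\n".join(sequences) + "\n")
    return len(sequences)
```

The resulting file can then be passed when splitting, as in --exclude_chains_file excluded.txt.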
We do not have support for generating new datasets using something other than SAbDab or PDB as the source, however, if that is what you mean. That kind of generalisation is harder to do.
Yes, that is what I mean. Not every UniProt sequence has a PDB entry, so being able to provide a custom dataset in FASTA format or something like that would be ideal. In addition, those sequences might be designs or sets of sequences/variants from experiments…
Really nice package! One thing I feel is missing is being able to split based on a set of sequences, for example, sequences that may have some biophysical properties one is trying to predict using ML methods.
I did not find a way to do this if it already exists.