Open maxochondria opened 1 year ago
Hi, It's hard for me to tell what's going on. But it looks like the process that generates the features gets killed for some reason by your OS/ submission system: "run_alphalink.2.sh: line 24: 29939 Killed"
Do you maybe run out of memory?
Thanks for such a prompt reply! That indeed seems to be the problem. We ran another test which maxed out our 32GB of memory and crashed. How much memory do you recommend?
I honestly have no idea how much memory is generally required. I use a cluster and I'm fine with 80GB most of the time, for larger targets I increase it to 300GB. Maybe switching to the reduced_dbs setting in unifold/homo_search.py can limit the memory consumption. This setting requires the small_bfd database.
Hi All, firstly thank you for creating alphalink2!
Unfortunately, I too am having problems with feature generation, and I have ensured all naming-schemes are consistent. The issue appears to be that it skips feature generation for 'seq_A', but completes for 'seq_B' and 'seq_C'..
input Fasta (seq.fasta):
A .... B .... C ....
chains.txt: 'A B C'
yet, I get the error that: a) chains.txt isn't found in outdir/seq/, (chains.txt exists in one folder above). b) when I copy chains.txt to outdir/seq/ I get the error that it cannot find A.features.pkl.gz, and indeed it doesn't exist, though B.features.pkl.gz and C.features.pkl.gz exist. I cannot understand why, presumably somewhere in unifold/homo_search.py, it is skipping the first fasta sequence, and going to the second.
+ run_alphalink.sh /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq.fasta /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/xlinks.patti.A2B.pkl.gz /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/ /home/bhaddad/alphalink/AlphaLink2/../AlphaLink-Multimer_SDA_v2.pt /home/bhaddad/alphafold/DataBases 2020-05-01 2024-06-21 12:02:00.660510: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. /home/bhaddad/miniconda3/envs/alphalink/lib/python3.10/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData. warnings.warn( I0621 12:02:29.087209 139711209805632 templates.py:945] Using precomputed obsolete pdbs /home/bhaddad/alphafold/DataBases/pdb_mmcif/obsolete.dat. I0621 12:02:29.132328 139711209805632 utils.py:74] Mapping multi-chains fasta with chain order: A B C I0621 12:02:29.137532 139711209805632 homo_search.py:161] searching homogeneous Sequences & structures for seq_B... I0621 12:02:29.143747 139711209805632 jackhmmer.py:140] Launching subprocess "/home/bhaddad/miniconda3/envs/alphalink/bin/jackhmmer -o /dev/null -A /tmp/tmp42vu50n8/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq_B.fasta /home/bhaddad/alphafold/DataBases/uniref90/uniref90.fasta" I0621 12:02:29.158784 139711209805632 utils.py:36] Started Jackhmmer (uniref90.fasta) query
Hi,
Thanks for reporting this! There was a bug with chains.txt that should be fixed now. Please try again (ideally with a fresh output directory). For the rest, it's hard for me to tell what's going on. Chains should only be skipped if they are homomeric but here it would skip B instead of A. Can you share the FASTA file? What does the directory look like?
Thank you for such a quick response! I am currently performing a fresh install of AlphaLink2, and I will provide an update as soon as it is run!
Unfortunately the PI of the project would prefer I keep the sequences 'close to the vest' until publication (currently under review), but I can say with the utmost confidence that chains 'B' and 'C' are identical. What appears to be happening, is it skips 'A' for whatever reason, performs featurization for chain B, then simply copies the features of chain B for chain C (predictably).
Chains B/C are the longer of the two unique sequences, with 367 AA. (fake sequence added below)
seq.fasta: 1 >chainA 2 GCEIGFIPSPVKLENMRLQHEQRAKQHTPPYDVVPSMRPVVLFI 3 >chainB 4 PGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTF 5 >chainC 6 PGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTF
Ok, no worries! Yes, the behaviour for B and C is expected. Strange about A, I will run a test.
I will allow the job to continue to see how far it gets, however the problem appears to persist. For what it is worth, I believe the problem lies in unifold/msa/utils.py get_chain_id_map
, I believe that 'mapping' is not being properly built, and that after line 30, the following should be added mapping[unique[seq]].append(chain_id)
within the 'if not' statment... then again, when I tried this modification wasn't solved... so maybe it's not the right place to look. I believe this, because when I ask "/unifold/homo_search.py" to print 'chain_mapping' it returns {'B': [C]}, thus it is the only key which is fed into generate_pkl_features
.
`(alphalink) Galvani [bhaddad@galvani ~/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB]$ tail -f mpi-err.34171
Yeah, I know what the issue is. I pushed a fix. Sorry for the inconvenience! Hope this resolves everything...
Indeed it appears to be working! Thank you very much!
So a new problem has cropped up that seems perhaps unrelated. I thought to post it here, but also thought it may be appropriate as its own issue.
I am receiving an index-error
I0622 23:21:36.164898 139698480359232 utils.py:75] Mapping multi-chains fasta with chain order: A B C
I0622 23:21:36.170187 139698480359232 homo_search.py:161] searching homogeneous Sequences & structures for seq_A...
I0622 23:21:36.171731 139698480359232 homo_search.py:202] Final timings for seq_A: {}
I0622 23:21:36.174052 139698480359232 homo_search.py:161] searching homogeneous Sequences & structures for seq_B...
I0622 23:21:36.175636 139698480359232 homo_search.py:202] Final timings for seq_B: {}
2024-06-22 23:21:39.438556: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 363, in <module>
main(args)
File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 141, in main
batch = load_feature_for_one_target(
File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 74, in load_feature_for_one_target
batch, _ = load_and_process(
File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 292, in load_and_process
features, labels = load(**load_kwargs, mode=mode, is_monomer=is_monomer)
File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 217, in load
xl = bin_crosslinks(tp,size)
File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 178, in bin_crosslinks
xl_[r1,r2,0] = xl_[r2,r1,0] = torch.bucketize(1-fdr, buckets)
IndexError: index 1326 is out of bounds for axis 0 with size 1021
$ ls
dir_0 dir_1 INPUTS jobscript_colabfold.sh mpi-err.34172 mpi-err.34173 mpi-out.34172 mpi-out.34173 xlinks.patti.A2B.pkl.gz
$ ls dir_0
A A.timings.json B B.timings.json C.feature.pkl.gz chains.txt seq_A.fasta seq_C.fasta
A.feature.pkl.gz A.uniprot.pkl.gz B.feature.pkl.gz B.uniprot.pkl.gz chain_id_map.json C.uniprot.pkl.gz seq_B.fasta seq.fasta
$ ls dir_0/A
bfd_uniclust_hits.a3m mgnify_hits.sto pdb_hits.sto uniprot_hits.sto uniref90_hits.sto
run_alphalink.sh $MYCWD/$DIRBASE$i/${seqList[$i]} \
$MYCWD/xlinks.patti.A2B.pkl.gz \
$MYCWD/$DIRBASE$i/ \
$AFL/../AlphaLink-Multimer_SDA_v2.pt \
/home/bhaddad/alphafold/DataBases \
2020-05-01 &
Hm, that would indicate a problem with the crosslinking data. The error says that the (mapped) crosslinking data overshoots the (combined) target length by more than 300 amino acids.
How long are your chains? Is the crosslinking data well aligned with the FASTA sequences? Ie no tags etc. chains.txt contains A B C, right?
Because I just noticed it: Your crosslinking pickle says A2B but your example above was A1B2, is the chain mapping off by chance?
Indeed, this is a case of user-error!
Thank you for developing such a promising software for the scientific community. I tried running Alphalink2 with several datasets, including the rpoa-rpoc dataset used in the bioRxiv manuscript but keep encountering the error below related to the creation of the feature.pgl.gz files. The file is either not created when running a monomeric prediction or only one of the file is created when running a multimeric prediction. I tried several iterations with different file names, fasta header names, crosslink file names, etc (no underscore, no dash, only A B C D... as fasta headers, etc ) but none of these changes solved the problem. I know that this issue has been raised and closed before but the previous fix doesn't seem to fix the problem in this case. Is there a specific naming convention I should adhere to?