Rappsilber-Laboratory / AlphaLink2

AlphaLink2: Integrating crosslinking MS data into Uni-Fold-Multimer
Creative Commons Attribution 4.0 International
46 stars 14 forks source link

Feature.pkl.gz file not found #11

Open maxochondria opened 1 year ago

maxochondria commented 1 year ago

Thank you for developing such a promising software for the scientific community. I tried running Alphalink2 with several datasets, including the rpoa-rpoc dataset used in the bioRxiv manuscript but keep encountering the error below related to the creation of the feature.pgl.gz files. The file is either not created when running a monomeric prediction or only one of the file is created when running a multimeric prediction. I tried several iterations with different file names, fasta header names, crosslink file names, etc (no underscore, no dash, only A B C D... as fasta headers, etc ) but none of these changes solved the problem. I know that this issue has been raised and closed before but the previous fix doesn't seem to fix the problem in this case. Is there a specific naming convention I should adhere to?

I0911 22:57:46.547931 140125146163008 utils.py:40] Finished hmmbuild query in 1.798 seconds I0911 22:57:46.552916 140125146163008 hmmsearch.py:117] Launching sub-process ['/usr/conda/envs/alphalink/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/tmp/tmp1hjv21c0/output.sto', '/tmp/tmp1hjv21c0/query.hmm', '/home/anatale/alphafold2/alphafold_data//pdb_seqres/pdb_seqres.txt'] I0911 22:57:46.553389 140125146163008 utils.py:36] Started hmmsearch (pdb_seqres.txt) query I0911 22:58:13.020413 140125146163008 utils.py:40] Finished hmmsearch (pdb_seqres.txt) query in 26.467 seconds run_alphalink.2.sh: line 24: 29939 Killed $al_python $al_code_dir/unifold/homo_search.py --fasta_path=$fasta_path --max_template_date=$max_template_date --output_dir=$output_dir_base --uniref90_database_path=$database_dir/uniref90/uniref90.fasta --mgnify_database_path=$database_dir/mgnify/mgy_clusters_2022_05.fa --bfd_database_path=$database_dir/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --uniclust30_database_path=$database_dir/uniclust30/uniclust30_2018_08/uniclust30_2018_08 --uniprot_database_path=$database_dir/uniprot/uniprot.fasta --pdb_seqres_database_path=$database_dir/pdb_seqres/pdb_seqres.txt --template_mmcif_dir=$database_dir/pdb_mmcif/mmcif_files --obsolete_pdbs_path=$database_dir/pdb_mmcif/obsolete.dat --use_precomputed_msas=True Starting prediction... 2023-09-11 22:59:26.162401: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. start to load params /home/anatale/alphafold2/alphafold_data/AlphaLink-Multimer_SDA_v3.pt start to predict RpoaRpoc Traceback (most recent call last): File "/home/mlevasseur/AlphaLink2/inference.py", line 363, in main(args) File "/home/mlevasseur/AlphaLink2/inference.py", line 141, in main batch = load_feature_for_one_target( File "/home/mlevasseur/AlphaLink2/inference.py", line 74, in load_feature_for_onetarget batch, = load_and_process( File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 292, in load_and_process features, labels = load(load_kwargs, mode=mode, is_monomer=is_monomer) File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 194, in load all_chain_features = [ File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 195, in load_single_feature(s, monomer_feature_dir, mode, uniprot_msa_dir, is_monomer) File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 33, in wrapper return copy_lib.copy(cached_func(*args, *kwargs)) File "/home/mlevasseur/AlphaLink2/unifold/dataset.py", line 78, in load_single_feature monomer_feature = utils.load_pickle( File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 33, in wrapper return copy_lib.copy(cached_func(args, kwargs)) File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 67, in load_pickle ret = load(path) File "/home/mlevasseur/AlphaLink2/unifold/data/utils.py", line 64, in load with open_fn(path, "rb") as f: File "/usr/conda/envs/alphalink/lib/python3.10/gzip.py", line 58, in open binary_file = GzipFile(filename, gz_mode, compresslevel) File "/usr/conda/envs/alphalink/lib/python3.10/gzip.py", line 174, in init fileobj = self.myfileobj = builtins.open(filename, mode or 'rb') FileNotFoundError: [Errno 2] No such file or directory: '/home/mlevasseur/proteomics/XL/benchmark4/RpoaRpoc/B.feature.pkl.gz'

lhatsk commented 1 year ago

Hi, It's hard for me to tell what's going on. But it looks like the process that generates the features gets killed for some reason by your OS/ submission system: "run_alphalink.2.sh: line 24: 29939 Killed"

Do you maybe run out of memory?

maxochondria commented 1 year ago

Thanks for such a prompt reply! That indeed seems to be the problem. We ran another test which maxed out our 32GB of memory and crashed. How much memory do you recommend?

lhatsk commented 1 year ago

I honestly have no idea how much memory is generally required. I use a cluster and I'm fine with 80GB most of the time, for larger targets I increase it to 300GB. Maybe switching to the reduced_dbs setting in unifold/homo_search.py can limit the memory consumption. This setting requires the small_bfd database.

Bhaddad93 commented 3 months ago

Hi All, firstly thank you for creating alphalink2!

Unfortunately, I too am having problems with feature generation, and I have ensured all naming-schemes are consistent. The issue appears to be that it skips feature generation for 'seq_A', but completes for 'seq_B' and 'seq_C'..

input Fasta (seq.fasta):

A .... B .... C ....

chains.txt: 'A B C'

yet, I get the error that: a) chains.txt isn't found in outdir/seq/, (chains.txt exists in one folder above). b) when I copy chains.txt to outdir/seq/ I get the error that it cannot find A.features.pkl.gz, and indeed it doesn't exist, though B.features.pkl.gz and C.features.pkl.gz exist. I cannot understand why, presumably somewhere in unifold/homo_search.py, it is skipping the first fasta sequence, and going to the second.

+ run_alphalink.sh /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq.fasta /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/xlinks.patti.A2B.pkl.gz /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/ /home/bhaddad/alphalink/AlphaLink2/../AlphaLink-Multimer_SDA_v2.pt /home/bhaddad/alphafold/DataBases 2020-05-01 2024-06-21 12:02:00.660510: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. /home/bhaddad/miniconda3/envs/alphalink/lib/python3.10/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData. warnings.warn( I0621 12:02:29.087209 139711209805632 templates.py:945] Using precomputed obsolete pdbs /home/bhaddad/alphafold/DataBases/pdb_mmcif/obsolete.dat. I0621 12:02:29.132328 139711209805632 utils.py:74] Mapping multi-chains fasta with chain order: A B C I0621 12:02:29.137532 139711209805632 homo_search.py:161] searching homogeneous Sequences & structures for seq_B... I0621 12:02:29.143747 139711209805632 jackhmmer.py:140] Launching subprocess "/home/bhaddad/miniconda3/envs/alphalink/bin/jackhmmer -o /dev/null -A /tmp/tmp42vu50n8/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /home/bhaddad/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB/dir_0/seq_B.fasta /home/bhaddad/alphafold/DataBases/uniref90/uniref90.fasta" I0621 12:02:29.158784 139711209805632 utils.py:36] Started Jackhmmer (uniref90.fasta) query

lhatsk commented 3 months ago

Hi,

Thanks for reporting this! There was a bug with chains.txt that should be fixed now. Please try again (ideally with a fresh output directory). For the rest, it's hard for me to tell what's going on. Chains should only be skipped if they are homomeric but here it would skip B instead of A. Can you share the FASTA file? What does the directory look like?

Bhaddad93 commented 3 months ago

Thank you for such a quick response! I am currently performing a fresh install of AlphaLink2, and I will provide an update as soon as it is run!

Unfortunately the PI of the project would prefer I keep the sequences 'close to the vest' until publication (currently under review), but I can say with the utmost confidence that chains 'B' and 'C' are identical. What appears to be happening, is it skips 'A' for whatever reason, performs featurization for chain B, then simply copies the features of chain B for chain C (predictably).

Chains B/C are the longer of the two unique sequences, with 367 AA. (fake sequence added below)

seq.fasta: 1 >chainA 2 GCEIGFIPSPVKLENMRLQHEQRAKQHTPPYDVVPSMRPVVLFI 3 >chainB 4 PGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTF 5 >chainC 6 PGIADRMQKEITALAPSTMKIKIIAPPERKYSVWIGGSILASLSTF

lhatsk commented 3 months ago

Ok, no worries! Yes, the behaviour for B and C is expected. Strange about A, I will run a test.

Bhaddad93 commented 3 months ago

I will allow the job to continue to see how far it gets, however the problem appears to persist. For what it is worth, I believe the problem lies in unifold/msa/utils.py get_chain_id_map, I believe that 'mapping' is not being properly built, and that after line 30, the following should be added mapping[unique[seq]].append(chain_id) within the 'if not' statment... then again, when I tried this modification wasn't solved... so maybe it's not the right place to look. I believe this, because when I ask "/unifold/homo_search.py" to print 'chain_mapping' it returns {'B': [C]}, thus it is the only key which is fed into generate_pkl_features.

`(alphalink) Galvani [bhaddad@galvani ~/Lab_Files/AF2/For_Patti/AlphaLink2/AtoB]$ tail -f mpi-err.34171

lhatsk commented 3 months ago

Yeah, I know what the issue is. I pushed a fix. Sorry for the inconvenience! Hope this resolves everything...

Bhaddad93 commented 3 months ago

Indeed it appears to be working! Thank you very much!

Bhaddad93 commented 3 months ago

So a new problem has cropped up that seems perhaps unrelated. I thought to post it here, but also thought it may be appropriate as its own issue.

I am receiving an index-error

I0622 23:21:36.164898 139698480359232 utils.py:75] Mapping multi-chains fasta with chain order: A B C
I0622 23:21:36.170187 139698480359232 homo_search.py:161] searching homogeneous Sequences & structures for seq_A...
I0622 23:21:36.171731 139698480359232 homo_search.py:202] Final timings for seq_A: {}
I0622 23:21:36.174052 139698480359232 homo_search.py:161] searching homogeneous Sequences & structures for seq_B...
I0622 23:21:36.175636 139698480359232 homo_search.py:202] Final timings for seq_B: {}
2024-06-22 23:21:39.438556: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 363, in <module>
    main(args)
  File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 141, in main
    batch = load_feature_for_one_target(
  File "/home/bhaddad/alphalink/AlphaLink2/inference.py", line 74, in load_feature_for_one_target
    batch, _ = load_and_process(
  File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 292, in load_and_process
    features, labels = load(**load_kwargs, mode=mode, is_monomer=is_monomer)
  File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 217, in load
    xl = bin_crosslinks(tp,size)
  File "/home/bhaddad/alphalink/AlphaLink2/unifold/dataset.py", line 178, in bin_crosslinks
    xl_[r1,r2,0] = xl_[r2,r1,0] = torch.bucketize(1-fdr, buckets)
IndexError: index 1326 is out of bounds for axis 0 with size 1021
$ ls
dir_0  dir_1  INPUTS  jobscript_colabfold.sh  mpi-err.34172  mpi-err.34173  mpi-out.34172  mpi-out.34173  xlinks.patti.A2B.pkl.gz

$ ls dir_0
A                 A.timings.json    B                 B.timings.json    C.feature.pkl.gz   chains.txt        seq_A.fasta  seq_C.fasta
A.feature.pkl.gz  A.uniprot.pkl.gz  B.feature.pkl.gz  B.uniprot.pkl.gz  chain_id_map.json  C.uniprot.pkl.gz  seq_B.fasta  seq.fasta

$ ls dir_0/A
bfd_uniclust_hits.a3m  mgnify_hits.sto  pdb_hits.sto  uniprot_hits.sto  uniref90_hits.sto
run_alphalink.sh $MYCWD/$DIRBASE$i/${seqList[$i]}   \
                                         $MYCWD/xlinks.patti.A2B.pkl.gz     \
                     $MYCWD/$DIRBASE$i/             \
                     $AFL/../AlphaLink-Multimer_SDA_v2.pt   \
                     /home/bhaddad/alphafold/DataBases  \
                     2020-05-01 &
lhatsk commented 3 months ago

Hm, that would indicate a problem with the crosslinking data. The error says that the (mapped) crosslinking data overshoots the (combined) target length by more than 300 amino acids.

How long are your chains? Is the crosslinking data well aligned with the FASTA sequences? Ie no tags etc. chains.txt contains A B C, right?

Because I just noticed it: Your crosslinking pickle says A2B but your example above was A1B2, is the chain mapping off by chance?

Bhaddad93 commented 3 months ago

Indeed, this is a case of user-error!