baker-laboratory / RoseTTAFold-All-Atom

Small fasta sequences cause error #42

Open tydingcw opened 5 months ago

tydingcw commented 5 months ago

Prediction with small fasta sequences causes errors. I think this may be due to not finding a matching template.

test.fasta:

  IALAVAALF

  Running PSIPRED
  Running hhsearch
  Error executing job with overrides: []
  Traceback (most recent call last):
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 206, in main
      runner.infer()
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 153, in infer
      self.parse_inference_config()
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/run_inference.py", line 46, in parse_inference_config
      protein_input = generate_msa_and_load_protein(
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/data/protein.py", line 93, in generate_msa_and_load_protein
      return load_protein(str(msa_file), str(hhr_file), str(atab_file), model_runner)
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/data/protein.py", line 66, in load_protein
      xyz_t, t1d, maskt, = get_templates(
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/data/protein.py", line 30, in get_templates
      ) = parse_templates_raw(ffdb, hhr_fn=hhr_fn, atab_fn=atab_fn)
    File "/home/tydingcw/git_repos/RoseTTAFold-All-Atom/rf2aa/data/parsers.py", line 684, in parse_templates_raw
      xyz = np.vstack(xyz).astype(np.float32)
    File "/home/tydingcw/mambaforge/envs/RFAA/lib/python3.10/site-packages/numpy/core/shape_base.py", line 289, in vstack
      return _nx.concatenate(arrs, 0, dtype=dtype, casting=casting)
  ValueError: need at least one array to concatenate
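
The last frame of the traceback can be reproduced outside RF-AA. This is a minimal sketch that assumes only NumPy; the empty list `xyz` is a stand-in for the per-template coordinate list when hhsearch returns no hits, not the actual RF-AA data.

```python
import numpy as np

# Stand-in (assumption) for the list of per-template arrays when no hits are found.
xyz = []

try:
    # np.vstack has nothing to concatenate when the list is empty.
    np.vstack(xyz)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If this prints a ValueError message rather than `None`, the failure comes from the empty list itself and not from the FASTA file contents.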

teemuronkko commented 4 months ago

Hey! I ran into the same issue with small peptide sequences for which no template hits were found. I added a check in data/parsers.py that tests whether the lists storing the different template attributes are empty. If they are, I create empty numpy arrays with the correct dimensions instead, because the error comes from np.vstack, which doesn't work on empty lists. With this fix, I was able to run complex predictions using RF-AA even for very short peptides (even just 3 amino acids).

So, in data/parsers.py, I modified the function parse_templates_raw (lines 684 to 690) to include the following:

  if len(ids) > 0:
      # At least one template hit: stack the per-template lists as before.
      xyz = np.vstack(xyz).astype(np.float32)
      mask = np.vstack(mask).astype(bool)
      qmap = np.vstack(qmap).astype(np.int64)
      f0d = np.vstack(f0d).astype(np.float32)
      f1d = np.vstack(f1d).astype(np.float32)
      seq = np.hstack(seq).astype(np.int64)
  else:
      # No template hits: np.vstack/np.hstack fail on empty lists,
      # so create empty arrays with the same dtypes instead.
      xyz = np.empty((0, 3), dtype=np.float32)
      mask = np.empty(0, dtype=bool)
      qmap = np.empty(0, dtype=np.int64)
      f0d = np.empty(0, dtype=np.float32)
      f1d = np.empty(0, dtype=np.float32)
      seq = np.empty(0, dtype=np.int64)

Hope this helps!
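
The same guard can be factored into a small helper so that each attribute is handled in one line. This is only a sketch: `stack_or_empty` is a hypothetical name that does not exist in RF-AA, and the empty shapes passed in are the ones from the snippet above, not values confirmed against the rest of the pipeline.

```python
import numpy as np

def stack_or_empty(arrays, empty_shape, dtype, stack=np.vstack):
    """Stack a list of arrays, or return an empty array if the list is empty.

    Hypothetical helper (not part of RF-AA). `empty_shape` is the shape to
    use when there is nothing to stack.
    """
    if len(arrays) > 0:
        return stack(arrays).astype(dtype)
    return np.empty(empty_shape, dtype=dtype)

# No template hits: empty arrays instead of a ValueError.
xyz = stack_or_empty([], (0, 3), np.float32)
seq = stack_or_empty([], (0,), np.int64, stack=np.hstack)

# With hits, the helper behaves like the original np.vstack call.
stacked = stack_or_empty([np.zeros((2, 3)), np.ones((1, 3))], (0, 3), np.float32)
```

Inside parse_templates_raw this would be called once per attribute (xyz, mask, qmap, f0d, f1d, seq), with np.hstack passed for seq as in the original code.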

Guanyueweiyang commented 1 month ago

@teemuronkko After modifying data/parsers.py as described, I predicted tripeptides both on their own and as part of a protein-tripeptide complex. In both cases the residues of the tripeptide were not linked together; the result showed three independent amino acids. Is there any way to fix this?