chaidiscovery / chai-lab

Chai-1, SOTA model for biomolecular structure prediction
https://www.chaidiscovery.com

Can I provide an SDF file for the ligand instead of SMILES? #136

Open LanternsSea opened 4 weeks ago

LanternsSea commented 4 weeks ago

Hi, I am impressed by your achievements; this is fantastic work.

However, I am encountering a problem: the chirality of the ligand in the results seems to be incorrect. Is there a way to provide ligand information for prediction through an SDF file or another format instead of SMILES?

arogozhnikov commented 4 weeks ago

Do you have an example input where it fails to capture chirality?
Also, do you encode chirality in your SMILES input?
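
For readers who want a quick way to answer the second question, here is a minimal sketch that lists the bracket atoms in a SMILES string that carry a tetrahedral chirality tag (`@` or `@@`). It is a text-level check only: it does not verify that the encoded configuration is the intended one, and it ignores extended chirality classes. The helper name `chiral_tags` is hypothetical and is not part of chai-lab. The SMILES used is the ligand from the example posted in this thread.

```python
import re

# Bracket atoms such as [C@@H] or [C@] are where SMILES stores tetrahedral chirality.
BRACKET_ATOM = re.compile(r"\[([^\[\]]+)\]")


def chiral_tags(smiles: str) -> list[tuple[str, str]]:
    """Return (bracket_atom, tag) pairs for every bracket atom carrying '@' or '@@'."""
    tags = []
    for match in BRACKET_ATOM.finditer(smiles):
        atom = match.group(1)
        if "@@" in atom:
            tags.append((atom, "@@"))
        elif "@" in atom:
            tags.append((atom, "@"))
    return tags


# Ligand SMILES from the example in this thread.
bdq = "CN(C)CC[C@@](c1cccc2c1cccc2)([C@H](c3ccccc3)c4cc5cc(ccc5nc4OC)Br)O"
for atom, tag in chiral_tags(bdq):
    print(f"[{atom}] carries {tag}")
```

An empty result would mean the SMILES has no tetrahedral stereo tags at all, so no chirality was passed to the model. A non-empty result only shows that tags are present; checking that they match the intended stereoisomer would need a cheminformatics toolkit.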

LanternsSea commented 4 weeks ago

Here is the example. I took the SMILES from the RCSB PDB "Isomeric SMILES" field and checked that its chirality is correct, but the chirality in the result is wrong.

from pathlib import Path

import numpy as np
import torch

from chai_lab.chai1 import run_inference

# We use fasta-like format for inputs.
# - each entity encodes protein, ligand, RNA or DNA
# - each entity is labeled with unique name;
# - ligands are encoded with SMILES; modified residues encoded like AAA(SEP)AAA

# Example given below, just modify it

example_fasta = """
>protein|name=a
MTETILAAQIEVGEHHTATWLGMTVNTDTVLSTAIAGLIVIALAFYLRAKVTSTDVPGGVQLFFEAITIQM
RNQVESAIGMRIAPFVLPLAVTIFVFILISNWLAVLPVQYTDKHGHTTELLKSAAADINYVLALALFVFVC
YHTAGIWRRGIVGHPIKLLKGHVTLLAPINLVEEVAKPISLSLRLFGNIFAGGILVALIALFPPYIMWAPN
AIWKAFDLFVGAIQAFIFALLTILYFSQAMELEEEHH
>protein|name=c1
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c2
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c3
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c4
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c5
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c6
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c7
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>protein|name=c8
DPTIAAGALIGGGLIMAGGAIGAGIGDGVAGNALISGVARQPEAQGRLFTPFFITVGLVEAAYFINLAFM
ALFVFATPV
>ligand|name=bdq
CN(C)CC[C@@](c1cccc2c1cccc2)([C@H](c3ccccc3)c4cc5cc(ccc5nc4OC)Br)O
""".strip()

fasta_path = Path("./example_fasta")
fasta_path.write_text(example_fasta)

output_dir = Path("./outputs")

candidates = run_inference(
    fasta_file=fasta_path,
    output_dir=output_dir,
    # 'default' setup
    num_trunk_recycles=3,
    num_diffn_timesteps=200,
    seed=42,
    device=torch.device("cuda:0"),
    use_esm_embeddings=True,
)

cif_paths = candidates.cif_paths
# Aggregate ranking score per sample (kept under its own name so it is not overwritten below)
agg_scores = [rd.aggregate_score for rd in candidates.ranking_data]

# Load pTM, ipTM, pLDDTs and clash scores for sample 2
scores = np.load(output_dir.joinpath("scores.model_idx_2.npz"))

Hyunsub-Ji commented 3 weeks ago

Hi all, were you able to solve this problem?