TencentAI4S / HuDiff

Code for humanization of antibodies and nanobodies

# AbNatiV Installation #3

Open fanch1122 opened 4 weeks ago

fanch1122 commented 4 weeks ago

1.0 Issue

The problem occurred during the AbNatiV installation. I am not sure whether I need to install OpenMM from source, which conflicts with the Python 3.9 environment that HuDiff depends on.

2.0 Repro

mamba install -c conda-forge pdbfixer

Looking for: ['pdbfixer']

conda-forge/linux-64           No change
anaconda/pkgs/main/noarch      No change
anaconda/pkgs/r/linux-64       No change
conda-forge/noarch             No change
anaconda/pkgs/main/linux-64    No change
anaconda/pkgs/msys2/noarch     No change
anaconda/pkgs/r/noarch         No change
anaconda/pkgs/msys2/linux-64   No change

Pinned packages:

warning  libmamba Added empty dependency for problem type SOLVER_RULE_UPDATE
Could not solve for environment specs
The following packages are incompatible
├─ cuda-command-line-tools is installable with the potential options
│  ├─ cuda-command-line-tools 11.6.2 would require
│  │  └─ cuda-sanitizer-api >=11.6.124 with the potential options
│  │     ├─ cuda-sanitizer-api 12.0.90 would require
│  │     │  └─ cuda-version >=12.0,<12.1.0a0 , which requires
│  │     │     └─ cudatoolkit 12.0|12.0.* , which can be installed;
│  │     ├─ cuda-sanitizer-api [12.6.34|12.6.68|12.6.77] would require
│  │     │  └─ cuda-version >=12.6,<12.7.0a0 , which requires
│  │     │     └─ cudatoolkit 12.6|12.6.* , which can be installed;
│  │     ├─ cuda-sanitizer-api 12.1.105 would require
│  │     │  └─ cuda-version >=12.1,<12.2.0a0 , which requires
│  │     │     └─ cudatoolkit 12.1|12.1.* , which can be installed;
│  │     ├─ cuda-sanitizer-api 12.2.140 would require
│  │     │  └─ cuda-version >=12.2,<12.3.0a0 , which requires
│  │     │     └─ cudatoolkit 12.2|12.2.* , which can be installed;
│  │     ├─ cuda-sanitizer-api 12.3.101 would require
│  │     │  └─ cuda-version >=12.3,<12.4.0a0 , which requires
│  │     │     └─ cudatoolkit 12.3|12.3.* , which can be installed;
│  │     ├─ cuda-sanitizer-api [12.4.127|12.4.99] would require
│  │     │  └─ cuda-version >=12.4,<12.5.0a0 , which requires
│  │     │     └─ cudatoolkit 12.4|12.4.* , which can be installed;
│  │     └─ cuda-sanitizer-api [12.5.39|12.5.81] would require
│  │        └─ cuda-version >=12.5,<12.6.0a0 , which requires
│  │           └─ cudatoolkit 12.5|12.5.* , which can be installed;
│  └─ cuda-command-line-tools [12.0.0|12.1.1|...|12.6.2], which can be installed;
├─ cuda-gdb is installable with the potential options
│  ├─ cuda-gdb 12.0.90 would require
│  │  └─ cuda-version >=12.0,<12.1.0a0 , which can be installed (as previously explained);
│  ├─ cuda-gdb [12.6.37|12.6.68|12.6.77] would require
│  │  └─ cuda-version >=12.6,<12.7.0a0 , which can be installed (as previously explained);
│  ├─ cuda-gdb 12.1.105 would require
│  │  └─ cuda-version >=12.1,<12.2.0a0 , which can be installed (as previously explained);
│  ├─ cuda-gdb 12.2.140 would require
│  │  └─ cuda-version >=12.2,<12.3.0a0 , which can be installed (as previously explained);
│  ├─ cuda-gdb 12.3.101 would require
│  │  └─ cuda-version >=12.3,<12.4.0a0 , which can be installed (as previously explained);
│  ├─ cuda-gdb [12.4.127|12.4.99] would require
│  │  └─ cuda-version >=12.4,<12.5.0a0 , which can be installed (as previously explained);
│  └─ cuda-gdb [12.5.39|12.5.82] would require
│     └─ cuda-version >=12.5,<12.6.0a0 , which can be installed (as previously explained);
├─ cuda-sanitizer-api, which can be installed (as previously explained);
├─ pdbfixer is installable with the potential options
│  ├─ pdbfixer [1.7|1.8|1.8.1] would require
│  │  └─ openmm [>=7.5 |>=7.6 ] with the potential options
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0] would require
│  │     │  └─ python >=3.6,<3.7.0a0 , which can be installed;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0] would require
│  │     │  └─ python >=3.7,<3.8.0a0 , which can be installed;
│  │     ├─ openmm [7.5.0|7.5.1|...|8.1.2] would require
│  │     │  └─ python >=3.8,<3.9.0a0 , which can be installed;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0|8.0.0] would require
│  │     │  └─ cudatoolkit 10.2|10.2.* , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0] would require
│  │     │  └─ cudatoolkit 9.2|9.2.* , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0|8.0.0] would require
│  │     │  └─ cudatoolkit 11.0|11.0.* , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0] would require
│  │     │  └─ cudatoolkit 10.1|10.1.* , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0] would require
│  │     │  └─ cudatoolkit 10.0|10.0.* , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0|8.0.0] would require
│  │     │  └─ cudatoolkit 11.1|11.1.* , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.5.0|7.5.1|...|8.1.1] would require
│  │     │  └─ cudatoolkit [>=11.2,<12 |>=11.2,<12.0a0 ], which conflicts with any installable versions previously reported;
│  │     ├─ openmm [7.7.0|8.0.0|8.1.0|8.1.1|8.1.2] would require
│  │     │  └─ python >=3.10,<3.11.0a0 , which can be installed;
│  │     ├─ openmm [8.0.0|8.1.0|8.1.1|8.1.2] would require
│  │     │  └─ libcufft >=11.0.0.21,<12.0a0 , which can be installed;
│  │     ├─ openmm [8.0.0|8.1.0|8.1.1|8.1.2] would require
│  │     │  └─ python >=3.11,<3.12.0a0 , which can be installed;
│  │     ├─ openmm [8.0.0|8.1.0|8.1.1|8.1.2] would require
│  │     │  └─ cudatoolkit >=11.8,<12 , which conflicts with any installable versions previously reported;
│  │     ├─ openmm [8.1.0|8.1.1|8.1.2] would require
│  │     │  └─ python >=3.12,<3.13.0a0 , which can be installed;
│  │     └─ openmm 8.1.1 would require
│  │        └─ libcufft >=11.0.8.103,<12.0a0 , which can be installed;
│  └─ pdbfixer 1.9 would require
│     └─ openmm >=8.0 with the potential options
│        ├─ openmm [7.5.0|7.5.1|...|8.1.2], which can be installed (as previously explained);
│        ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0|8.0.0], which cannot be installed (as previously explained);
│        ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0|8.0.0], which cannot be installed (as previously explained);
│        ├─ openmm [7.5.0|7.5.1|7.6.0|7.7.0|8.0.0], which cannot be installed (as previously explained);
│        ├─ openmm [7.5.0|7.5.1|...|8.1.1], which cannot be installed (as previously explained);
│        ├─ openmm [7.7.0|8.0.0|8.1.0|8.1.1|8.1.2], which can be installed (as previously explained);
│        ├─ openmm [8.0.0|8.1.0|8.1.1|8.1.2], which can be installed (as previously explained);
│        ├─ openmm [8.0.0|8.1.0|8.1.1|8.1.2], which can be installed (as previously explained);
│        ├─ openmm [8.0.0|8.1.0|8.1.1|8.1.2], which cannot be installed (as previously explained);
│        ├─ openmm [8.1.0|8.1.1|8.1.2], which can be installed (as previously explained);
│        └─ openmm 8.1.1, which can be installed (as previously explained);
└─ pytorch-cuda is not installable because it requires
   ├─ cuda-command-line-tools >=11.6,<11.7 , which conflicts with any installable versions previously reported;
   └─ libcufft >=10.7.0.55,<10.7.2.50 , which conflicts with any installable versions previously reported.

waitma commented 4 weeks ago

For training, there's no need to install OpenMM from source. You only need the checkpoints, which we've included in our provided tar.gz file. Since AbNatiV recently updated their repository, the installation process may differ from previous versions, which could potentially be causing the issue.
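As a quick check (the path below is only a placeholder for wherever you unpack the tar.gz), you can confirm that the provided checkpoint already bundles the AbNatiV weights; the sampling code reads 'config', 'abnativ_params' and 'infilling_params' straight from the checkpoint rather than from a separate AbNatiV installation:

import torch

# Placeholder path: point this at the checkpoint extracted from the tar.gz.
ckpt = torch.load('checkpoints/hudiff_nb_finetune.pt', map_location='cpu')

# Listing the keys should show 'config', 'abnativ_params', 'infilling_params', ...
# which is what the sampling script loads the models from.
print(ckpt.keys())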

fanch1122 commented 4 weeks ago

> For training, there's no need to install OpenMM from source. You only need the checkpoints, which we've included in our provided tar.gz file. Since AbNatiV recently updated their repository, the installation process may differ from previous versions, which could potentially be causing the issue.

Yes, I tried uninstalling the conflicting packages and installing pdbfixer, but the Python 3.8 environment and the many dependencies required by AbNatiV are still a headache. Thank you for your great work and responses.

Best wishes~

fanch1122 commented 2 weeks ago

How should we fine-tune the nanobody model with our own data in order to make better use of hudiff_nb?

waitma commented 2 weeks ago

> How should we fine-tune the nanobody model with our own data in order to make better use of hudiff_nb?

To fine-tune the nanobody model with your own data for better use with hudiff_nb, start by reviewing our fine-tuning dataset for hudiff-nb. In our process, sequences were first aligned using IMGT and then realigned with AHo (AbNatiV). To match our data format, align your sequences twice using the same approach, and then you'll be ready to fine-tune hudiff-nb with your own dataset.

fanch1122 commented 2 weeks ago

> How should we fine-tune the nanobody model with our own data in order to make better use of hudiff_nb?

> To fine-tune the nanobody model with your own data for better use with hudiff_nb, start by reviewing our fine-tuning dataset for hudiff-nb. In our process, sequences were first aligned using IMGT and then realigned with AHo (AbNatiV). To match our data format, align your sequences twice using the same approach, and then you'll be ready to fine-tune hudiff-nb with your own dataset.

Regarding "sequences were first aligned using IMGT and then realigned with AHo (AbNatiV)": can this part of the code be provided? At the same time, I hope to use nanobody sequences outside the OAS database for fine-tuning. Can I match your format if I only have the raw nanobody sequences?

waitma commented 2 weeks ago

> How should we fine-tune the nanobody model with our own data in order to make better use of hudiff_nb?

> To fine-tune the nanobody model with your own data for better use with hudiff_nb, start by reviewing our fine-tuning dataset for hudiff-nb. In our process, sequences were first aligned using IMGT and then realigned with AHo (AbNatiV). To match our data format, align your sequences twice using the same approach, and then you'll be ready to fine-tune hudiff-nb with your own dataset.

> Regarding "sequences were first aligned using IMGT and then realigned with AHo (AbNatiV)": can this part of the code be provided? At the same time, I hope to use nanobody sequences outside the OAS database for fine-tuning. Can I match your format if I only have the raw nanobody sequences?

For the IMGT alignment, we directly used the results from the OAS database. If you need to start the alignment from scratch, abnumber can be used for this step. For realigning with AHo, you can find the code in the AbNatiV repository (see the relevant code section here). For sequences outside the OAS database, just follow our data format and they should be compatible.
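As a rough illustration only (this is not our exact preprocessing code; the helper name and the 149-position AHo frame are assumptions here), the two numbering steps can be sketched with abnumber for IMGT and ANARCI for AHo:

from abnumber import Chain   # IMGT numbering, the scheme OAS entries already use
from anarci import number    # ANARCI also implements the AHo scheme

def imgt_then_aho(seq):
    """Number one VHH sequence with IMGT, then lay it out on the AHo frame."""
    # Step 1: IMGT numbering (OAS already provides this for its entries).
    imgt_chain = Chain(seq, scheme='imgt')

    # Step 2: AHo renumbering; AbNatiV works on a fixed 149-position frame.
    numbering, _chain_type = number(seq, scheme='aho')
    if numbering is False:
        raise ValueError('ANARCI could not number this sequence')
    aho = ['-'] * 149
    for (pos, _ins), aa in numbering:
        if aa != '-' and pos <= 149:
            aho[pos - 1] = aa
    return imgt_chain.seq, ''.join(aho)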

fanch1122 commented 1 day ago

Does hudiff_nb provide a parallel humanization script for multiple nanobody sequences? For example, if I have more than 1,000 nanobody sequences to process, do I need to submit a separate SLURM job for each sequence (a large computational workload)? If resources allow, could multithreading or distributed computing be used to speed up the process?
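For example, one option I am considering is to shard the input FASTA first and give each shard its own SLURM array task or process (a rough sketch; the file names are only placeholders):

from pathlib import Path
from Bio import SeqIO

def split_fasta(fpath, n_chunks, out_dir='shards'):
    """Split one FASTA file into n_chunks smaller files, one per array task."""
    records = list(SeqIO.parse(fpath, 'fasta'))
    Path(out_dir).mkdir(exist_ok=True)
    chunk_size = -(-len(records) // n_chunks)   # ceiling division
    for i in range(n_chunks):
        chunk = records[i * chunk_size:(i + 1) * chunk_size]
        if chunk:
            SeqIO.write(chunk, f'{out_dir}/part_{i:03d}.fasta', 'fasta')

# e.g. split the pool into 20 shards, then run the humanization script
# once per shard instead of once per sequence.
split_fasta('all_nanobodies.fasta', n_chunks=20)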

fanch1122 commented 23 hours ago

> Does hudiff_nb provide a parallel humanization script for multiple nanobody sequences? For example, if I have more than 1,000 nanobody sequences to process, do I need to submit a separate SLURM job for each sequence (a large computational workload)? If resources allow, could multithreading or distributed computing be used to speed up the process?

multiple_for_nano.py

"""This script only considers nanobodies."""
import os.path

import numpy as np
import torch
from tqdm import tqdm
import argparse
import pandas as pd
from abnumber import Chain
from anarci import anarci, number
from copy import deepcopy
import re
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

from nanosample import (batch_input_element, save_nano, seqs_to_fasta,
                        compare_length, get_diff_region_aa_seq, get_pad_seq,
                        get_input_element, get_nano_line, out_humanization_df,
                        save_seq_to_fasta, split_fasta_for_save,
                        get_multi_model_state)
from utils.tokenizer import Tokenizer
from utils.train_utils import model_selected
from utils.misc import get_new_log_dir, get_logger, seed_all

# Finetune package
from model.nanoencoder.abnativ_model import AbNatiV_Model
from model.nanoencoder.model import NanoAntiTFNet


def get_all_nano_seqs_from_fasta(fpath):
    """Collect every record whose FASTA description marks it as a nanobody."""
    nano_sequences = {}
    sequences = SeqIO.parse(fpath, 'fasta')
    for seq in sequences:
        if 'Nanobody' in seq.description:
            nano_sequences[seq.description] = str(seq.seq)
    if not nano_sequences:
        raise ValueError("No Nanobody sequences found in the fasta file.")
    return nano_sequences


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="This program is designed to humanize non-human nanobodies.")
    parser.add_argument('--ckpt', type=str, default=None,
                        help='The ckpt path of the pretrained model.')
    parser.add_argument('--nano_complex_fasta', type=str, default=None,
                        help='fasta file of the nanobodies.')
    parser.add_argument('--batch_size', type=int, default=10,
                        help='the batch size of sampling.')
    parser.add_argument('--sample_number', type=int, default=100,
                        help='the total number of samples per sequence.')
    parser.add_argument('--seed', type=int, default=42)
    parser.add_argument('--sample_order', type=str, default='shuffle')
    parser.add_argument('--sample_method', type=str, default='gen', choices=['gen', 'rl_gen'])
    parser.add_argument('--length_limit', type=str, default='not_equal')
    parser.add_argument('--model', type=str, default='finetune_vh', choices=['pretrain', 'finetune_vh'])
    parser.add_argument('--fa_version', type=str, default='v_nano')
    parser.add_argument('--inpaint_sample', type=eval, default=True)
    parser.add_argument('--structure', type=eval, default=False)
    args = parser.parse_args()

    batch_size = args.batch_size
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # seed_all(args.seed)

    # Make sure the name of sample log.
    pdb_name = os.path.basename(args.nano_complex_fasta).split('.')[0]
    sample_tag = f'{pdb_name}_{args.model}_vhh'

    # log dir
    log_path = os.path.dirname(args.nano_complex_fasta)
    log_dir = get_new_log_dir(
        root=log_path,
        prefix=sample_tag
    )
    logger = get_logger('test', log_dir)

    # Here we specify the finetune model to generate the humanization seq.
    ckpt = torch.load(args.ckpt)
    config = ckpt['config']
    abnativ_state, _, infilling_state = get_multi_model_state(ckpt)
    # Abnativ model.
    hparams = ckpt['abnativ_params']
    abnativ_model = AbNatiV_Model(hparams)
    abnativ_model.load_state_dict(abnativ_state)
    abnativ_model.to(device)
    # infilling model.
    # infilling_params = config.model
    infilling_params = ckpt['infilling_params']
    infilling_model = NanoAntiTFNet(**infilling_params)
    infilling_model.load_state_dict(infilling_state)
    infilling_model.to(device)

    # Carefull!!! tmp
    config.model['equal_weight'] = True
    config.model['vhh_nativeness'] = False
    config.model['human_threshold'] = None
    config.model['human_all_seq'] = False
    config.model['temperature'] = False

    model_dict = {
        'abnativ': abnativ_model,
        'infilling': infilling_model,
        'target_infilling': infilling_model
    }
    framework_model = model_selected(config, pretrained_model=model_dict, tokenizer=Tokenizer())
    model = framework_model.infilling_pretrain
    model.eval()

    logger.info(args.ckpt)
    logger.info(args.seed)

    # Parse fasta file
    nano_sequences = get_all_nano_seqs_from_fasta(args.nano_complex_fasta)

    # Set up logging and result saving
    log_path = os.path.dirname(args.nano_complex_fasta)
    log_dir = get_new_log_dir(root=log_path, prefix='humanization_vhh')
    logger = get_logger('test', log_dir)

    save_fpath = os.path.join(log_dir, 'sample_humanization_result.csv')
    with open(save_fpath, 'w', encoding='UTF-8') as f:
        f.write('Specific,name,hseq\n')

    # Process each sequence
    for description, nano_chain in nano_sequences.items():
        logger.info(f'Processing nanobody: {description}')

        try:
            nano_pad_token, nano_pad_region, nano_loc, ms_tokenizer = batch_input_element(
                nano_chain,
                inpaint_sample=args.inpaint_sample,
                batch_size=batch_size
            )
        except Exception as e:
            logger.error(f'Error processing sequence {description}: {e}')
            continue

        sample_number = args.sample_number
        duplicated_set = set()

        while sample_number > 0:
            with torch.no_grad():
                for i in tqdm(nano_loc, total=len(nano_loc), desc=f'Humanizing {description}'):
                    nano_prediction = model(
                        nano_pad_token.to(device),
                        nano_pad_region.to(device),
                        H_chn_type=None
                    )
                    nano_pred = nano_prediction[:, i, :len(ms_tokenizer.toks)-1]
                    nano_soft = torch.nn.functional.softmax(nano_pred, dim=1)
                    nano_sample = torch.multinomial(nano_soft, num_samples=1)
                    nano_pad_token[:, i] = nano_sample.squeeze()

            nano_untokenized = [ms_tokenizer.idx2seq(s) for s in nano_pad_token]
            for g_h in nano_untokenized:
                if sample_number == 0:
                    break
                if g_h not in duplicated_set:
                    with open(save_fpath, 'a', encoding='UTF-8') as f:
                        f.write(f'humanization,{description},{g_h}\n')
                    duplicated_set.add(g_h)
                    sample_number -= 1
                    logger.info(f'Processed {args.sample_number - sample_number} samples for {description}')

    # Save results to fasta
    fasta_save_fpath = os.path.join(log_dir, 'sample_identity.fa')
    sample_df = pd.read_csv(save_fpath)
    sample_human_df = sample_df[sample_df['Specific'] == 'humanization'].reset_index(drop=True)
    seqs_to_fasta(sample_human_df, fasta_save_fpath, version=args.fa_version)

    logger.info(f'Results saved to {save_fpath} and {fasta_save_fpath}')

This modified script is not sufficient to quickly solve the problem of humanizing many nanobody sequences. Do you have a more practical method?