Could not match sequence in alignment to sequence from structure

wt12318 commented 1 year ago

Hi,

Sorry to bother again. I used the following code to do template search:

import homelette as hm
print("####read fasta file#####")
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta("fasta/CD19.fa")
print("#####search template######")
gen.get_suggestion(database_dir="/home/hhsuite_dbs/", n_threads=4)
temp = gen.show_suggestion()
target = "CD19"
gen.select_templates(list(temp.template)[0:9])

gen.alignment.print_clustal(70)

CD19        DIQMTQTTSSLSASLGDRVTISCRASQDIS----KYLNWYQQKPDGTVKLLIYHTSRLHSGVPSRFSGSG
6LFW_A      -IQMTQTASSLSASLGDRVTISCRASQYIN----NYLNWYQQKPDGTVTLLIYYTSILHSGVPSRFIGSG
6LFX_A      -IQMTQTASSLSASLGDRVTISCRASQYIN----NYLNWYQQKPDGTVTLLIYYTSILHSGVPSRFIGSG
3ESU_F      --VLIQSTSSLSASLGDRVTISCRASQDIR----NYLNWYQQKPDGTVKLLIYYTSRLQSGVPSRFSGSG
3ESV_G      --QMTQTTSSLSASLGDRVTVSCRASQDIR----NYLNWYQQKPDGTVKFLIYYTSRLQPGVPSRFSGSG
3AUV_D      --QMTQSPSSLSASVGDRVTITCRASQDVS----TAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSG
3AUV_E      --QMTQSPSSLSASVGDRVTITCRASQDVS----TAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSG
6TCS_A      -IQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSG
5OGI_B      -IVMTQSPSSLSASVGDRVTITCKASQSVT----NDAAWYQKKPGKAPKLLIYQASTRYTGVPSRFSGSG
6EQC_D      -IVMTQSPSSLSASVGDRVTITCKASQSVT----NDAAWYQKKPGKAPKLLIYQASTRYTGVPSRFSGSG

CD19        SGTDYSLTISNLEQEDIATYFCQQGNTLPYTFGGGTKLEITGSTSGSGKPGS-----GEGSTKGEVKLQE
6LFW_A      SGTDYSLTISNLDQEDIATYFCQQGYTLPLTFGAGTKLELKRPGGGGSGGGGSG---GGGS-VDEVKLVE
6LFX_A      SGTDYSLTISNLDQEDIATYFCQQGYTLPLTFGAGTKLELKRPGGGGSGGGGSG---GGGS-VDEVKLVE
3ESU_F      SGTDYSLTISNLEQEDIGTYFCQQGNTLPWTFGGGTKLEIRRGGGGSGGGGSGGGG-SGGG-GSEVQLQQ
3ESV_G      SGTDYSLTINNLEQEDIGTYFCQQGNTPPWTFGGGTKLEIKRGGGGSGGGGSGGGG-SGGG-GSEVQLQQ
3AUV_D      SGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGCGTKVEIKGGGGGSIEGR------SGGG-GSEVQLVE
3AUV_E      SGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGCGTKVEIKGGGGGSIEGR------SGGG-GSEVQLVE
6TCS_A      SGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGG-GSEVQLVE
5OGI_B      YGTDFTLTISSLQPEDFATYFCHQDYSSPLTFGQGTKVEIKRGGGGSGGGG------SGGG-GSQVQLVQ
6EQC_D      YGTDFTLTISSLQPEDFATYFCHQDYSSPLTFGQGTKVEIKRGGGGSGGGG------SGGG-GSQVQLVQ

CD19        SGPGLVAPSQSLSVTCTVSGVSLPD-YGVSWIRQPPRKGLEWLGVIWGS-ETTYYNSALKSRLTIIKDNS
6LFW_A      SGGGLVKPGGSLKLSCAASGFTFST-YGMSWVRQTPEKRLEWVASISGG-GDTYYTDIVKGRVTISRDND
6LFX_A      SGGGLVKPGGSLKLSCAASGFTFST-YGMSWVRQTPEKRLEWVASISGG-GDTYYTDIVKGRVTISRDND
3ESU_F      SGPELVKPGASVKISCKDSGYAFSS-SWMNWVKQRPGQGPEWIGRIYPGDGDTNYNGKFKGKATLTADKS
3ESV_G      SGPELVKPGASVKISCKDSGYAFNS-SWMNWVKQRPGQGLEWIGRIYPGDGDSNYNGKFEGKAILTADKS
3AUV_D      SGGGLVQPGGSLRLSCAASGFTISD-YWIHWVRQAPGKCLEWVAGITPAGGYTYYADSVKGRFTISADTS
3AUV_E      SGGGLVQPGGSLRLSCAASGFTISD-YWIHWVRQAPGKCLEWVAGITPAGGYTYYADSVKGRFTISADTS
6TCS_A      SGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYD-GSTNYNPSVKGRITISRDDS
5OGI_B      SGAEDKKPGASVKVSCKVSGFSLGR-YGVHWVRQAPGQGLEWMGVIWRG-GTTDYNAKFQGRVTITKDDS
6EQC_D      SGAEDKKPGASVKVSCKVSGFSLGR-YGVHWVRQAPGQGLEWMGVIWRG-GTTDYNAKFQGRVTITKDDS

CD19        KSQVFLKMNSLQTDDTAIYYCAKHYYYGG--SYAMDYWGQGTSVTVSS
6LFW_A      RNILYLQMSSLRSEDTAMYHCTRGALVRLPHYYSMDYWGQGTSVTVS-
6LFX_A      RNILYLQMSSLRSEDTAMYHCTRGALVRLPHYYSMDYWGQGTSVTVS-
3ESU_F      SSTAYMQLSSLTSVDSAVYFCARSGLLR----YAMDYWGQGTSVTVSS
3ESV_G      SSTAYMQLSSLTSVDSAVYFCARSGLLR----YAMDYWGQGTSVTVSS
3AUV_D      KNTAYLQMNSLRAEDTAVYYCARFVFFLP---YAMDYWGQGTLVTVS-
3AUV_E      KNTAYLQMNSLRAEDTAVYYCARFVFFLP---YAMDYWGQGTLVTVS-
6TCS_A      KNTFYLQMNSLRAEDTAVYYCARGSHYFG--HWHFAVWGQGTLVTVS-
5OGI_B      KSTVYMELSSLRSEDTAVYYCARQGSN-----FPLAYWGQGTLVTVS-
6EQC_D      KSTVYMELSSLRSEDTAVYYCARQGSN-----FPLAYWGQGTLVTVS-

When I got PDB, it show warning and the 6TCS_A was lost:

gen.get_pdbs(pdb_format="polymer_entity_instance")

Checking template dir...
Template dir found!

Processing templates:

6LFW downloading from PDB...
6LFW downloaded!
6LFW_A: Chain extracted!
6LFW_A: Alignment updated!
6LFW_A: PDB processed!
6LFX downloading from PDB...
6LFX downloaded!
6LFX_A: Chain extracted!
6LFX_A: Alignment updated!
6LFX_A: PDB processed!
3AUV downloading from PDB...
3AUV downloaded!
3AUV_A: Chain extracted!
3AUV_A: Alignment updated!
3AUV_A: PDB processed!
3AUV_B: Chain extracted!
3AUV_B: Alignment updated!
3AUV_B: PDB processed!
3AUV_C: Chain extracted!
3AUV_C: Alignment updated!
3AUV_C: PDB processed!
3AUV_D: Chain extracted!
3AUV_D: Alignment updated!
3AUV_D: PDB processed!
3AUV_E: Chain extracted!
3AUV_E: Alignment updated!
3AUV_E: PDB processed!
3AUV_F: Chain extracted!
3AUV_F: Alignment updated!
3AUV_F: PDB processed!
6TCS downloading from PDB...
6TCS downloaded!
6TCS_A: Chain extracted!
5OGI downloading from PDB...
/usr/local/src/homelette-1.3/homelette/alignment.py:1973: UserWarning: 6TCS_A: Could not match sequence in alignment to sequence from structure.
  warnings.warn(msg)
5OGI downloaded!
5OGI_B: Chain extracted!
5OGI_B: Alignment updated!
5OGI_B: PDB processed!
6EQC downloading from PDB...
6EQC downloaded!
6EQC_D: Chain extracted!
6EQC_D: Alignment updated!
6EQC_D: PDB processed!
6EQC_E: Chain extracted!
6EQC_E: Alignment updated!
6EQC_E: PDB processed!
6EQC_F: Chain extracted!
6EQC_F: Alignment updated!
6EQC_F: PDB processed!

Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"/home/templates".

gen.show_suggestion() ##not have 6TCS_A
template    coverage    identity
1   6LFX_A  94.69   62.86
0   6LFW_A  92.24   62.45
5   3AUV_D  91.02   55.51
2   3AUV_A  90.61   55.51
3   3AUV_B  90.20   55.10
4   3AUV_C  90.20   55.10
6   3AUV_E  90.20   55.10
7   3AUV_F  89.80   54.69
8   5OGI_B  91.02   51.43
9   6EQC_D  91.02   51.43
10  6EQC_E  91.02   51.43
11  6EQC_F  91.02   51.43

How to handle this problem (I want to include the 6TCS_A). Thanks.

wt12318 commented 1 year ago

I found the sequence from rcsb Sequence panel : https://www.rcsb.org/sequence/6TCS (Download file--Download the fasta file) is the same of sequence got from get_entities_from_instances (seq_entity), but the sequence got from PDB file (pdb_chain.get_sequence(), seq_template) is different:

seq_template, seq_entity
('DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSS',
 'DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGGGSEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSSENLYFQGSGGGGSHHHHHHHHHH')

###a was download from Sequence panel of RCSB
a = "DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGGGSEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSSENLYFQGSGGGGSHHHHHHHHHH"
a == seq_entity 
True

PhilippJunk commented 1 year ago

Hi,

thanks for the interest in the package, you are not bothering me. It's nice to have someone vigorously test it.

That is an interesting bug you describe there. The way in which PDB files are weird never ceases to surprise me.

I can reconstruct what is happening, but I don't have a solution for it yet.

Internally, in get_pdbs, homelette calls the following methods:

import homelette as hm

pdb = hm.pdb_io.download_pdb('6TCS')
pdb_chain = pdb.transform_extract_chain('A')
seq_template = pdb_chain.get_sequence(ignore_missing=False)

The value of seq_template is the following:

DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSS

What is happening, is that in the PDB file, the amino acids are not sequentially numbered, but there is a jump from amino acid 111 to 1001. Because of this, homelette thinks that there are 890 missing residues in the middle and tries to match this to the sequence in the alignment, which obviously fails.

Inferring the length of sequence from the residue numbers in the PDB file sadly is kind of necessary for noticing missing residues in templates and correctly adapting the alignment to it.

I will think about it for a bit, managing your problem should be pretty straightforward and I will come back to you soon. Not sure however if I can catch that behavior on a global level.

wt12318 commented 1 year ago

The SEQRES field of PDB file seems the same as the seq_entity and also the downloaded fasta from RCSB:

SEQRES   1 A  277  ASP ILE GLN LEU THR GLN SER PRO SER SER LEU SER ALA          
SEQRES   2 A  277  SER VAL GLY ASP ARG VAL THR ILE THR CYS ARG ALA SER          
SEQRES   3 A  277  GLN SER VAL ASP TYR ASP GLY ASP SER TYR MET ASN TRP          
SEQRES   4 A  277  TYR GLN GLN LYS PRO GLY LYS ALA PRO LYS LEU LEU ILE          
SEQRES   5 A  277  TYR ALA ALA SER TYR LEU GLU SER GLY VAL PRO SER ARG          
SEQRES   6 A  277  PHE SER GLY SER GLY SER GLY THR ASP PHE THR LEU THR          
SEQRES   7 A  277  ILE SER SER LEU GLN PRO GLU ASP PHE ALA THR TYR TYR          
SEQRES   8 A  277  CYS GLN GLN SER HIS GLU ASP PRO TYR THR PHE GLY CYS          
SEQRES   9 A  277  GLY THR LYS VAL GLU ILE LYS ARG THR GLY GLY GLY GLY          
SEQRES  10 A  277  SER GLY GLY GLY GLY SER GLY GLY GLY GLY SER GLY GLY          
SEQRES  11 A  277  GLY GLY SER GLU VAL GLN LEU VAL GLU SER GLY GLY GLY          
SEQRES  12 A  277  LEU VAL GLN PRO GLY GLY SER LEU ARG LEU SER CYS ALA          
SEQRES  13 A  277  VAL SER GLY TYR SER ILE THR SER GLY TYR SER TRP ASN          
SEQRES  14 A  277  TRP ILE ARG GLN ALA PRO GLY LYS CYS LEU GLU TRP VAL          
SEQRES  15 A  277  ALA SER ILE THR TYR ASP GLY SER THR ASN TYR ASN PRO          
SEQRES  16 A  277  SER VAL LYS GLY ARG ILE THR ILE SER ARG ASP ASP SER          
SEQRES  17 A  277  LYS ASN THR PHE TYR LEU GLN MET ASN SER LEU ARG ALA          
SEQRES  18 A  277  GLU ASP THR ALA VAL TYR TYR CYS ALA ARG GLY SER HIS          
SEQRES  19 A  277  TYR PHE GLY HIS TRP HIS PHE ALA VAL TRP GLY GLN GLY          
SEQRES  20 A  277  THR LEU VAL THR VAL SER SER GLU ASN LEU TYR PHE GLN          
SEQRES  21 A  277  GLY SER GLY GLY GLY GLY SER HIS HIS HIS HIS HIS HIS          
SEQRES  22 A  277  HIS HIS HIS HIS

PhilippJunk commented 1 year ago

OK, it was a bit more complicated than I thought, since 6TCS actually has both problems:

there are missing residues
they are masked by an arbitrarily large gap

I am not 100% sure how I would use the sequence that should be there to construct a fool-proof logic of how the residues have to be re-arranged, but thanks for referring to the SEQRES entries.

I will leave this issue open for now and try to come up with an idea how I could catch this behavior automatically in the future

Now, for your problem, you could hack 6TCS back into the alignment generator like this:

import copy
import os.path
import re

import homelette as hm

###################################################################################
# manually reconstructing the alignment generator in the state after hhblits search
aln_dict = {
    'CD19': (
        'DIQMTQTTSSLSASLGDRVTISCRASQDIS----KYLNWYQQKPDGTVKLLIYHTSRLHSGVPSRFSGSG' +
        'SGTDYSLTISNLEQEDIATYFCQQGNTLPYTFGGGTKLEITGSTSGSGKPGS-----GEGSTKGEVKLQE' +
        'SGPGLVAPSQSLSVTCTVSGVSLPD-YGVSWIRQPPRKGLEWLGVIWGS-ETTYYNSALKSRLTIIKDNS' +
        'KSQVFLKMNSLQTDDTAIYYCAKHYYYGG--SYAMDYWGQGTSVTVSS'
    ),
    '6LFW_A': (
        '-IQMTQTASSLSASLGDRVTISCRASQYIN----NYLNWYQQKPDGTVTLLIYYTSILHSGVPSRFIGSG' +
        'SGTDYSLTISNLDQEDIATYFCQQGYTLPLTFGAGTKLELKRPGGGGSGGGGSG---GGGS-VDEVKLVE' +
        'SGGGLVKPGGSLKLSCAASGFTFST-YGMSWVRQTPEKRLEWVASISGG-GDTYYTDIVKGRVTISRDND' +
        'RNILYLQMSSLRSEDTAMYHCTRGALVRLPHYYSMDYWGQGTSVTVS-'
    ),
    '6LFX_A': (
        '-IQMTQTASSLSASLGDRVTISCRASQYIN----NYLNWYQQKPDGTVTLLIYYTSILHSGVPSRFIGSG' +
        'SGTDYSLTISNLDQEDIATYFCQQGYTLPLTFGAGTKLELKRPGGGGSGGGGSG---GGGS-VDEVKLVE' +
        'SGGGLVKPGGSLKLSCAASGFTFST-YGMSWVRQTPEKRLEWVASISGG-GDTYYTDIVKGRVTISRDND' +
        'RNILYLQMSSLRSEDTAMYHCTRGALVRLPHYYSMDYWGQGTSVTVS-'
    ),
    '3ESU_F': (
        '--VLIQSTSSLSASLGDRVTISCRASQDIR----NYLNWYQQKPDGTVKLLIYYTSRLQSGVPSRFSGSG' +
        'SGTDYSLTISNLEQEDIGTYFCQQGNTLPWTFGGGTKLEIRRGGGGSGGGGSGGGG-SGGG-GSEVQLQQ' +
        'SGPELVKPGASVKISCKDSGYAFSS-SWMNWVKQRPGQGPEWIGRIYPGDGDTNYNGKFKGKATLTADKS' +
        'SSTAYMQLSSLTSVDSAVYFCARSGLLR----YAMDYWGQGTSVTVSS'
    ),
    '3ESV_G': (
        '--QMTQTTSSLSASLGDRVTVSCRASQDIR----NYLNWYQQKPDGTVKFLIYYTSRLQPGVPSRFSGSG' +
        'SGTDYSLTINNLEQEDIGTYFCQQGNTPPWTFGGGTKLEIKRGGGGSGGGGSGGGG-SGGG-GSEVQLQQ' +
        'SGPELVKPGASVKISCKDSGYAFNS-SWMNWVKQRPGQGLEWIGRIYPGDGDSNYNGKFEGKAILTADKS' +
        'SSTAYMQLSSLTSVDSAVYFCARSGLLR----YAMDYWGQGTSVTVSS'
    ),
    '3AUV_D': (
        '--QMTQSPSSLSASVGDRVTITCRASQDVS----TAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSG' +
        'SGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGCGTKVEIKGGGGGSIEGR------SGGG-GSEVQLVE' +
        'SGGGLVQPGGSLRLSCAASGFTISD-YWIHWVRQAPGKCLEWVAGITPAGGYTYYADSVKGRFTISADTS' +
        'KNTAYLQMNSLRAEDTAVYYCARFVFFLP---YAMDYWGQGTLVTVS-'
    ),
    '3AUV_E': (
        '--QMTQSPSSLSASVGDRVTITCRASQDVS----TAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSG' +
        'SGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGCGTKVEIKGGGGGSIEGR------SGGG-GSEVQLVE' +
        'SGGGLVQPGGSLRLSCAASGFTISD-YWIHWVRQAPGKCLEWVAGITPAGGYTYYADSVKGRFTISADTS' +
        'KNTAYLQMNSLRAEDTAVYYCARFVFFLP---YAMDYWGQGTLVTVS-'
    ),
    '6TCS_A': (
        '-IQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSG' +
        'SGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGG-GSEVQLVE' +
        'SGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYD-GSTNYNPSVKGRITISRDDS' +
        'KNTFYLQMNSLRAEDTAVYYCARGSHYFG--HWHFAVWGQGTLVTVS-'
    ),
    '5OGI_B': (
        '-IVMTQSPSSLSASVGDRVTITCKASQSVT----NDAAWYQKKPGKAPKLLIYQASTRYTGVPSRFSGSG' +
        'YGTDFTLTISSLQPEDFATYFCHQDYSSPLTFGQGTKVEIKRGGGGSGGGG------SGGG-GSQVQLVQ' +
        'SGAEDKKPGASVKVSCKVSGFSLGR-YGVHWVRQAPGQGLEWMGVIWRG-GTTDYNAKFQGRVTITKDDS' +
        'KSTVYMELSSLRSEDTAVYYCARQGSN-----FPLAYWGQGTLVTVS-'
    ),
    '6EQC_D': (
        '-IVMTQSPSSLSASVGDRVTITCKASQSVT----NDAAWYQKKPGKAPKLLIYQASTRYTGVPSRFSGSG' +
        'YGTDFTLTISSLQPEDFATYFCHQDYSSPLTFGQGTKVEIKRGGGGSGGGG------SGGG-GSQVQLVQ' +
        'SGAEDKKPGASVKVSCKVSGFSLGR-YGVHWVRQAPGQGLEWMGVIWRG-GTTDYNAKFQGRVTITKDDS' +
        'KSTVYMELSSLRSEDTAVYYCARQGSN-----FPLAYWGQGTLVTVS-'
    ),
}

orig_aln = hm.alignment.Alignment()

for key, value in aln_dict.items():
    orig_aln.sequences[key] = hm.alignment.Sequence(key, value)

gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta('CD19.fa')
gen.alignment = copy.deepcopy(orig_aln)
gen.state['has_alignment'] = True

#gen.show_suggestion()
#gen.alignment.print_clustal(70)

###################################################################################
# start from here
template_location = './templates/'

def adjust_template_seq(seq_alignment: str,
                        seq_template: str):
    '''
    Adjusts the sequence from the template to the sequence present in
    the alignment

    Check match
    Pad right and left with missing residues
    '''
    # check if template sequence is in alignment sequence
    seq_match = re.search(seq_template.replace('X', r'\w'),
                          seq_alignment)

    if seq_match is None:
        raise RuntimeError(
            'Could not match template sequence with aligned sequence.')

    # pad template sequence if necessary
    seq_template_padded = (
            seq_template
            .rjust(len(seq_template) + seq_match.start(), 'X')
            .ljust(len(seq_alignment), 'X'))

    return seq_template_padded

def custom_renumber(pdb):
    '''
    Substracts 867 residues from every residue index > 1000
    '''
    output_lines = []
    for line in pdb.lines:
        if not (line.startswith('ATOM') or line.startswith('HETATM')):
            output_lines.append(line)
            continue

        curr_resSeq = int(line[22:26])
        if curr_resSeq > 1000:
            output_lines.append(line[:22] + str(curr_resSeq - 867).rjust(4) + line[26:])
        else:
            output_lines.append(line)

    return hm.pdb_io.PdbObject(output_lines)

def add_seq_to_aln(aln, seq_name, old_sequence, new_sequence):
    '''
    Adds sequence to alignment, then performs sequence replacement and
    annotates the sequence.
    '''
    annotations = {
        'seq_type': 'structure',
        'pdb_code': seq_name,
        'begin_res': '1',
        'begin_chain': 'A',
        }
    aln.sequences.update({
        seq_name: hm.alignment.Sequence(seq_name, old_sequence, **annotations)})
    aln.replace_sequence(seq_name, new_sequence)

    return aln

# download and process structure
pdb_manual = (
    hm.pdb_io.download_pdb('6TCS').
    transform_extract_chain('A').
    transform_remove_hetatm().
    transform_change_chain_id(new_chain_id='A')
)
# manual adjustment of residue numbers
pdb_manual = custom_renumber(pdb_manual)

# adjusting sequences based on what is aligned, what is the full sequence of the PDB, and what is in the template structure
seq_entity = 'DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGGGSEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSSENLYFQGSGGGGSHHHHHHHHHH'
seq_alignment = orig_aln.sequences['6TCS_A'].sequence.replace('-', '')
seq_template = pdb_manual.get_sequence(ignore_missing = False)
seq_template_padded = adjust_template_seq(seq_entity, seq_template)
# get indices for clipping sequence
index_first_res_clipped, index_last_res_clipped = re.search(seq_alignment, seq_entity).span()
# get first and last residue from template pdb
template_res_span = (
    int(pdb_manual.lines[0][22:26]),
    int(pdb_manual.lines[-1][22:26]))
# get index of first and last residue in alignment
for i, res in enumerate(seq_template_padded):
    if res != 'X':
        index_first_res_template = i
        break
for i, res in enumerate(seq_template_padded[::-1]):
    if res != 'X':
        index_last_res_template = len(seq_template_padded) - i
        break
# calculate residue range to clip from indices
lower = (template_res_span[0] + index_first_res_clipped -
         index_first_res_template)
upper = (template_res_span[1] + index_last_res_clipped -
         index_last_res_template)
# adjust template structure
pdb_manual = pdb_manual.transform_filter_res_seq(lower, upper)

# update alignment
gen.alignment = add_seq_to_aln(
    gen.alignment, f'6TCS_A',
    (orig_aln.sequences['6TCS_A'].sequence),
    (seq_template_padded[index_first_res_clipped:index_last_res_clipped]))

# finish processing template
pdb_manual = pdb_manual.transform_renumber_residues(1)

pdb_manual.write_pdb(os.path.join(template_location, '6TCS_A.pdb'))

#gen.show_suggestion()
#gen.alignment.print_clustal(70)

###################################################################################
# modelling is now possible

t = gen.initialize_task(task_name = 'gh_test', overwrite = True)
t.execute_routine(
    tag = f'test_6TCS',
    routine = hm.routines.Routine_automodel_default,
    templates = ['6TCS_A'],
    template_location = template_location
)

This became a bit more convoluted, it basically re-creates how get_pdbs works, with some manually adjustments sprinkled in. adjust_template_seq and add_seq_to_aln are helper functions of get_pdbs.

wt12318 commented 1 year ago

I have another question: why the 6LFV can not be searched in PDB70 database but can be searched in Swiss-Model (The pdb70_clu.tsv file shows that 6LFV is exist in PDB70 and it's sequence identity is high):

The pdb70_clu.tsv file: pdb70_clu.zip

The PDB70 database searched results:

import time
import argparse
import homelette as hm
import pandas as pd

##read fasta file
print("####read fasta file#####")
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta("fasta/CD19.fa")
print("#####search template######")
gen.get_suggestion(database_dir="/home/hhsuite_dbs/", n_threads=4, iterations = 4)
temp = gen.show_suggestion()
temp.to_csv("cd19_temp.csv")

cd19_temp.zip

The swiss-model result:

PhilippJunk commented 1 year ago

I can confirm that this is happening on my setup as well. It is not due to homelette parsing the hhblits output in a wrong way, but because hhblits, with the given parameters, does not include 6LFV in the alignment. I also can't find a parameter combination that will get hhblits to include the sequence.

If I had to guess, swiss-model is using different parameters for the hhblits search, performing different post-processing on it, or might even use their own database.

Please post new questions in the future in new issues, so that there is only one topic per issue.

PhilippJunk commented 1 year ago

From a practical perspective, what you can do is re-align your sequences after you have identified all templates of interest, and then use that alignment as a starting point.

In my opinion, considering that the multiple sequence alignment is a very essential part of the homology modelling process, automated template search and alignment generation is nice for initial exploration, and can be useful for automated pipelines, but sometimes it is useful to generate the alignment manually with well defined inputs.

wt12318 commented 1 year ago

Thank you for your response. I try: search templates, download selected template PDB and extract sequence, and the re-align by clustalo and the model from this alignment, but some error happened:

###search templates
import homelette as hm
##read fasta file
print("####read fasta file#####")
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta("fasta/CD19.fa")
print("#####search template######")
gen.get_suggestion(database_dir="./hhsuite_dbs/",n_threads=5)
gen.select_templates(list(temp.template)[0:9])
temp = list(gen.show_suggestion().template)
temp
['6LFW_A',
 '6LFX_A',
 '3ESU_F',
 '3ESV_G',
 '3AUV_D',
 '3AUV_E',
 '6TCS_A',
 '5OGI_B',
 '6EQC_D']

###download pdb and extarct sequence
import requests
import shutil
target = "cd19"
for i in temp:
    protein = i.split("_")[0].lower()
    URL2 = "https://files.rcsb.org/download/"+ protein +".pdb"
    response = requests.get(URL2)
    if (response.status_code == 200):
        pdb_file = open("templates/"+protein+".pdb", "ba")
        pdb_file.write(response.content)
        pdb_file.close()
        os.system('/home/pdb2fasta/pdb2fasta.sh templates/' + protein + '.pdb >> pdb2fa/' + target + '.fa')

###get needed fasta
f=open('pdb2fa/cd19.fa','r')
lines=f.readlines()
fa_name = []
for pro in temp:
    fa_name.extend([i for i, item in enumerate(lines) if re.search(pro, item)])
fa_index = [i + 1 for i in fa_name]

shutil.copy2("fasta/CD19.fa", "temp_pdb_seq/cd19_multi.fa")
ofile = open("temp_pdb_seq/cd19_multi.fa", "a")
for i in range(len(fa_index)):
    ofile.write(lines[fa_name[i]] + lines[fa_index[i]])
ofile.close()

###msa
os.system("./clustalo --in=temp_pdb_seq/cd19_multi.fa --out=aln/cd19_aln.fa") 

###model
gen2 = hm.alignment.AlignmentGenerator_from_aln(
        alignment_file = './aln/cd19_aln.fa',
        target = 'CD19')
gen2.show_suggestion()
gen2.get_pdbs()

Guessing template naming format...
Template naming format guessed: polymer_entity_instance!

Checking template dir...
Template dir found!

Processing templates:

Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"./templates/".
/usr/local/src/homelette-1.3/homelette/alignment.py:1773: UserWarning: Entity: 3AUV_1
Found instances of the same PDB entity, but with different alignments. Since all instances of an entity share a sequence, this should not happen.
Automatically selected instance with highest sequence identity: 3AUV_D
To avoid this behaviour, please select one of the instances before running get_pdbs.
  warnings.warn(
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6LFW_A
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it,  please save your alignment and change the name of the sequence
  warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6LFX_A
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it,  please save your alignment and change the name of the sequence
  warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 3AUV_D
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it,  please save your alignment and change the name of the sequence
  warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6TCS_A
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it,  please save your alignment and change the name of the sequence
  warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 5OGI_B
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it,  please save your alignment and change the name of the sequence
  warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6EQC_D
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it,  please save your alignment and change the name of the sequence
  warnings.warn(msg)

The ./templates/ dir is empty.

PhilippJunk commented 1 year ago

Without knowing what /home/pdb2fasta/pdb2fasta.sh is really doing, it's kind of hard to diagnose the problem.

My guess would be that you are are getting sequences which already have the missing residues removed (as you would get from hm.pdb_io.PDBObject.get_sequence(ignore_missing=True), whereas get_pdbs expects to be able to match the sequences in the alignment to the complete respective entity sequence, or a continuous fragment of it, with missing residues denoted by the character X.

Instead of downloading the PDB files, and then extracting the sequences, you could also get the sequences directly: "https://www.rcsb.org/fasta/entry/"+ protein

wt12318 commented 1 year ago

the pdb2fasta is get from https://github.com/kad-ecoli/pdb2fasta/blob/master/pdb2fasta.sh

PhilippJunk commented 1 year ago

I don't really have the time to look at it in detail, but it seems to extract the sequence by parsing the PDB ATOM records. If a residue is missing in the structure, there is no ATOM record for it, therefore it will not show up in the sequence, therefore there will be a mis-match between the sequence from the PDB and the sequence in the structure, leading the the problem I outlined in my last comment.

If you want to use get_pdbs, you are better off aligning the sequences from the PDB, homelette is then taking care of looking at the structures and matching these up (as long as it can do this, as we can see from 6TCS it is not a foolproof way depending on what the PDB files gives us to work with).

PhilippJunk / homelette

Could not match sequence in alignment to sequence from structure #4