Closed wt12318 closed 1 year ago
I found the sequence from rcsb Sequence
panel : https://www.rcsb.org/sequence/6TCS (Download file--Download the fasta file) is the same of sequence got from get_entities_from_instances
(seq_entity), but the sequence got from PDB file (pdb_chain.get_sequence()
, seq_template) is different:
seq_template, seq_entity
('DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSS',
'DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGGGSEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSSENLYFQGSGGGGSHHHHHHHHHH')
###a was download from Sequence panel of RCSB
a = "DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGGGSEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSSENLYFQGSGGGGSHHHHHHHHHH"
a == seq_entity
True
Hi,
thanks for the interest in the package, you are not bothering me. It's nice to have someone vigorously test it.
That is an interesting bug you describe there. The way in which PDB files are weird never ceases to surprise me.
I can reconstruct what is happening, but I don't have a solution for it yet.
Internally, in get_pdbs
, homelette
calls the following methods:
import homelette as hm
pdb = hm.pdb_io.download_pdb('6TCS')
pdb_chain = pdb.transform_extract_chain('A')
seq_template = pdb_chain.get_sequence(ignore_missing=False)
The value of seq_template
is the following:
DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSS
What is happening, is that in the PDB file, the amino acids are not sequentially numbered, but there is a jump from amino acid 111
to 1001
. Because of this, homelette
thinks that there are 890 missing residues in the middle and tries to match this to the sequence in the alignment, which obviously fails.
Inferring the length of sequence from the residue numbers in the PDB file sadly is kind of necessary for noticing missing residues in templates and correctly adapting the alignment to it.
I will think about it for a bit, managing your problem should be pretty straightforward and I will come back to you soon. Not sure however if I can catch that behavior on a global level.
The SEQRES
field of PDB file seems the same as the seq_entity and also the downloaded fasta from RCSB:
SEQRES 1 A 277 ASP ILE GLN LEU THR GLN SER PRO SER SER LEU SER ALA
SEQRES 2 A 277 SER VAL GLY ASP ARG VAL THR ILE THR CYS ARG ALA SER
SEQRES 3 A 277 GLN SER VAL ASP TYR ASP GLY ASP SER TYR MET ASN TRP
SEQRES 4 A 277 TYR GLN GLN LYS PRO GLY LYS ALA PRO LYS LEU LEU ILE
SEQRES 5 A 277 TYR ALA ALA SER TYR LEU GLU SER GLY VAL PRO SER ARG
SEQRES 6 A 277 PHE SER GLY SER GLY SER GLY THR ASP PHE THR LEU THR
SEQRES 7 A 277 ILE SER SER LEU GLN PRO GLU ASP PHE ALA THR TYR TYR
SEQRES 8 A 277 CYS GLN GLN SER HIS GLU ASP PRO TYR THR PHE GLY CYS
SEQRES 9 A 277 GLY THR LYS VAL GLU ILE LYS ARG THR GLY GLY GLY GLY
SEQRES 10 A 277 SER GLY GLY GLY GLY SER GLY GLY GLY GLY SER GLY GLY
SEQRES 11 A 277 GLY GLY SER GLU VAL GLN LEU VAL GLU SER GLY GLY GLY
SEQRES 12 A 277 LEU VAL GLN PRO GLY GLY SER LEU ARG LEU SER CYS ALA
SEQRES 13 A 277 VAL SER GLY TYR SER ILE THR SER GLY TYR SER TRP ASN
SEQRES 14 A 277 TRP ILE ARG GLN ALA PRO GLY LYS CYS LEU GLU TRP VAL
SEQRES 15 A 277 ALA SER ILE THR TYR ASP GLY SER THR ASN TYR ASN PRO
SEQRES 16 A 277 SER VAL LYS GLY ARG ILE THR ILE SER ARG ASP ASP SER
SEQRES 17 A 277 LYS ASN THR PHE TYR LEU GLN MET ASN SER LEU ARG ALA
SEQRES 18 A 277 GLU ASP THR ALA VAL TYR TYR CYS ALA ARG GLY SER HIS
SEQRES 19 A 277 TYR PHE GLY HIS TRP HIS PHE ALA VAL TRP GLY GLN GLY
SEQRES 20 A 277 THR LEU VAL THR VAL SER SER GLU ASN LEU TYR PHE GLN
SEQRES 21 A 277 GLY SER GLY GLY GLY GLY SER HIS HIS HIS HIS HIS HIS
SEQRES 22 A 277 HIS HIS HIS HIS
OK, it was a bit more complicated than I thought, since 6TCS
actually has both problems:
I am not 100% sure how I would use the sequence that should be there to construct a fool-proof logic of how the residues have to be re-arranged, but thanks for referring to the SEQRES entries.
I will leave this issue open for now and try to come up with an idea how I could catch this behavior automatically in the future
Now, for your problem, you could hack 6TCS
back into the alignment generator like this:
import copy
import os.path
import re
import homelette as hm
###################################################################################
# manually reconstructing the alignment generator in the state after hhblits search
aln_dict = {
'CD19': (
'DIQMTQTTSSLSASLGDRVTISCRASQDIS----KYLNWYQQKPDGTVKLLIYHTSRLHSGVPSRFSGSG' +
'SGTDYSLTISNLEQEDIATYFCQQGNTLPYTFGGGTKLEITGSTSGSGKPGS-----GEGSTKGEVKLQE' +
'SGPGLVAPSQSLSVTCTVSGVSLPD-YGVSWIRQPPRKGLEWLGVIWGS-ETTYYNSALKSRLTIIKDNS' +
'KSQVFLKMNSLQTDDTAIYYCAKHYYYGG--SYAMDYWGQGTSVTVSS'
),
'6LFW_A': (
'-IQMTQTASSLSASLGDRVTISCRASQYIN----NYLNWYQQKPDGTVTLLIYYTSILHSGVPSRFIGSG' +
'SGTDYSLTISNLDQEDIATYFCQQGYTLPLTFGAGTKLELKRPGGGGSGGGGSG---GGGS-VDEVKLVE' +
'SGGGLVKPGGSLKLSCAASGFTFST-YGMSWVRQTPEKRLEWVASISGG-GDTYYTDIVKGRVTISRDND' +
'RNILYLQMSSLRSEDTAMYHCTRGALVRLPHYYSMDYWGQGTSVTVS-'
),
'6LFX_A': (
'-IQMTQTASSLSASLGDRVTISCRASQYIN----NYLNWYQQKPDGTVTLLIYYTSILHSGVPSRFIGSG' +
'SGTDYSLTISNLDQEDIATYFCQQGYTLPLTFGAGTKLELKRPGGGGSGGGGSG---GGGS-VDEVKLVE' +
'SGGGLVKPGGSLKLSCAASGFTFST-YGMSWVRQTPEKRLEWVASISGG-GDTYYTDIVKGRVTISRDND' +
'RNILYLQMSSLRSEDTAMYHCTRGALVRLPHYYSMDYWGQGTSVTVS-'
),
'3ESU_F': (
'--VLIQSTSSLSASLGDRVTISCRASQDIR----NYLNWYQQKPDGTVKLLIYYTSRLQSGVPSRFSGSG' +
'SGTDYSLTISNLEQEDIGTYFCQQGNTLPWTFGGGTKLEIRRGGGGSGGGGSGGGG-SGGG-GSEVQLQQ' +
'SGPELVKPGASVKISCKDSGYAFSS-SWMNWVKQRPGQGPEWIGRIYPGDGDTNYNGKFKGKATLTADKS' +
'SSTAYMQLSSLTSVDSAVYFCARSGLLR----YAMDYWGQGTSVTVSS'
),
'3ESV_G': (
'--QMTQTTSSLSASLGDRVTVSCRASQDIR----NYLNWYQQKPDGTVKFLIYYTSRLQPGVPSRFSGSG' +
'SGTDYSLTINNLEQEDIGTYFCQQGNTPPWTFGGGTKLEIKRGGGGSGGGGSGGGG-SGGG-GSEVQLQQ' +
'SGPELVKPGASVKISCKDSGYAFNS-SWMNWVKQRPGQGLEWIGRIYPGDGDSNYNGKFEGKAILTADKS' +
'SSTAYMQLSSLTSVDSAVYFCARSGLLR----YAMDYWGQGTSVTVSS'
),
'3AUV_D': (
'--QMTQSPSSLSASVGDRVTITCRASQDVS----TAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSG' +
'SGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGCGTKVEIKGGGGGSIEGR------SGGG-GSEVQLVE' +
'SGGGLVQPGGSLRLSCAASGFTISD-YWIHWVRQAPGKCLEWVAGITPAGGYTYYADSVKGRFTISADTS' +
'KNTAYLQMNSLRAEDTAVYYCARFVFFLP---YAMDYWGQGTLVTVS-'
),
'3AUV_E': (
'--QMTQSPSSLSASVGDRVTITCRASQDVS----TAVAWYQQKPGKAPKLLIYSASFLYSGVPSRFSGSG' +
'SGTDFTLTISSLQPEDFATYYCQQHYTTPPTFGCGTKVEIKGGGGGSIEGR------SGGG-GSEVQLVE' +
'SGGGLVQPGGSLRLSCAASGFTISD-YWIHWVRQAPGKCLEWVAGITPAGGYTYYADSVKGRFTISADTS' +
'KNTAYLQMNSLRAEDTAVYYCARFVFFLP---YAMDYWGQGTLVTVS-'
),
'6TCS_A': (
'-IQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSG' +
'SGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGG-GSEVQLVE' +
'SGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYD-GSTNYNPSVKGRITISRDDS' +
'KNTFYLQMNSLRAEDTAVYYCARGSHYFG--HWHFAVWGQGTLVTVS-'
),
'5OGI_B': (
'-IVMTQSPSSLSASVGDRVTITCKASQSVT----NDAAWYQKKPGKAPKLLIYQASTRYTGVPSRFSGSG' +
'YGTDFTLTISSLQPEDFATYFCHQDYSSPLTFGQGTKVEIKRGGGGSGGGG------SGGG-GSQVQLVQ' +
'SGAEDKKPGASVKVSCKVSGFSLGR-YGVHWVRQAPGQGLEWMGVIWRG-GTTDYNAKFQGRVTITKDDS' +
'KSTVYMELSSLRSEDTAVYYCARQGSN-----FPLAYWGQGTLVTVS-'
),
'6EQC_D': (
'-IVMTQSPSSLSASVGDRVTITCKASQSVT----NDAAWYQKKPGKAPKLLIYQASTRYTGVPSRFSGSG' +
'YGTDFTLTISSLQPEDFATYFCHQDYSSPLTFGQGTKVEIKRGGGGSGGGG------SGGG-GSQVQLVQ' +
'SGAEDKKPGASVKVSCKVSGFSLGR-YGVHWVRQAPGQGLEWMGVIWRG-GTTDYNAKFQGRVTITKDDS' +
'KSTVYMELSSLRSEDTAVYYCARQGSN-----FPLAYWGQGTLVTVS-'
),
}
orig_aln = hm.alignment.Alignment()
for key, value in aln_dict.items():
orig_aln.sequences[key] = hm.alignment.Sequence(key, value)
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta('CD19.fa')
gen.alignment = copy.deepcopy(orig_aln)
gen.state['has_alignment'] = True
#gen.show_suggestion()
#gen.alignment.print_clustal(70)
###################################################################################
# start from here
template_location = './templates/'
def adjust_template_seq(seq_alignment: str,
seq_template: str):
'''
Adjusts the sequence from the template to the sequence present in
the alignment
Check match
Pad right and left with missing residues
'''
# check if template sequence is in alignment sequence
seq_match = re.search(seq_template.replace('X', r'\w'),
seq_alignment)
if seq_match is None:
raise RuntimeError(
'Could not match template sequence with aligned sequence.')
# pad template sequence if necessary
seq_template_padded = (
seq_template
.rjust(len(seq_template) + seq_match.start(), 'X')
.ljust(len(seq_alignment), 'X'))
return seq_template_padded
def custom_renumber(pdb):
'''
Substracts 867 residues from every residue index > 1000
'''
output_lines = []
for line in pdb.lines:
if not (line.startswith('ATOM') or line.startswith('HETATM')):
output_lines.append(line)
continue
curr_resSeq = int(line[22:26])
if curr_resSeq > 1000:
output_lines.append(line[:22] + str(curr_resSeq - 867).rjust(4) + line[26:])
else:
output_lines.append(line)
return hm.pdb_io.PdbObject(output_lines)
def add_seq_to_aln(aln, seq_name, old_sequence, new_sequence):
'''
Adds sequence to alignment, then performs sequence replacement and
annotates the sequence.
'''
annotations = {
'seq_type': 'structure',
'pdb_code': seq_name,
'begin_res': '1',
'begin_chain': 'A',
}
aln.sequences.update({
seq_name: hm.alignment.Sequence(seq_name, old_sequence, **annotations)})
aln.replace_sequence(seq_name, new_sequence)
return aln
# download and process structure
pdb_manual = (
hm.pdb_io.download_pdb('6TCS').
transform_extract_chain('A').
transform_remove_hetatm().
transform_change_chain_id(new_chain_id='A')
)
# manual adjustment of residue numbers
pdb_manual = custom_renumber(pdb_manual)
# adjusting sequences based on what is aligned, what is the full sequence of the PDB, and what is in the template structure
seq_entity = 'DIQLTQSPSSLSASVGDRVTITCRASQSVDYDGDSYMNWYQQKPGKAPKLLIYAASYLESGVPSRFSGSGSGTDFTLTISSLQPEDFATYYCQQSHEDPYTFGCGTKVEIKRTGGGGSGGGGSGGGGSGGGGSEVQLVESGGGLVQPGGSLRLSCAVSGYSITSGYSWNWIRQAPGKCLEWVASITYDGSTNYNPSVKGRITISRDDSKNTFYLQMNSLRAEDTAVYYCARGSHYFGHWHFAVWGQGTLVTVSSENLYFQGSGGGGSHHHHHHHHHH'
seq_alignment = orig_aln.sequences['6TCS_A'].sequence.replace('-', '')
seq_template = pdb_manual.get_sequence(ignore_missing = False)
seq_template_padded = adjust_template_seq(seq_entity, seq_template)
# get indices for clipping sequence
index_first_res_clipped, index_last_res_clipped = re.search(seq_alignment, seq_entity).span()
# get first and last residue from template pdb
template_res_span = (
int(pdb_manual.lines[0][22:26]),
int(pdb_manual.lines[-1][22:26]))
# get index of first and last residue in alignment
for i, res in enumerate(seq_template_padded):
if res != 'X':
index_first_res_template = i
break
for i, res in enumerate(seq_template_padded[::-1]):
if res != 'X':
index_last_res_template = len(seq_template_padded) - i
break
# calculate residue range to clip from indices
lower = (template_res_span[0] + index_first_res_clipped -
index_first_res_template)
upper = (template_res_span[1] + index_last_res_clipped -
index_last_res_template)
# adjust template structure
pdb_manual = pdb_manual.transform_filter_res_seq(lower, upper)
# update alignment
gen.alignment = add_seq_to_aln(
gen.alignment, f'6TCS_A',
(orig_aln.sequences['6TCS_A'].sequence),
(seq_template_padded[index_first_res_clipped:index_last_res_clipped]))
# finish processing template
pdb_manual = pdb_manual.transform_renumber_residues(1)
pdb_manual.write_pdb(os.path.join(template_location, '6TCS_A.pdb'))
#gen.show_suggestion()
#gen.alignment.print_clustal(70)
###################################################################################
# modelling is now possible
t = gen.initialize_task(task_name = 'gh_test', overwrite = True)
t.execute_routine(
tag = f'test_6TCS',
routine = hm.routines.Routine_automodel_default,
templates = ['6TCS_A'],
template_location = template_location
)
This became a bit more convoluted, it basically re-creates how get_pdbs
works, with some manually adjustments sprinkled in. adjust_template_seq
and add_seq_to_aln
are helper functions of get_pdbs
.
I have another question: why the 6LFV
can not be searched in PDB70 database but can be searched in Swiss-Model (The pdb70_clu.tsv file shows that 6LFV is exist in PDB70 and it's sequence identity is high):
The pdb70_clu.tsv file: pdb70_clu.zip
The PDB70 database searched results:
import time
import argparse
import homelette as hm
import pandas as pd
##read fasta file
print("####read fasta file#####")
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta("fasta/CD19.fa")
print("#####search template######")
gen.get_suggestion(database_dir="/home/hhsuite_dbs/", n_threads=4, iterations = 4)
temp = gen.show_suggestion()
temp.to_csv("cd19_temp.csv")
The swiss-model result:
I can confirm that this is happening on my setup as well. It is not due to homelette
parsing the hhblits
output in a wrong way, but because hhblits
, with the given parameters, does not include 6LFV
in the alignment. I also can't find a parameter combination that will get hhblits
to include the sequence.
If I had to guess, swiss-model is using different parameters for the hhblits
search, performing different post-processing on it, or might even use their own database.
Please post new questions in the future in new issues, so that there is only one topic per issue.
From a practical perspective, what you can do is re-align your sequences after you have identified all templates of interest, and then use that alignment as a starting point.
In my opinion, considering that the multiple sequence alignment is a very essential part of the homology modelling process, automated template search and alignment generation is nice for initial exploration, and can be useful for automated pipelines, but sometimes it is useful to generate the alignment manually with well defined inputs.
Thank you for your response. I try: search templates, download selected template PDB and extract sequence, and the re-align by clustalo and the model from this alignment, but some error happened:
###search templates
import homelette as hm
##read fasta file
print("####read fasta file#####")
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta("fasta/CD19.fa")
print("#####search template######")
gen.get_suggestion(database_dir="./hhsuite_dbs/",n_threads=5)
gen.select_templates(list(temp.template)[0:9])
temp = list(gen.show_suggestion().template)
temp
['6LFW_A',
'6LFX_A',
'3ESU_F',
'3ESV_G',
'3AUV_D',
'3AUV_E',
'6TCS_A',
'5OGI_B',
'6EQC_D']
###download pdb and extarct sequence
import requests
import shutil
target = "cd19"
for i in temp:
protein = i.split("_")[0].lower()
URL2 = "https://files.rcsb.org/download/"+ protein +".pdb"
response = requests.get(URL2)
if (response.status_code == 200):
pdb_file = open("templates/"+protein+".pdb", "ba")
pdb_file.write(response.content)
pdb_file.close()
os.system('/home/pdb2fasta/pdb2fasta.sh templates/' + protein + '.pdb >> pdb2fa/' + target + '.fa')
###get needed fasta
f=open('pdb2fa/cd19.fa','r')
lines=f.readlines()
fa_name = []
for pro in temp:
fa_name.extend([i for i, item in enumerate(lines) if re.search(pro, item)])
fa_index = [i + 1 for i in fa_name]
shutil.copy2("fasta/CD19.fa", "temp_pdb_seq/cd19_multi.fa")
ofile = open("temp_pdb_seq/cd19_multi.fa", "a")
for i in range(len(fa_index)):
ofile.write(lines[fa_name[i]] + lines[fa_index[i]])
ofile.close()
###msa
os.system("./clustalo --in=temp_pdb_seq/cd19_multi.fa --out=aln/cd19_aln.fa")
###model
gen2 = hm.alignment.AlignmentGenerator_from_aln(
alignment_file = './aln/cd19_aln.fa',
target = 'CD19')
gen2.show_suggestion()
gen2.get_pdbs()
Guessing template naming format...
Template naming format guessed: polymer_entity_instance!
Checking template dir...
Template dir found!
Processing templates:
Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"./templates/".
/usr/local/src/homelette-1.3/homelette/alignment.py:1773: UserWarning: Entity: 3AUV_1
Found instances of the same PDB entity, but with different alignments. Since all instances of an entity share a sequence, this should not happen.
Automatically selected instance with highest sequence identity: 3AUV_D
To avoid this behaviour, please select one of the instances before running get_pdbs.
warnings.warn(
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6LFW_A
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it, please save your alignment and change the name of the sequence
warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6LFX_A
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it, please save your alignment and change the name of the sequence
warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 3AUV_D
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it, please save your alignment and change the name of the sequence
warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6TCS_A
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it, please save your alignment and change the name of the sequence
warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 5OGI_B
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it, please save your alignment and change the name of the sequence
warnings.warn(msg)
/usr/local/src/homelette-1.3/homelette/alignment.py:1950: UserWarning: 6EQC_D
Could not find sequence from alignment in sequence of assigned entity. This can happen if the chains were reassigned in the PDB after the pdb70 database was created.
This sequence will be skipped. In order to use it, please save your alignment and change the name of the sequence
warnings.warn(msg)
The ./templates/
dir is empty.
Without knowing what /home/pdb2fasta/pdb2fasta.sh
is really doing, it's kind of hard to diagnose the problem.
My guess would be that you are are getting sequences which already have the missing residues removed (as you would get from hm.pdb_io.PDBObject.get_sequence(ignore_missing=True)
, whereas get_pdbs
expects to be able to match the sequences in the alignment to the complete respective entity sequence, or a continuous fragment of it, with missing residues denoted by the character X
.
Instead of downloading the PDB files, and then extracting the sequences, you could also get the sequences directly: "https://www.rcsb.org/fasta/entry/"+ protein
the pdb2fasta
is get from https://github.com/kad-ecoli/pdb2fasta/blob/master/pdb2fasta.sh
I don't really have the time to look at it in detail, but it seems to extract the sequence by parsing the PDB ATOM records. If a residue is missing in the structure, there is no ATOM record for it, therefore it will not show up in the sequence, therefore there will be a mis-match between the sequence from the PDB and the sequence in the structure, leading the the problem I outlined in my last comment.
If you want to use get_pdbs
, you are better off aligning the sequences from the PDB, homelette
is then taking care of looking at the structures and matching these up (as long as it can do this, as we can see from 6TCS
it is not a foolproof way depending on what the PDB files gives us to work with).
Hi,
Sorry to bother again. I used the following code to do template search:
When I got PDB, it show warning and the
6TCS_A
was lost:How to handle this problem (I want to include the 6TCS_A). Thanks.