Structurebiology-BNL / ESMBind

Deep learning + physical modeling for 3D protein metal ion binding prediction

Inconsistent parsing of FASTA IDs in `get_esm*_embedding.py` and `data_process.py`. #1

Open alchemistcai opened 2 months ago

alchemistcai commented 2 months ago

FASTA IDs are parsed differently in `get_esm_embedding.py` > `process_fasta`, `get_esm_if_embedding.py` > `embedding`, `data_process.py` > `prep_test_dataset`, and `utils.py` > `process_fasta_file`:

# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])

# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs} 

# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]

# in utils.py>process_fasta_file
for i in range(0, len(lines), 3):  # hard-codes a 3-line-per-record FASTA layout; not robust
    id = lines[i].strip().replace(">", "")

I use `get_esm*_embedding.py` to generate embeddings (see.npy) from a FASTA file like:

>|see   # a header without `|`, such as `>sea`, makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence

When I run `inference.py`, the ID is parsed as `|sea` and the script fails. I adjusted `data_process.py` to make it work.
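
The mismatch can be reproduced without the repository. The sketch below is only an illustration: the helper names are hypothetical, and `record_id` is a stand-in that assumes Biopython's `rec.id` is the first whitespace-delimited token after `>`. It applies the two parsing styles quoted above to the same headers.

```python
def record_id(header: str) -> str:
    """Stand-in for Biopython's rec.id (assumed: first token after '>')."""
    return header[1:].split()[0]


def embedding_style_id(header: str) -> str:
    """ID as parsed in get_esm*_embedding.py: second '|'-separated field."""
    return record_id(header).split("|")[1]


def data_process_style_id(header: str) -> str:
    """ID as parsed in data_process.py: the whole record ID."""
    return record_id(header)


for header in (">|sea", ">sea"):
    try:
        emb_id = embedding_style_id(header)
    except IndexError as exc:
        emb_id = f"IndexError: {exc}"
    print(f"{header!r}: embedding -> {emb_id!r}, data_process -> {data_process_style_id(header)!r}")
```

For `>|sea` the two styles return different strings (`sea` versus `|sea`), and for `>sea` the embedding style raises `IndexError`, which matches the behaviour described above.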

I suggest:

empyriumz commented 2 months ago

Hi @alchemistcai,

Thanks for testing the code and pointing out the inconsistencies. They resulted from handling different FASTA ID conventions while I developed the pipeline. I will refactor the code so that a single function parses the FASTA file and the IDs are handled consistently.
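
A minimal sketch of what a shared parser of this kind might look like. This is not code from the repository: `parse_fasta_id` and `read_fasta` are hypothetical names, and the ID rule (use the second `|`-separated field when it exists and is non-empty, otherwise the whole first token) is an assumption chosen so that both `>|sea` and `>sea` resolve to `sea`.

```python
from typing import Dict, Iterable, List


def parse_fasta_id(header: str) -> str:
    """Return one canonical ID for a FASTA header line (hypothetical rule)."""
    text = header.lstrip(">").strip()
    if not text:
        raise ValueError("empty FASTA header")
    token = text.split()[0]  # drop any description after the ID
    fields = token.split("|")
    # Assumed convention: prefer the second '|' field when it is present.
    if len(fields) > 1 and fields[1]:
        return fields[1]
    return token


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map canonical ID -> sequence, without assuming a fixed lines-per-record layout."""
    chunks: Dict[str, List[str]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = parse_fasta_id(line)
            if current in chunks:
                raise ValueError(f"duplicate FASTA ID: {current}")
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


if __name__ == "__main__":
    example = [">|see", "ACD", "EF", ">sea", "GH"]
    print(read_fasta(example))
```

If every script obtained its IDs from one function like this, the embedding file names and the IDs used at inference time would agree by construction, whichever header convention the input file follows.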