Inconsistent FASTA ID parsing in `get_esm*_embedding.py` and `data_process.py`.

FASTA IDs are parsed differently in get_esm_embedding.py>process_fasta, get_esm_if_embedding.py>embedding, data_process.py>prep_test_dataset, and utils.py>process_fasta_file:

# in get_esm_embedding.py>process_fasta
ID_list.append(rec.id.split("|")[1])

# in get_esm_if_embedding.py>embedding
ids = [rec.id.split("|")[1] for rec in recs]
seqs = {rec.id.split("|")[1]: str(rec.seq) for rec in recs} 

# in data_process.py>prep_test_dataset
ID_list = [rec.id for rec in recs]

# in utils.py>process_fasta_file
for i in range(0, len(lines), 3):  # hard-coded 3-line record layout, not robust
    id = lines[i].strip().replace(">", "")

I used get_esm*_embedding.py to generate embeddings (e.g. see.npy) from a FASTA file like:

>|see   # a header without "|", such as `>sea`, makes get_esm*_embedding.py raise IndexError: list index out of range
some sequence
>|sea
some sequence

When I then run inference.py, the ID is parsed as |sea rather than sea, and the script fails. I adjusted data_process.py to make it work.
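
The mismatch can be reproduced without the repository code. The sketch below is only an illustration: plain strings stand in for Biopython's rec.id, and the two helper names (embedding_style, data_process_style) are hypothetical wrappers around the expressions quoted above, not functions from the repository.

```python
# Plain strings standing in for rec.id (the header text after ">").
headers = ["|sea", "sea"]


def embedding_style(rec_id):
    # Mirrors rec.id.split("|")[1] as used in get_esm*_embedding.py.
    return rec_id.split("|")[1]


def data_process_style(rec_id):
    # Mirrors ID_list = [rec.id for rec in recs] in data_process.py.
    return rec_id


for header in headers:
    try:
        emb_id = embedding_style(header)
    except IndexError as exc:
        # Raised when the header contains no "|" at all.
        emb_id = f"IndexError: {exc}"
    print(f"{header!r}: embedding scripts -> {emb_id!r}, "
          f"data_process -> {data_process_style(header)!r}")
```

For the header |sea the two strategies return different IDs, and for the header sea the embedding-style parsing raises instead of returning an ID.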

I suggest:

- The scripts above should all use the same strategy to parse IDs.
- Refactor the code so that every script calls a single shared function, which keeps the parsing consistent.
- Add an optional argument, key, that accepts a callable so that users can decide how to parse rec.id, similar to Python's list.sort(key=None).
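
A minimal sketch of what such a shared helper could look like. The function name parse_fasta_id and its default behaviour (take the second "|"-separated field when there is one, otherwise keep the whole ID) are assumptions made for illustration, not existing code in the repository.

```python
from typing import Callable, Optional


def parse_fasta_id(raw_id: str, key: Optional[Callable[[str], str]] = None) -> str:
    """Hypothetical shared ID parser that every script could call.

    raw_id is the header text as Biopython exposes it in rec.id.
    key, if given, fully decides how the ID is derived (like list.sort(key=...)).
    """
    if key is not None:
        return key(raw_id)
    # Assumed default: use the second "|"-separated field when it exists
    # and is non-empty, otherwise fall back to the unmodified ID.
    parts = raw_id.split("|")
    if len(parts) > 1 and parts[1]:
        return parts[1]
    return raw_id


ids = ["|sea", "sea", "sp|P12345|NAME"]
print([parse_fasta_id(i) for i in ids])
# A caller can override the strategy, for example to keep the raw ID:
print([parse_fasta_id(i, key=lambda s: s) for i in ids])
```

With a helper like this, headers with and without "|" no longer raise IndexError, and the embedding and inference scripts would derive the same ID as long as they pass the same key.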
