facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License
3.29k stars, 645 forks

name of protein != entirety of header #410

Open flowers9 opened 1 year ago

flowers9 commented 1 year ago

Some FASTA files have simple header lines, but others pack in a lot of extra metadata, for example:

Seita.9G099800.1.p pacid=32690791 transcript=Seita.9G099800.1 locus=Seita.9G099800 ID=Seita.9G099800.1.v2.2 annot-version=v2.2

This makes a poor name for the output file and can make the logs wordy, too. I'd suggest trimming it down to the first whitespace-delimited word. In esm/scripts/esmfold_inference.py:

    import re
    regex = re.compile(r'\S+')
    name = regex.match(header).group(0)

And then use {name} instead of {header} as appropriate below.
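As a minimal sketch of the suggestion above, applied to the example header from this issue: the helper name `short_name` is hypothetical and not part of the esm scripts, and the fallback to the full header when nothing matches is an assumption added here, since `regex.match(header)` returns `None` for a header that is empty or starts with whitespace.

```python
import re

# Matches the first run of non-whitespace characters at the start of the string.
FIRST_WORD = re.compile(r"\S+")


def short_name(header: str) -> str:
    """Return the first whitespace-delimited word of a FASTA header.

    Hypothetical helper for illustration only. Falls back to the full
    header if it does not start with a non-whitespace character.
    """
    match = FIRST_WORD.match(header)
    return match.group(0) if match else header


header = (
    "Seita.9G099800.1.p pacid=32690791 transcript=Seita.9G099800.1 "
    "locus=Seita.9G099800 ID=Seita.9G099800.1.v2.2 annot-version=v2.2"
)
name = short_name(header)
print(name)  # Seita.9G099800.1.p
```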

tomsercu commented 1 year ago

~I assume this issue refers to extracting embeddings and/or predicting structures?~ I see now that this refers to esm/scripts/esmfold_inference.py. This proposal makes sense; we'd welcome a PR!

EDIT: I think what you want to achieve could simply be name = header.split()[0]
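For comparison, here is a rough sketch of how the `split()` one-liner and the regex from the original post behave on a few inputs. The function names and the sample headers are made up for illustration, and returning an empty string for an unusable header is an assumption, not something the script is known to do. The two approaches agree on ordinary headers; they differ only when the header starts with whitespace, and the bare `header.split()[0]` raises `IndexError` on an empty header.

```python
import re


def name_from_split(header: str) -> str:
    # split() with no arguments splits on any run of whitespace and ignores
    # leading/trailing whitespace; an empty header yields an empty list.
    parts = header.split()
    return parts[0] if parts else ""


def name_from_regex(header: str) -> str:
    # re.match anchors at position 0, so leading whitespace means no match.
    m = re.match(r"\S+", header)
    return m.group(0) if m else ""


# Made-up sample headers: ordinary, leading whitespace, and empty.
for h in ["seq1 some description", "  padded header", ""]:
    print(repr(h), repr(name_from_split(h)), repr(name_from_regex(h)))
```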