name of protein != entirety of header

facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

MIT License


Some FASTA files have simple header lines, but others pack in a lot of extra metadata, for example:

Seita.9G099800.1.p pacid=32690791 transcript=Seita.9G099800.1 locus=Seita.9G099800 ID=Seita.9G099800.1.v2.2 annot-version=v2.2

This is not a good name for the output file, and it makes the logs wordy too. I'd suggest trimming it down to the first whitespace-delimited word. In esm/scripts/esmfold_inference.py:

    import re
    regex = re.compile(r'\S+')
    name = regex.match(header).group(0)

And then use {name} instead of {header} as appropriate below.
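
For illustration, here is a minimal, self-contained sketch of the same idea applied to the example header above. The helper name `short_name` is hypothetical and not part of esm. It uses `search` rather than `match` so that a header with leading whitespace still yields its first word, and it falls back to the stripped header if no word is found.

```python
import re

# Matches the first run of non-whitespace characters.
_FIRST_TOKEN = re.compile(r"\S+")


def short_name(header: str) -> str:
    """Return the first whitespace-delimited word of a FASTA header.

    Hypothetical helper for illustration; not part of the esm codebase.
    """
    match = _FIRST_TOKEN.search(header)
    # Fall back to the stripped header (an empty string) if there is no word.
    return match.group(0) if match else header.strip()


header = (
    "Seita.9G099800.1.p pacid=32690791 transcript=Seita.9G099800.1 "
    "locus=Seita.9G099800 ID=Seita.9G099800.1.v2.2 annot-version=v2.2"
)
name = short_name(header)
print(name)
```

Note that two records whose headers share the same first word would then map to the same output name, so a script adopting this might want to check for collisions.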

name of protein != entirety of header #410