facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

[BUG] Fasta header with special characters alter output path #691

Open jspaezp opened 5 months ago

Bug description

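FASTA headers that contain special characters (in particular "/") produce unexpected output paths when using the extract entry point. The header is used verbatim as the output file name, so a "/" in it is interpreted as a directory separator.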

https://github.com/facebookresearch/esm/blob/2b369911bb5b4b0dda914521b9475cad1656b2ac/scripts/extract.py#L105C1-L105C67

Reproduction steps

If a FASTA file containing the following entry is used as input:

>sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2
MPTTSRPALDVKGGTSPAKEDANQEMSSVAYSNLAVKDRKAVAILHYPGVASNGTKASGA
PTSSSGSPIGSPTTTPPTKPPSFNLHPAPHLLASMHLQKLNSQYQGMAAATPGQPGEAGP
LQNWDFGAQAGGAESLSPSAGAQSPAIIDSDPVDEEVLMSLVVELGLDRANELPELWLGQ
NEFDFTADFPSSC

Then it generates a .pt file like this, with the "/" in "Cbp/p300" treated as a directory separator -> embeds/sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
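
The split can be reproduced without running the model. The sketch below assumes the output path is built by joining the output directory with the raw label through pathlib, which is what the linked line appears to do; the variable names here are illustrative and not taken from the script. PurePosixPath is used so that nothing is written to disk.

```python
from pathlib import PurePosixPath

# Header from the reproduction above, without the leading ">".
label = (
    "sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 "
    "OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2"
)

# Assumption: the script joins output_dir and "<label>.pt" with the "/" operator.
output_dir = PurePosixPath("embeds")
output_file = output_dir / f"{label}.pt"

# The "/" inside the label becomes a path separator, so the file lands
# in a subdirectory named after the first half of the header.
print(output_file.parent)  # embeds/sp|Q99966|CITE1_HUMAN Cbp
print(output_file.name)    # p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
```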

Expected behavior

The expected behavior is that the special character is handled so that the output .pt file is written directly in the top level of the output directory, rather than being split into a subdirectory.

GOT -> embeds / sp|Q99966|CITE1_HUMAN Cbp / p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
EXPECTED -> embeds / sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
OR throw a warning and generate embeds / sp|Q99966|CITE1_HUMAN Cbp_p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
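
A minimal sketch of the second option (warn and replace the separator) might look like the following. The helper names sanitize_label and output_path_for are hypothetical and do not exist in the esm codebase; the choice of "_" as the replacement follows the example path above.

```python
import warnings
from pathlib import Path


def sanitize_label(label: str, replacement: str = "_") -> str:
    """Hypothetical helper: replace path separators in a FASTA label."""
    cleaned = label.replace("/", replacement).replace("\\", replacement)
    if cleaned != label:
        # Warn so the user knows the file name differs from the header.
        warnings.warn(
            f"Label {label!r} contains path separators; writing it as {cleaned!r}"
        )
    return cleaned


def output_path_for(output_dir: Path, label: str) -> Path:
    """Hypothetical helper: build the .pt path from a sanitized label."""
    return output_dir / f"{sanitize_label(label)}.pt"


label = (
    "sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 "
    "OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2"
)
path = output_path_for(Path("embeds"), label)
print(path.parent)  # embeds
print(path.name)
```

One caveat with this approach: two headers that differ only in "/" versus "_" would map to the same file name, so a real fix may also want to detect collisions.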

LMK if you would like a PR for it!