Bug description
FASTA headers containing special characters (such as /) generate inconsistent output paths when using the extract entry point.
https://github.com/facebookresearch/esm/blob/2b369911bb5b4b0dda914521b9475cad1656b2ac/scripts/extract.py#L105C1-L105C67
Reproduction steps
If a FASTA file whose header contains a forward slash (/) is used as input, then it generates a .pt path like this ->
embeds/sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
Expected behavior
The special character should be handled correctly, so that the output .pt file is written at the top level of the output directory rather than split into a subdirectory.
GOT ->
embeds
  /sp|Q99966|CITE1_HUMAN Cbp
    /p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
EXPECTED ->
embeds
  /sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
OR throw a warning and generate
embeds
  /sp|Q99966|CITE1_HUMAN Cbp_p300-interacting transactivator 1 OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2.pt
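
A minimal sketch of the "warn and replace" option, assuming the output path is built by joining the output directory with the label plus ".pt" (I have not checked every code path in extract.py). The helper name safe_output_path and the replacement parameter are hypothetical, not part of the esm codebase. Only "/" is replaced here; other characters such as "|" and spaces are left alone.

```python
import warnings
from pathlib import Path


def safe_output_path(output_dir: Path, label: str, replacement: str = "_") -> Path:
    """Hypothetical helper: build '<output_dir>/<label>.pt' without nesting.

    Any '/' in the label is replaced so that pathlib does not treat it
    as a path separator and create an extra subdirectory.
    """
    safe_label = label.replace("/", replacement)
    if safe_label != label:
        warnings.warn(
            f"Label {label!r} contains '/'; writing to {safe_label!r} instead"
        )
    return output_dir / f"{safe_label}.pt"


label = (
    "sp|Q99966|CITE1_HUMAN Cbp/p300-interacting transactivator 1 "
    "OS=Homo sapiens OX=9606 GN=CITED1 PE=1 SV=2"
)

# Naive join: the '/' in the label becomes a directory boundary.
naive = Path("embeds") / f"{label}.pt"

# Sanitized join: the file stays directly under 'embeds'.
fixed = safe_output_path(Path("embeds"), label)

print(naive.parts)
print(fixed.parts)
```

With the naive join, the .pt file ends up one directory deeper than embeds; with the helper, it sits directly under embeds and a warning reports the renamed label. Replacing with "_" is just one choice, and two labels that differ only by "/" versus "_" would collide.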
LMK if you would like a PR for it!