Closed darcyabjones closed 3 years ago
I exchanged some emails with the PHI-base team. It looks like the issue was with the particular encoding of the character. They said they'll be standardising on ANSI-only characters from now on.
I think they specifically mean ASCII or extended ASCII, given the validator they plan to use (https://onlineasciitools.com/validate-ascii), but I'll continue to monitor.
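For monitoring on our side, a rough local check could look like the sketch below. It only flags bytes above 0x7F (strict 7-bit ASCII); whether that matches what the PHI-base validator accepts is an assumption, and `find_non_ascii` is a hypothetical helper name, not something in the pipeline.

```python
def find_non_ascii(lines):
    """Yield (line_number, offending_bytes) for each line containing bytes > 0x7F.

    `lines` is any iterable of bytes objects, e.g. a file opened in "rb" mode.
    Hypothetical helper: strict 7-bit ASCII is an assumption about what
    "ASCII only" will mean in practice.
    """
    for number, line in enumerate(lines, start=1):
        # Iterating over a bytes object yields integers, one per byte.
        bad = bytes(b for b in line if b > 0x7F)
        if bad:
            yield number, bad
```

Running it over the FASTA opened in binary mode (`open(path, "rb")`) lists the line numbers and raw bytes to report back upstream.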
An update to the current FASTA file is apparently coming.
We now delete any non-printable characters using sed. This appears to be enough for now.
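For reference, the same idea expressed in Python rather than sed, as a minimal sketch. It assumes "printable" means ASCII 0x20 to 0x7E plus tab and newline, which may not be exactly the character class the sed command uses; `strip_non_printable` and `clean_fasta` are hypothetical names.

```python
import re

# Assumption: keep printable ASCII (0x20-0x7E), tab and newline; drop the rest.
# Carriage returns are dropped too, so CRLF line endings become LF.
_NON_PRINTABLE = re.compile(rb"[^\x20-\x7E\t\n]")


def strip_non_printable(data: bytes) -> bytes:
    """Return `data` with every byte outside the kept set removed."""
    return _NON_PRINTABLE.sub(b"", data)


def clean_fasta(in_path, out_path):
    """Copy a FASTA file line by line, stripping unwanted bytes."""
    # Binary mode avoids decode errors on badly encoded input.
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            dst.write(strip_non_printable(line))
```

Working on raw bytes means a multi-byte UTF-8 character is removed entirely rather than replaced, so header text can lose characters; that is acceptable only if the affected fields are not used as identifiers downstream.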
PHI-base FASTA files sometimes contain weird characters that screw up the parsing of MMSeqs results. We should add a step to remove or replace non-UTF-8 or non-ASCII characters before running MMSeqs.