ccdmb / predector

Effector prediction pipeline based on protein properties.
Apache License 2.0

FEATURE REQUEST: Sanitise PHI-base input #41

Status: Closed (darcyabjones closed this 3 years ago)

darcyabjones commented 3 years ago

PHI-base FASTA files sometimes contain unusual characters that break the parsing of MMSeqs results. We should add a step to remove or replace non-ASCII (or invalid UTF-8) characters before running MMSeqs.

darcyabjones commented 3 years ago

I exchanged some emails with the PHI-base team. It looks like the issue was with the particular encoding of the character. They said they'll be standardising on ANSI-only characters from now on.

I think they specifically mean ASCII or extended ASCII, given the validator they plan to use (https://onlineasciitools.com/validate-ascii), but I'll continue to monitor.
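For anyone who wants to check a FASTA file locally in the meantime, here is a minimal sketch of an ASCII check. It assumes "ASCII" means code points 0 to 127; the function name `find_non_ascii` and the example header are hypothetical and are not part of predector or PHI-base.

```python
def find_non_ascii(lines):
    """Return (line_number, column, character) for every non-ASCII character.

    Line and column numbers are 1-based so they match what an editor shows.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # outside the 7-bit ASCII range
                hits.append((line_no, col, ch))
    return hits


# Hypothetical record with a non-breaking space (U+00A0) hidden in the header.
example = [">PHI:0001 hypothetical\u00a0protein", "MKTAYIAKQR"]

for line_no, col, ch in find_non_ascii(example):
    print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```

Reporting the position rather than silently fixing it makes it easier to send the offending records back upstream.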

An update to the current FASTA file is apparently coming.

darcyabjones commented 3 years ago

We now delete any non-printable characters using sed. This appears to be enough for now.
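The exact sed command is not shown in this thread, so the following is only a rough Python sketch of the same idea, not the pipeline's actual step. It assumes "printable" means ASCII 0x20 to 0x7E; the names `strip_non_printable` and `sanitise_fasta` are hypothetical.

```python
def strip_non_printable(text):
    """Delete every character outside printable ASCII (space through tilde)."""
    return "".join(ch for ch in text if 32 <= ord(ch) <= 126)


def sanitise_fasta(lines):
    """Yield cleaned lines, keeping one newline at the end of each."""
    for line in lines:
        # Strip the line ending first so it is not treated as non-printable.
        yield strip_non_printable(line.rstrip("\r\n")) + "\n"


# Hypothetical record: a non-breaking space and a BEL control character.
dirty = [">PHI:0001 hypothetical\u00a0protein\x07\n", "MKTAYIAKQR\n"]
for cleaned in sanitise_fasta(dirty):
    print(cleaned, end="")
```

Note that deleting characters can join neighbouring words in a header (the non-breaking space above simply disappears). Replacing with a plain space would be an alternative if header text needs to stay readable.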