gamcil / clinker

Gene cluster comparison figure generator
MIT License

Align Error: seq contains letters not in the alphabet #70

Closed LeoBusse closed 3 years ago

LeoBusse commented 3 years ago

Hello, thank you for developing a great tool!

I've been trying to figure out where the "ValueError: sequence contains letters not in the alphabet" error comes from when I run my .gbf/.gb files through Clinker. I went through issue #68 and installed Clinker 0.0.21 through Conda again, but to no avail. I also tried the pip install, but that didn't fix the problem. I double-checked the align.py script on my local computer and it has the extend_matrix_alphabet addition, so I'm not sure what to do. You mentioned a quick fix would be to go through the sequences and delete anything that is not part of the extended IUPAC alphabet. Is there a particular way you recommend doing this? I have several sequences, so it seems like it would take a long time to identify anything wrong in them (I would be looking for numbers, right?).

I attached an image with the traceback in case it's helpful.

Thank you so much!

[Image: Screen Shot 2021-06-02 at 6 05 50 PM]

gamcil commented 3 years ago

If your image is anything to go by, it looks like your sequences have gap characters in them (B starts with -). I'll have to add them to the extended set. In the meantime, you could search for the gap characters and replace them with X, or just delete them (usually only a few rogue sequences are affected). I'll try to get a fix out for this soon.
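
For anyone who wants to script that search and replace, here is a minimal sketch in plain Python. It is not part of clinker. It assumes the gap characters sit inside the `/translation` qualifiers of the GenBank files, and that the accepted alphabet is the 20 standard amino acids plus the extended IUPAC codes B, X, Z, J, U and O. The names `find_unexpected` and `replace_gaps` are hypothetical helpers made up for this example.

```python
import re
from collections import Counter

# Assumption: 20 standard amino acids plus extended IUPAC codes B, X, Z, J, U, O.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWYBXZJUO")

# Matches a /translation="..." qualifier, including wrapped continuation lines.
TRANSLATION = re.compile(r'/translation="([^"]*)"')


def find_unexpected(genbank_text):
    """Count characters in /translation qualifiers that are outside ALLOWED."""
    counts = Counter()
    for match in TRANSLATION.finditer(genbank_text):
        residues = "".join(match.group(1).split())  # drop line wrapping
        counts.update(ch for ch in residues.upper() if ch not in ALLOWED)
    return counts


def replace_gaps(genbank_text, gap="-", replacement="X"):
    """Replace gap characters inside /translation qualifiers only.

    Pass replacement="" to delete the gaps instead of masking them with X.
    """
    def fix(match):
        return '/translation="' + match.group(1).replace(gap, replacement) + '"'

    return TRANSLATION.sub(fix, genbank_text)
```

To use it on a file, read the text (for example with `pathlib.Path.read_text`), call `find_unexpected` first to see which characters are present and how often, then write the output of `replace_gaps` to a new file so the original stays untouched. Only text inside `/translation` qualifiers is changed, so hyphens in product names or locus tags are left alone.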

Looks like I also forgot to remove some logging calls there so thanks for reminding me haha.

LeoBusse commented 3 years ago

Thank you so much!

I looked for the gaps as suggested and it works perfectly now! I really appreciate the advice.