blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

Alert user when input files are not formatted as CSVs #15

Closed huddlej closed 4 months ago

huddlej commented 5 months ago

All of our commands require CSV input files, but we neither validate the format nor give a clear error message when a user provides a TSV file instead. I ran into this when I tried to run pathogen-embed on a distance matrix produced by snp-dists, which was a TSV file. Instead of reporting that the input had only one column or a filename extension of ".tsv", pathogen-embed failed with this obscure error message from sklearn's internals: ValueError: at least one array or dtype is required

We should at least check the filename extension, or even use Python's built-in CSV sniffer (csv.Sniffer), to check whether the input files are formatted properly.
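
For illustration, here is a minimal sketch of what such a check could look like. It is not pathogen-embed's actual code: the function name check_csv_input, the 64 KiB sample size, and the error wording are all assumptions made for this example. It combines the two ideas above (the filename extension and Python's csv.Sniffer) and falls back to counting header columns when the sniffer cannot decide.

```python
import csv
from pathlib import Path


def check_csv_input(path):
    """Raise ValueError with a readable message if `path` does not look like a CSV file.

    Hypothetical helper for illustration; not part of pathogen-embed.
    """
    path = Path(path)

    # Cheap check first: a ".tsv" extension is a strong hint of the wrong format.
    if path.suffix.lower() == ".tsv":
        raise ValueError(
            f"{path} has a .tsv extension, but a comma-separated (CSV) file is required."
        )

    # Read only a sample so large distance matrices are not loaded in full.
    with path.open(newline="") as handle:
        sample = handle.read(64 * 1024)

    if not sample.strip():
        raise ValueError(f"{path} is empty.")

    # Ask the built-in sniffer to choose between comma and tab.
    try:
        delimiter = csv.Sniffer().sniff(sample, delimiters=",\t").delimiter
    except csv.Error:
        delimiter = None  # Sniffer could not decide; rely on the column count below.

    if delimiter == "\t":
        raise ValueError(
            f"{path} appears to be tab-delimited, but a comma-separated (CSV) file is required."
        )

    # Fallback: a header that parses to fewer than two comma-separated columns is suspicious.
    header = next(csv.reader(sample.splitlines()))
    if len(header) < 2:
        raise ValueError(
            f"{path} has only {len(header)} comma-separated column(s) in its first line; "
            "is it really a CSV file?"
        )
```

Calling a check like this before the file is handed to pandas or sklearn would let a TSV input fail early with a message that names the file and the likely cause, rather than surfacing the sklearn error above.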