blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

Replace internal sort of alignments and distance matrices with a check for consistent record order across inputs #28

Closed huddlej closed 1 month ago

huddlej commented 1 month ago

In #19, I introduced sorting of alignment sequences and distance matrices by strain name within the pathogen-embed command. By sorting these inputs, I hoped to preempt mismatched strain name orders in multiple user-provided inputs to the same field. For example, two input distance matrices could have strain names in different orders and, to merge their distances, we have to put them in a consistent order. The same goes for multiple input alignments that might have strains in different orders.
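As a toy illustration of why merged distances need a consistent strain order, here is a minimal sketch in plain Python. The function names (`sum_distances`, `reorder`) and the data are hypothetical and are not taken from pathogen-embed; the sketch only shows that adding two matrices position by position gives wrong values when their strain orders differ.

```python
def sum_distances(matrix_a, matrix_b):
    """Add two square distance matrices position by position."""
    return [
        [a + b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(matrix_a, matrix_b)
    ]


def reorder(strains, matrix, target_order):
    """Return the matrix with rows and columns rearranged to target_order."""
    index = {strain: i for i, strain in enumerate(strains)}
    positions = [index[strain] for strain in target_order]
    return [[matrix[i][j] for j in positions] for i in positions]


# Hypothetical toy data: the same three strains in two different orders.
strains_a = ["A", "B", "C"]
dist_a = [[0, 1, 2],
          [1, 0, 3],
          [2, 3, 0]]

strains_b = ["C", "A", "B"]
dist_b = [[0, 20, 30],
          [20, 0, 10],
          [30, 10, 0]]

# Adding by position pairs A-B from the first matrix with C-A from the second.
naive = sum_distances(dist_a, dist_b)

# Putting the second matrix in the first matrix's order pairs like with like.
aligned = sum_distances(dist_a, reorder(strains_b, dist_b, strains_a))
```

In this example the A-B distance should be 1 + 10, but the positional sum silently combines 1 with 20 and raises no error.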

In practice, it is easy to accidentally provide inputs with records in different orders. However, it is also easy to generate distances from an alignment with pathogen-distance, where the records have one order (pathogen-distance does not sort them), and embedding output from pathogen-embed, where the records have another order (pathogen-embed sorts them). This internal sorting of records in the embedding output can be surprising to users who expect the output order to match the input order for their downstream analyses (e.g., me).

An alternative to internal sorting, and one that @nandsra21 and I discussed originally, would be to check whether inputs are in the same order and throw an error when they need to be but are not. The error message could provide an example command showing users how to sort their inputs with standard tools like seqkit and csvtk.
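A minimal sketch of what such a check might look like, assuming the strain names for each input have already been read into lists. The function name `check_same_order` and the error wording are hypothetical, not the actual pathogen-embed implementation, and the message deliberately names the tools without guessing at exact commands.

```python
def check_same_order(named_inputs):
    """Raise ValueError unless every input lists the same records in the same order.

    named_inputs maps a label (for example, a file path) to that input's
    list of strain names.
    """
    items = list(named_inputs.items())
    if len(items) < 2:
        return  # Nothing to compare.

    reference_label, reference_names = items[0]
    reference_names = list(reference_names)

    for label, names in items[1:]:
        names = list(names)
        if names == reference_names:
            continue

        if sorted(names) != sorted(reference_names):
            problem = "contain different records"
        else:
            # Same records, different order: report the first mismatch.
            position = next(
                i for i, (expected, found) in enumerate(zip(reference_names, names))
                if expected != found
            )
            problem = (
                f"list records in different orders (first mismatch at position "
                f"{position}: '{reference_names[position]}' vs. '{names[position]}')"
            )

        raise ValueError(
            f"Inputs '{reference_label}' and '{label}' {problem}. "
            "Sort the inputs consistently (for example, with seqkit or csvtk) "
            "and try again."
        )


# Hypothetical usage with strain names from an alignment and a distance matrix.
check_same_order({
    "alignment.fasta": ["A", "B", "C"],
    "distances.csv": ["A", "B", "C"],
})
```

Failing loudly leaves the output order under the user's control, which addresses the surprise described above, at the cost of one extra sorting step when inputs disagree.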