bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License
89 stars 26 forks source link

BUG: fix 'Taxon' column to always be first after merge #113

Closed misialq closed 3 years ago

misialq commented 3 years ago

When the DataFrames get combined the columns will be sorted alphabetically, with the preference to first sort the capitalized column names and only then the other ones. That causes the column 'Taxa' to loose its position as a first column (required) when the other column names were provided as capitalized words.

Changes introduced in this PR allow to first capitalize all the column names to ensure consistent sorting behaviour during merge and then ensure that the 'Taxon' column is always first, regardless of the other columns. The tests were adjusted to reflect this behaviour.

Closes #112.

nbokulich commented 3 years ago

Thanks @misialq ! Why not just reorder so that Taxon is first, and skip the capitalization? Capitalizing is not a big change, but just seems unnecessary if you are still reordering.

misialq commented 3 years ago

That's true - it wouldn't be necessary here. My thinking was to just clean it up to prevent a similar issue from occurring somewhere down the line (not that I can think of any specific place; by similar issue I mean unexpected sorting). I can remove that part though if you think that's unlikely.

nbokulich commented 3 years ago

I suppose one possible future issue that it addresses is inconsistent capitalization (e.g., merging one df with consensus and another with Consensus). But this is probably low priority 🤷 as it should not occur now (theoretically it could if importing a taxonomy from elsewhere)

@mikerobeson what's your take on this? Is it more important to preserve original case or more important to prevent possible future bugs regarding capitalization inconsistency? I think I am happy either way, but I always just get a little bit nervous when altering original column names 😬

mikerobeson commented 3 years ago

@nbokulich, I am also on the fence about whether or not we should capitalize the column names. However, I would not be surprised, when merging taxonomies from different databases, that a something like this may crop up unexpectedly. I agree that an issue would more likely crop up form a user importing their own files, rather than anything we may parse via RESCRIPt. I'm leaning towards keeping the capitalization, as long as it does not require us to update all the test cases.

nbokulich commented 3 years ago

Thanks for the input @mikerobeson ! @misialq has already updated the unit tests to capitalize.

My gut feeling is that we should not alter the column names, but I see possible issues in either direction... @misialq I leave the final decision up to you 😁

misialq commented 3 years ago

Thanks @both, in that case I think I'll keep the capitalization and pray that the god of dataframes looks graciously upon this change.