bokulich-lab / RESCRIPt

REference Sequence annotation and CuRatIon Pipeline
BSD 3-Clause "New" or "Revised" License
89 stars 26 forks source link

ENH: enables `merge-taxa` and `dereplicate` to handle user-defined rank handles. #133

Closed mikerobeson closed 1 year ago

mikerobeson commented 2 years ago

This PR should allow merge-taxa and dereplicate to handle user-defined rank handles. This is an attempt to fix #122.

mikerobeson commented 1 year ago

I started to revisit this, at least the dereplicate.py code. I think I have an acceptable approach to this. I decided to normalize the taxonomy strings by allowing a user option (--p-coerce-sc-delim) for the semicolon. That is, they have a choice of ;, ;, or ;. I may change this to --p-semicolon-spacing, with options for space before, space after, both. 🤷‍♂️

This takes affect in several places, but primarily before any processing is done.

Anyway, I still need to add the other tests you requested, but I wanted to implement this first.

nbokulich commented 1 year ago

Hi @mikerobeson , I am not sure that we want to expose another parameter, and especially one that could lead to mistakes. Technically other delimiters could be passed (which could be useful in places, but could also cause other issues), and we create complication one way or another (what is a reasonable default? if none then users have another pesky param to pass every time they run this. if a default, then users may overlook, leading to other possible issues downstream).

I think I am fine with coercing a specific delimiter as you have done, but maybe we just always strip leading and tailing spaces — I think that's what we agreed on above?

mikerobeson commented 1 year ago

Alright. I removed the parameter option for the semicolons, and left a note in the help text to use rescript edit-taxonomy if the user would like to change the delimiter formatting. I also fixed / updated the test code. I'll work on adding the longer / shorter taxonomy string tests next.