Closed mikerobeson closed 1 year ago
I started to revisit this, at least the dereplicate.py
code. I think I have an acceptable approach to this. I decided to normalize the taxonomy strings by allowing a user option (--p-coerce-sc-delim
) for the semicolon. That is, they have a choice of ;
, ;
, or ;
. I may change this to --p-semicolon-spacing
, with options for space before
, space after
, both
. 🤷♂️
This takes affect in several places, but primarily before any processing is done.
Anyway, I still need to add the other tests you requested, but I wanted to implement this first.
Hi @mikerobeson , I am not sure that we want to expose another parameter, and especially one that could lead to mistakes. Technically other delimiters could be passed (which could be useful in places, but could also cause other issues), and we create complication one way or another (what is a reasonable default? if none then users have another pesky param to pass every time they run this. if a default, then users may overlook, leading to other possible issues downstream).
I think I am fine with coercing a specific delimiter as you have done, but maybe we just always strip leading and tailing spaces — I think that's what we agreed on above?
Alright. I removed the parameter option for the semicolons, and left a note in the help text to use rescript edit-taxonomy
if the user would like to change the delimiter formatting. I also fixed / updated the test code. I'll work on adding the longer / shorter taxonomy string tests next.
This PR should allow
merge-taxa
anddereplicate
to handle user-defined rank handles. This is an attempt to fix #122.merge-taxa
:--p-new-rank-handles
: is now a list of allowed ranks (e.g.parse-silva-taxonomy
,get-ncbi-data
)--p-apply-new-rank-handles
: is a boolean to apply--p-new-rank-handles
dereplicate
:--p-rank-handles
: is now a list of allowed ranks (e.g.parse-silva-taxonomy
,get-ncbi-data
)--p-backfill
: is a boolean to apply--p-rank-handles
To do:
parse-silva-taxonomy
,get-ncbi-data