Open GregJohnsonJr opened 6 days ago
Lingering issues...
read_distance()
sparse_count_table <- read_count(example_path("amazon.sparse.count_table"))
cluster_metrics and iteration_metrics to be "iter time label num_otus cutoff tp tn fp fn sensitivity specificity ppv npv fdr accuracy mcc f1score". Why is time value so large?
label to cutoff throughout package
get_clusters to get_bins
get_shared to get_abundance
cluster. Default to 0.03 for all the things.
A number of things came up as I was going through the vignette. I've fixed several of them with this PR, but take a look to see what I've done. Most of my code fixes are hacks on the R end. You may want to correct them on the Rcpp end. The tests are failing and need to be fixed...
get_distance, but I think the function is actually get_distance_data_frame. Can we make it get_distance_df?
get_distance doesn't appear to actually be removing values above the cutoff, at least as written in the vignette. Both get_distance_data_frame(column_distance) and get_distance_data_frame(phylip_distance) have values larger than 0.03 and I see -1 values. They both have self comparisons as well as the upper triangle of the matrix (there are 9604 values = 98 * 98).
cluster() had "OTU98" in one data frame and "otu98" in another data frame. I hacked a solution in cluster.R to make everything upper case. I think I'd rather it all be in lower case, but regardless of the case, the two data frames need to be the same.
cluster() and doesn't need to be included in the data frames. Again I added a hack to get this how I'd like to see it. There's probably a more elegant way to do this in the Rcpp code. With my hack the hierarchical methods all return a cutoff of 0.00 rather than 0.03. Looking at the cluster_dfs object it appears the first data frame doesn't have the label, but the second does. I'm grabbing it from the first.
cluster_dfs in cluster() the second data frame has a bins column, this should be sequences.
get_cutoff() should likely be get_label() to be consistent with how we call things in mothur. I changed this in the code/documentation, but it probably broke stuff.
other_metrics field in cluster_dfs to be iteration_metrics.
cluster_metrics and iteration_metrics is giving garbage results when using a cutoff of 0.03
read_dist()
cluster() needs a cutoff field if it isn't given in read_distance(). An example would be if someone reads in the full distance matrix, but wants a specific cutoff. This would likely be needed with average neighbor where someone would read in with a cutoff of 0.20, but cluster to 0.03.
method = 'neighbor' it doesn't error out or do anything
A more general question is what we should be calling things. We have label and cutoff, abundance and shared, and sequences/bins/otus. I see my own inconsistency in all this :) Now that I think about it, let's leave it as cutoff rather than label. I'm not sure what to do about sequences/bins/otus. For 16S/mothur we would use sequence/otus. For mums2 we'd want features/omus. A more generic scheme could be features/bins. Let me think more on this...
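For the distance data frame item above, here is a rough base R sketch of what the filtered output might be expected to look like. It does not use the package's functions: the column names i, j, and distance, the simulated data, and the helper filter_distances() are all hypothetical and only illustrate the checks (no -1 values, nothing above the cutoff, no self comparisons, one triangle only).

```r
# Hypothetical stand-in for a full 98 x 98 distance matrix in long format.
# The column names (i, j, distance) are assumptions for illustration only.
n <- 98
set.seed(1)
full <- expand.grid(i = seq_len(n), j = seq_len(n))
full$distance <- runif(nrow(full), min = 0, max = 0.1)
full$distance[full$i == full$j] <- 0          # self comparisons
full$distance[sample(nrow(full), 50)] <- -1   # simulate the -1 values seen

# Hypothetical helper: keep one triangle, drop self comparisons,
# drop -1 placeholders, and drop anything above the cutoff.
filter_distances <- function(df, cutoff = 0.03) {
  keep <- df$i > df$j &
    df$distance >= 0 &
    df$distance <= cutoff
  df[keep, , drop = FALSE]
}

filtered <- filter_distances(full, cutoff = 0.03)

nrow(full)          # 9604 = 98 * 98, the full square matrix
n * (n - 1) / 2     # 4753, the most rows one triangle can contribute
nrow(filtered)      # at most 4753; depends on how many pairs pass the cutoff
```

With 98 sequences, a data frame that drops self comparisons and one triangle can have at most 4753 rows, so seeing 9604 rows is a quick sign that no filtering has happened.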