As per PR #38

A number of things came up as I was going through the vignette. I've fixed several of them in this PR, but take a look to see what I've done. Most of my code fixes are hacks on the R end; you may want to correct them on the Rcpp end. The tests are failing and need to be fixed.

[x] Change repo name to "clustur" (all lower case)
[x] Vignette has get_distance, but I think the function is actually get_distance_data_frame. Can we make it get_distance_df?
[x] get_distance doesn't appear to actually be applying the cutoff, at least as written in the vignette. Both get_distance_data_frame(column_distance) and get_distance_data_frame(phylip_distance) have values larger than 0.03, and I also see -1 values. Both also include self comparisons and the upper triangle of the matrix (there are 9604 values = 98 * 98).
[x] The output of cluster() had "OTU98" in one data frame and "otu98" in another data frame. I hacked a solution in cluster.R to make everything upper case. I think I'd rather it all be in lower case, but regardless of the case, the two data frames need to be the same
[x] The label should be its own field in the list coming out of cluster() and doesn't need to be included in the data frames. Again, I added a hack to get this how I'd like to see it. There's probably a more elegant way to do this in the Rcpp code. With my hack, the hierarchical methods all return a cutoff of 0.00 rather than 0.03. Looking at the cluster_dfs object, it appears the first data frame doesn't have the label, but the second does. I'm grabbing it from the first.
[x] Leave both data frames ordered by the OTU number, not the count. I commented out that code, but it needs to be removed.
[x] In cluster_dfs in cluster(), the second data frame has a bins column; this should be sequences.
[x] get_cutoff() should likely be get_label() to be consistent with how we call things in mothur. I changed this in the code/documentation, but it probably broke stuff
[x] I changed the other_metrics field in cluster_dfs to be iteration_metrics.
[x] I think cluster_metrics and iteration_metrics are giving garbage results when using a cutoff of 0.03
[x] Don't need the "This is a column file. Processing now..." type output from read_dist()
[x] I feel like cluster() needs a cutoff field if it isn't given in read_distance(). An example would be if someone reads in the full distance matrix, but wants a specific cutoff. This would likely be needed with average neighbor where someone would read in with a cutoff of 0.20, but cluster to 0.03.
[x] There isn't any checking to make sure values are valid. E.g., if I put in method = 'neighbor' it doesn't error out or do anything
[x] Finishing lingering issues
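
Two of the items above (applying the cutoff to the distances and checking the `method` argument) can be sketched in plain R. This is only an illustration under stated assumptions, not the package's implementation: `filter_distances()` and `check_method()` are hypothetical helpers, the column names `row`, `column`, and `distance` are assumed, the -1 values are assumed to be sentinels that should be dropped, and the list of allowed method names is a placeholder.

```r
# Hypothetical helper: keep each pair once, drop self comparisons,
# drop negative sentinel values, and drop distances above the cutoff.
# Assumes columns named row, column, and distance (not confirmed).
filter_distances <- function(dist_df, cutoff = 0.03) {
  stopifnot(is.numeric(cutoff), length(cutoff) == 1, cutoff >= 0)
  keep <- dist_df$row < dist_df$column &  # one triangle only, no self pairs
    dist_df$distance >= 0 &               # removes -1 style sentinels
    dist_df$distance <= cutoff            # removes distances above the cutoff
  # which() ignores NA, so rows with missing distances are dropped too
  dist_df[which(keep), , drop = FALSE]
}

# Hypothetical helper: fail loudly on an unknown method name.
# The choices below are placeholders, not the package's actual list.
check_method <- function(method,
                         choices = c("furthest", "nearest", "average")) {
  if (!is.character(method) || length(method) != 1 || !method %in% choices) {
    stop("`method` must be one of: ", paste(choices, collapse = ", "),
         call. = FALSE)
  }
  method
}

# Small made-up example: a full 2 x 2 comparison plus two extra pairs
toy <- data.frame(
  row = c("a", "a", "b", "b", "a"),
  column = c("a", "b", "a", "c", "c"),
  distance = c(0, 0.01, 0.01, 0.05, -1),
  stringsAsFactors = FALSE
)
filtered <- filter_distances(toy, cutoff = 0.03)
```

With the toy data, only the a/b pair at 0.01 survives, and `check_method("neighbor")` stops with an error instead of silently doing nothing. The exact matching in `check_method()` is deliberate: `match.arg()` would also accept partial names, which may or may not be wanted here.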

A more general question is what we should be calling things. We have label and cutoff, abundance and shared, and sequences/bins/otus. I see my own inconsistency in all this :) Now that I think about it, let's leave it as cutoff rather than label. I'm not sure what to do about sequences/bins/otus. For 16S/mothur we would use sequence/otus. For mums2 we'd want features/omus. A more generic scheme could be features/bins. Let me think more on this...

SchlossLab / clustur
