xtrojak closed this pull request 2 years ago
Hi, I saw that you included a testthat case, that's great. So I guess that both the old and the new approach will pass the test and give identical results? Yours, Steffen
Thanks Steffen - I appreciate you jumping in on this. I am new to this process! The 'identical results' question is the one I was about to ask as well.
Hi. Yes, the integration test is passing in both versions.
I also refactored the code and extracted a function for computing the similarity. We could write unit tests for it, although it is not easy to compare with the old version.
I am not clear on what the 'integration test' is testing, precisely. I am also not clear on what you mean by 'refactored and extracted a function for computing similarity'. The similarity score should not have had to change to make the `.Machine$integer.max` problem go away, should it?
Comparing with the old version, it would be pretty easy to test the ultimate results:

```r
identical(rc.old$featclus, rc.new$featclus)
identical(rc.old$fmz, rc.new$fmz)
```

If those two things are identical, we can be pretty confident the rest is too.
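For illustration, a minimal sketch of that check wrapped as a testthat case; `rc.old` and `rc.new` are hypothetical names for the results of the same run under the master branch and under this PR (how they are produced is elided):

```r
library(testthat)

test_that("new implementation reproduces the old clustering", {
  # rc.old / rc.new: hypothetical results of the same run under the
  # old and the new code; producing them is elided here
  expect_identical(rc.old$featclus, rc.new$featclus)  # per-feature cluster assignment
  expect_identical(rc.old$fmz, rc.new$fmz)            # feature m/z values
})
```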
@xtrojak I see the commit for testing:

```r
expect_equal(readLines(expected_msp), readLines(actual_msp))
```

and it would certainly seem true that if the new .msp file is identical to the old .msp file, then the clustering is identical. Did your test compare the original vs the updated output, and did it pass?
@cbroeckl sorry if my answer was confusing. The expected output in `testdata/output.msp` was created using the old version (currently in master) and is compared in the test against an .msp file created using the new version.
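In other words, the old version's output acts as a golden file. A minimal sketch of that kind of regression test, assuming the reference file lives at `testdata/output.msp` and eliding the step that produces the new output:

```r
library(testthat)

test_that("refactored version reproduces the reference .msp output", {
  expected_msp <- "testdata/output.msp"     # written once by the old (master) version
  actual_msp <- tempfile(fileext = ".msp")  # to be written by the new version
  # ... run the clustering here so it writes actual_msp (elided) ...
  expect_equal(readLines(expected_msp), readLines(actual_msp))
})
```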
@xtrojak - thanks! I will merge, do some testing locally, and push the new version to CRAN. It may take me a couple of days.
@cbroeckl Thank you very much! Yes, whenever we rewrite an algorithm we verify that the outputs of the tool don't change, i.e. that the two versions are equivalent. Refactoring means changing the code structure without changing its function 👍
We already had tests in place for our Galaxy tool, so we basically took those and put them into the actual RAMClustR code to check if the results are consistent.
@xtrojak and @hechth - thanks for your contributions. I am glad that you are finding it valuable enough to continue to invest time into it!
This PR solves issue #19 - the `ff` package fails on the allocation of the correlation matrix for large feature tables with dimensions greater than `.Machine$integer.max` (details). The solution is to avoid allocating the whole matrix, and with it the need for the `ff` package. The idea is to compute only intermediate matrices and store their values in the output vector. More details on the algorithm before and after this PR are outlined in the picture below (sorry for the low quality, it's just a sketch).

Also added a simple integration test of the main functionality and a `conda` environment file for local testing.

Close #19.
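To make the block-wise idea concrete, here is a minimal, hypothetical sketch in R. It is not the actual RAMClustR code: plain `cor()` stands in for the real similarity score, and the function name, variable names, and block size are placeholders. Instead of materializing the full n x n matrix, each intermediate block is computed in turn and its lower-triangle entries are written straight into the condensed output vector:

```r
# Hypothetical sketch: fill the condensed lower-triangle vector block by
# block, so the full n x n matrix is never allocated at once.
blockwise_similarity <- function(data, blocksize = 1000L) {
  n <- ncol(data)                  # number of features (columns)
  out <- numeric(n * (n - 1) / 2)  # condensed vector in as.dist() layout
  for (start in seq(1L, n - 1L, by = blocksize)) {
    end <- min(start + blocksize - 1L, n - 1L)
    # intermediate matrix: one block of features against all features
    block <- cor(data[, start:end, drop = FALSE], data)
    for (i in start:end) {
      j <- (i + 1L):n  # all pairs (i, j) with j > i
      # position of pair (i, j) in the condensed vector (dist() convention)
      idx <- n * (i - 1) - i * (i - 1) / 2 + (j - i)
      out[idx] <- block[i - start + 1L, j]
    }
  }
  out
}
```

Since R supports long atomic vectors, `out` itself can exceed `.Machine$integer.max` elements, and its `as.dist()`-style layout lets downstream hierarchical clustering consume the result without the full matrix ever being held in memory.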