cbroeckl / RAMClustR

Assigning precursor-product ion relationships in indiscriminant MS/MS data
MIT License

Support large feature tables #20

Closed. xtrojak closed this 2 years ago

xtrojak commented 2 years ago

This PR solves issue #19: the ff package fails when allocating the correlation matrix for large feature tables, once the number of matrix cells exceeds .Machine$integer.max (see the issue for details).

The solution was to avoid allocating the whole matrix, and with it the ff package, entirely. The idea is to compute only small intermediate matrices and store their values in the output vector. The algorithm before and after this PR is outlined in the picture below (sorry for the low quality, it's just a sketch).
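To make the idea concrete, here is a minimal R sketch of the block-wise approach, under stated assumptions: blockwise_cor and block_size are hypothetical names, and the real PR operates on RAMClustR's feature data and writes similarity scores rather than raw correlations.

blockwise_cor <- function(data, block_size = 1000) {
  # Avoid one n x n correlation matrix (ff fails once n^2 exceeds
  # .Machine$integer.max); instead, correlate one block of feature
  # columns at a time and copy each pair's value into a condensed
  # lower-triangle vector, ordered as in dist().
  n   <- as.numeric(ncol(data))     # numeric to avoid integer overflow
  out <- numeric(n * (n - 1) / 2)   # condensed output vector
  for (start in seq(1, n - 1, by = block_size)) {
    end   <- min(start + block_size - 1, n - 1)
    block <- cor(data[, start:end, drop = FALSE], data)  # small intermediate matrix
    for (i in start:end) {
      j   <- (i + 1):n                                   # partners with j > i
      idx <- n * (i - 1) - i * (i - 1) / 2 + (j - i)     # dist()-style index
      out[idx] <- block[i - start + 1, j]
    }
  }
  out
}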

Also added a simple integration test of the main functionality and a conda environment file for local testing.

Close #19.


sneumann commented 2 years ago

Hi, I saw that you included a testthat case, that's great. So I guess that both the old and the new approach will pass the test and give identical results? Yours, Steffen

cbroeckl commented 2 years ago

Thanks Steffen - I appreciate you jumping in on this. I am new to this process! The 'identical results' question is the one I was about to ask as well.

xtrojak commented 2 years ago

Hi. Yes, the integration test is passing in both versions.

I also refactored the code and extracted the similarity computation into its own function. We could write unit tests for it, although it is not easy to compare it with the old version.
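For context, the score being extracted is the RAMClust similarity published in Broeckling et al. (2014): a Gaussian term on (1 - correlation) multiplied by a Gaussian term on the retention-time difference. A hedged sketch of what such an extracted helper could look like (the function name, argument names, and defaults are illustrative, not the PR's actual code):

feature_similarity <- function(cor_ij, rt_i, rt_j, sr = 0.5, st = 5) {
  # sr, st: sigma parameters for the correlation and retention-time terms
  exp(-(1 - cor_ij)^2 / (2 * sr^2)) * exp(-(rt_i - rt_j)^2 / (2 * st^2))
}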

cbroeckl commented 2 years ago

I am not clear what the 'integration test' is testing, precisely. I am also not clear what you mean by 'refactored and extracted a function for computing similarity'. The similarity score should not have had to change to make the integer.max problem go away?

Comparing with the old version:

It would be pretty easy to test the ultimate results:

identical(rc.old$featclus, rc.new$featclus)
identical(rc.old$fmz, rc.new$fmz)

If those two things are identical, we can be pretty confident the rest is too.
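One caveat, assuming rc.old and rc.new are ramclustR() return objects from the two versions run on the same input: identical() requires bit-for-bit equality, so if the refactor reorders any floating-point arithmetic, a tolerance-based check is the more robust version of the same test:

all.equal(rc.old$fmz, rc.new$fmz, tolerance = 1e-8)  # TRUE if equal within tolerance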

cbroeckl commented 2 years ago

@xtrojak i see the commit for testing:

expect_equal(readLines(expected_msp), readLines(actual_msp))

and it would certainly seem true that if the new .msp file is identical to the old .msp file, then the clustering is identical. Did your test compare the original vs the updated version, and did it pass?

xtrojak commented 2 years ago

@cbroeckl sorry if my answer was confusing. The expected output in "testdata/output.msp" was created using the old version (currently in master); in the test, it is compared to the .msp file created by the new version.
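In other words, this is a regression test against a frozen reference file. A minimal sketch of that shape (the expected-output path comes from the comment above; the test wrapper and the run_new_version() helper are hypothetical):

library(testthat)

test_that("new implementation reproduces the reference .msp output", {
  expected_msp <- "testdata/output.msp"  # frozen output of the old master code
  actual_msp <- run_new_version()        # hypothetical helper returning the new file's path
  expect_equal(readLines(expected_msp), readLines(actual_msp))
})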

cbroeckl commented 2 years ago

@xtrojak - thanks! I will merge, do some testing locally, and push the new version to CRAN. It may take me a couple of days.

hechth commented 2 years ago

@cbroeckl Thank you very much! Yes, when we rewrite an algorithm we verify that the outputs of the tool don't change, i.e. that the two versions are equivalent. Refactoring means changing code structure without changing behavior 👍

We already had tests in place for our Galaxy tool, so we basically took those and put them into the RAMClustR code itself to check that the results are consistent.

cbroeckl commented 2 years ago

@xtrojak and @hechth - thanks for your contributions. I am glad that you are finding it valuable enough to continue to invest time into it!