EBI-Metabolights / SAFERnmr


memory is running out; hitting huge wall times #15

Closed: judgemt closed this issue 11 months ago

judgemt commented 11 months ago

Memory use was exceeding 2 TB on the HPC, and a wall time of 3 days was insufficient for MTBLS424.

judgemt commented 11 months ago

Problem: backfitting with very permissive parameter sets on large datasets that produce lots of features (e.g. ~10^4 in MTBLS424) is extremely intensive, since we locally optimize the position-fit of each ref-feature in each subset spectrum. This is the major bottleneck: the number of backfits scales as the product of the average ss.size and the number of matches.
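To make the scaling concrete, here is a minimal sketch of that cost estimate. The function and argument names (`estimate.backfits`, `feature.ss.sizes`, `feature.match.counts`) are illustrative, not the package's actual internals:

```r
# Hypothetical sketch: each feature is backfit once per
# (subset spectrum, match) pair, so the upper bound is the
# per-feature product of ss.size and match count, summed over features.
estimate.backfits <- function(feature.ss.sizes, feature.match.counts) {
  sum(feature.ss.sizes * feature.match.counts)
}

# e.g. 1e4 features, each with ~50-spectrum subsets and ~20 matches:
estimate.backfits(rep(50, 1e4), rep(20, 1e4))  # 1e7 backfits
```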

Solution: we implemented a parameter, max.backfits, to control the total number of backfits. Since we can estimate an upper bound on the number of backfits for each feature (the feature's ss.size times the number of matches it appears in), we can detect un-computable situations ahead of time and make the best of them: we allocate max.backfits backfit spaces, then fill them with the highest-r.val matches. This effectively sets a new r.val cutoff for matches such that the computations will actually finish, so we preserve the flexibility of feature definition/match detection while still prioritizing the highest-quality matches and fits. Other metrics, such as our rmse.weighted, could be used to prioritize matches, but for now we'll stick with r.val, since it is also directly involved in scoring: score = mean(match.rval, fsa.backfit) * (fraction of ref accounted for).
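A minimal sketch of the capping idea, assuming a data frame of candidate matches with columns `rval` (match correlation) and `ss.size` (subset spectrum count); the function and column names are illustrative, not the package's API:

```r
# Hypothetical sketch of max.backfits capping: sort candidate matches by
# r.val, then accept matches (each costing ss.size backfits) until the
# allocated backfit spaces are used up.
cap.backfits <- function(matches, max.backfits = 1e6) {
  # Prioritize the highest-r.val matches.
  matches <- matches[order(matches$rval, decreasing = TRUE), ]
  # Keep matches while the running backfit total fits within the budget;
  # cumsum is monotone, so everything after the first overflow is dropped.
  keep <- cumsum(matches$ss.size) <= max.backfits
  # The r.val of the last accepted match is the effective new cutoff.
  list(matches     = matches[keep, ],
       rval.cutoff = if (any(keep)) min(matches$rval[keep]) else NA)
}
```

The key design point is that the cutoff is not a fixed parameter: it falls out of the budget, so permissive runs on small datasets keep all matches, while the same settings on MTBLS424-sized data degrade gracefully instead of blowing past memory and wall-time limits.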