aces / cbrain

CBRAIN is a flexible Ruby on Rails framework for accessing and processing of large data on high-performance computing infrastructures.
GNU General Public License v3.0

Create new Boutiques module 'BoutiquesInputCacheCleaner' #1203

Closed prioux closed 2 years ago

prioux commented 2 years ago

This module would invoke 'cache_erase' on input files, once a task is complete, using a new specification in the descriptor:

"custom": {
  "cbrain:integrator_modules": {
    "BoutiquesInputCacheCleaner": [
      "inputid1",
      "inputid2"
    ]
  }
}

That way, when processing large datasets in parallel where the dataset is split into chunks (e.g. BidsSubjects of the UKBB), we can erase each subject's data from the cache once its task is done and free some disk space.
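
As a rough illustration of how the list above could be read from a descriptor, here is a minimal Ruby sketch. It only uses the standard JSON library; the helper name `inputs_to_clean` and the surrounding descriptor fields are hypothetical and not part of CBRAIN's actual API.

```ruby
require 'json'

# A stripped-down descriptor; only the "custom" section comes from the
# proposal above, the "name" field is a made-up placeholder.
descriptor_json = <<~JSON
  {
    "name": "example_tool",
    "custom": {
      "cbrain:integrator_modules": {
        "BoutiquesInputCacheCleaner": ["inputid1", "inputid2"]
      }
    }
  }
JSON

# Hypothetical helper: returns the input IDs configured for the module,
# or an empty list when the descriptor does not configure it at all.
def inputs_to_clean(descriptor)
  descriptor.dig("custom", "cbrain:integrator_modules", "BoutiquesInputCacheCleaner") || []
end

descriptor  = JSON.parse(descriptor_json)
cleaner_ids = inputs_to_clean(descriptor)
puts cleaner_ids.inspect
```

Returning an empty list for descriptors without the key keeps the module a no-op for tools that do not opt in.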

MontrealSergiy commented 2 years ago

@prioux should it support cbcsv lists?

prioux commented 2 years ago

cbcsv files are replaced by their individual file components at that point, so there is no need to even implement anything particular.

prioux commented 2 years ago

I'm increasing the priority on this one because we'll need it for the UK BioBank processing.

prioux commented 2 years ago

I'm adding a requirement. To prevent a task from erasing an input that is also used by another task, make a check on the timestamp of the SyncStatus object.

After the setup() method, record a timestamp in the metadata of the task:

task.meta[:setup_time] = Time.now

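Then, just before attempting the cleanup of the input file, fetch its SyncStatus object and compare its access time with the timestamp recorded:

# If there is no SyncStatus, fall back to Time.now so the file is not erased
last_access = inputfile.local_sync_status&.accessed_at || Time.now
if last_access <= task.meta[:setup_time]
  inputfile.cache_erase  # nobody has synced the file since our setup
end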

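The accessed_at attribute is the one that gets updated whenever any process invokes sync_to_cache(), so a value later than the task's setup time means some other process has synced the file since.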
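
The check described above can be tried outside of CBRAIN with a small self-contained sketch. The `SyncStatus`, `InputFile` and `Task` structs below are stand-ins invented for illustration (the real CBRAIN models are ActiveRecord classes), and `erase_if_unshared` is a hypothetical helper name; only `accessed_at`, `local_sync_status`, `cache_erase` and `meta[:setup_time]` come from the comment.

```ruby
# Stand-ins for the CBRAIN objects mentioned above (assumptions, not the real classes).
SyncStatus = Struct.new(:accessed_at)
InputFile  = Struct.new(:name, :local_sync_status, :erased) do
  # The real method removes the cached copy; here we only record the call.
  def cache_erase
    self.erased = true
  end
end
Task = Struct.new(:meta)

# Erase the cached input only if nothing has synced it after the task's setup.
# With no SyncStatus at all, fall back to the current time, which skips the erase.
def erase_if_unshared(task, inputfile)
  last_access = inputfile.local_sync_status&.accessed_at || Time.now
  return false unless last_access <= task.meta[:setup_time]
  inputfile.cache_erase
  true
end

setup_time = Time.at(1_000)
meta       = { setup_time: setup_time }
task       = Task.new(meta)

untouched = InputFile.new("sub-01", SyncStatus.new(setup_time - 60), false)
shared    = InputFile.new("sub-02", SyncStatus.new(setup_time + 60), false)

erased_untouched = erase_if_unshared(task, untouched) # last synced before setup
erased_shared    = erase_if_unshared(task, shared)    # synced again by someone else
```

In this sketch the first file is erased and the second is left alone, because its `accessed_at` is later than the task's recorded setup time.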

MontrealSergiy commented 2 years ago

@prioux I guess I can start on it, but are you sure the timestamp comparison will catch every conflict? My feeling is that it might fail when tasks interleave: tasks running on different processors, under different tools, or slowed down by other load can overlap, and starting a task earlier does not guarantee that it finishes earlier. I think a better locking scheme could be devised, though with more effort, for example by storing a timestamp for each input file.

MontrealSergiy commented 2 years ago

With some effort I managed to trigger an exception (I guess by restarting many tasks one after another). [image]