biigle / maia

:m: BIIGLE module for the Machine Learning Assisted Image Annotation method
GNU General Public License v3.0

Delete unselected training proposals after select/refine stage #18

Closed (mzur closed this issue 5 years ago)

mzur commented 5 years ago

The disk space required by many MAIA jobs (for training proposal and annotation candidate thumbnails) may be too much for our setup. In my tests, a single job required more than 1 GB of disk space. Even though we have plenty of free storage right now, we can't easily scale to a few hundred jobs. To mitigate this problem, we could:

  1. Delete all patches of unselected training proposals once instance segmentation is started. This should be most effective. The UI must be able to handle this gracefully (i.e. no longer display unselected training proposals).
  2. Delete all patches of annotation candidates when they are converted to annotations. This might happen anyway as I plan to handle "selected" annotation candidates as if they no longer exist.
  3. Use the object storage to store patches. This is not easy, as I found that the object storage does not cope well with millions of objects (which we reach easily) in a single container. We could wrap the patches of a job in a tar or zip archive (as we do with tiled images), but the size of the archive would noticeably delay opening a MAIA job, and we would need a dedicated caching mechanism for this.
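
As a rough illustration of option 1, the cleanup could look something like the Python sketch below. This is not the module's actual code: the function name `delete_unselected_patches`, the `{"id": ..., "selected": ...}` record shape and the `<id>.jpg` file layout inside one job directory are all assumptions made for the example.

```python
from pathlib import Path


def delete_unselected_patches(patch_dir, proposals, extension="jpg"):
    """Delete the patch files of unselected training proposals.

    Hypothetical layout: each proposal is a dict with "id" and "selected"
    keys, and its patch is stored as <patch_dir>/<id>.<extension>.
    Returns the IDs whose patch files were actually removed.
    """
    removed = []
    for proposal in proposals:
        if proposal["selected"]:
            # Selected proposals are still needed for the next stage.
            continue
        path = Path(patch_dir) / f"{proposal['id']}.{extension}"
        if path.exists():
            path.unlink()
            removed.append(proposal["id"])
    return removed
```

Proposals whose patch file is already missing are skipped silently, so the cleanup can be run more than once without errors.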

mzur commented 5 years ago

Point 1 is implemented. Point 2 won't be implemented, as I now plan to display even converted annotation candidates permanently. There are also far fewer annotation candidates than training proposals.

Point 3 might become interesting if we start to run out of storage space. If we find a solution that does not require assigning a UUID to each proposal/candidate, it would be useful for regular annotation patches as well.

mzur commented 5 years ago

Maybe the URLs for annotation patches, training proposals and annotation candidates can be a combination of image UUID and model ID. Example:

patches/aa/14/aa141561-5b7c-43f1-8f6d-75795eaa1b03/12345.jpg

Here, aa141561-5b7c-43f1-8f6d-75795eaa1b03 is the image UUID and 12345 is the ID of the annotation patch/training proposal/annotation candidate. This would allow the files to be served publicly, as the image UUID is hard to guess. Even if somebody guesses the UUID, they still have to try all model IDs, and even then they can only grab the thumbnails of a single image.
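
The path scheme from the example can be sketched in Python as follows. The function name `patch_path` and its parameters are hypothetical, and the two-level prefix (first two and next two characters of the UUID) is inferred from the single example path above rather than from a stated rule.

```python
import uuid


def patch_path(image_uuid, model_id, extension="jpg", prefix="patches"):
    """Build a patch path from an image UUID and a model ID.

    The two directory levels below the prefix are the first two and the
    next two characters of the UUID (inferred from the example path).
    """
    # Parsing validates the UUID and normalises it to lowercase with hyphens.
    canonical = str(uuid.UUID(image_uuid))
    return (
        f"{prefix}/{canonical[0:2]}/{canonical[2:4]}/"
        f"{canonical}/{int(model_id)}.{extension}"
    )


print(patch_path("aa141561-5b7c-43f1-8f6d-75795eaa1b03", 12345))
# patches/aa/14/aa141561-5b7c-43f1-8f6d-75795eaa1b03/12345.jpg
```

Because the path depends only on values that are already stored (the image UUID and the model ID), no extra UUID per proposal or candidate is needed.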

This can be implemented as a local storage disk that is made public, like the image thumbnails are now. Or it can be a public cloud storage disk, accessed through an Nginx reverse proxy.

mzur commented 5 years ago

This has been implemented with #35.