berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.

Implement a quota system #40

Closed vladimirg closed 7 years ago

vladimirg commented 8 years ago

As we've already hit the storage limit on lovelace, we should implement a quota system per user so that we can continue to grow.

Currently, a lot of intermediate files are kept - FASTQs, BAMs, pileups, etc. They are large, and they may not be needed. The files that are needed, besides the output images and the dataset configuration (useful for debugging), are the final SNP/CNV results, which can later be used to specify the dataset as a parent or to construct a hapmap. However, we do want to keep the original input files (whatever they are) in case of an error. Since the analysis is deterministic, an error that occurs in production should be reproducible in other environments, so the original input is all that's needed.

Estimating that a clean dataset will weigh no more than a few hundred MB, a quota of 25 GB per user can support at least 50 analyzed datasets. We can also add an option to download the dataset results if a user requires more projects (or increase the quota for that particular user).

Tasks:

After this task is complete, we should do a one-time sweep of lovelace and remove old dataset input and intermediate files.

ghost commented 8 years ago

I've pushed the local development branch Feature_Add-QuotaSystem. The feature seems to work fine, but I still need to check a bug where the cleaning stage is sometimes not performed when running WG with a parent. It doesn't happen every time; so far it has only occurred when running multiple datasets in parallel (and even then, not always).

Summary of changes:

1. Changed the URL in the "how to import" instructions to point to the project in the berman-lab GitHub.
2. Quota system: searches for complete.txt files to calculate each user's current usage (a rough sketch of this logic follows the list).
   - In each user folder, it looks for quota.txt to determine a user-specific quota.
   - A hardcoded default of 25 GB, located in constants.php, applies to all users who don't have a specific quota.
   - Made the entire site refresh after a deletion (to update the current-size calculation). Note: this should be reconsidered, since a refresh will interfere with an active upload.
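
For reference, here is a rough sketch of that quota logic. The function names, paths, and structure are illustrative only, not the actual YMAP code; it only assumes the complete.txt / quota.txt conventions and the 25 GB default described above.

```php
<?php
// Default quota in GB, analogous to the hardcoded value in constants.php.
define('DEFAULT_QUOTA_GB', 25);

// Sum the sizes of all completed datasets for a user, i.e. dataset folders
// that contain a complete.txt marker file.
function getUserUsageGB($userDir) {
    $bytes = 0;
    foreach (glob($userDir . '/*', GLOB_ONLYDIR) as $datasetDir) {
        if (!file_exists($datasetDir . '/complete.txt')) {
            continue; // skip datasets that are still being processed
        }
        $files = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($datasetDir, FilesystemIterator::SKIP_DOTS)
        );
        foreach ($files as $file) {
            $bytes += $file->getSize();
        }
    }
    return $bytes / (1024 ** 3);
}

// Per-user quota read from quota.txt in the user folder, falling back to the
// hardcoded default when the file is absent or unreadable.
function getUserQuotaGB($userDir) {
    $quotaFile = $userDir . '/quota.txt';
    if (is_readable($quotaFile)) {
        $value = floatval(trim(file_get_contents($quotaFile)));
        if ($value > 0) {
            return $value;
        }
    }
    return DEFAULT_QUOTA_GB;
}

// Example use: refuse a new upload when the user is at or over quota.
$userDir   = '/home/ymap/users/example_user';   // hypothetical path
$overQuota = getUserUsageGB($userDir) >= getUserQuotaGB($userDir);
```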

These are the files in each dataset that are now deleted (or compressed) in order to save space (a cleanup sketch follows the list):

  1. datafiles (meaning fastq files) and datafile.txt
  2. data.pileup
  3. SNP_CNV_v1.txt zipped to SNP_CNV_v1.zip (for hapmap) and deleted
  4. data.bam
  5. data_sorted.bam and data_sorted.bam.bai
  6. data_indelRealigned.bam and data_indelRealigned.bam.bai
  7. putative_SNPs_v4.txt zipped to putative_SNPs_v4.zip (for hapmap use) and then deleted. For RADseq datasets it is not zipped, since it may be needed as a parent, so putative_SNPs_v4.txt is left in the genome folder.
  8. .repetitiveness.txt
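
For illustration, a cleanup step along these lines could look like the sketch below. This is hypothetical code mirroring the list above, not the actual pipeline scripts; the helper names and the $isRadseq flag are made up for the example.

```php
<?php
// Hypothetical cleanup for a finished dataset folder, mirroring the list above.
// Paths, names, and the $isRadseq flag are illustrative only.

function removeIfExists($path) {
    if (file_exists($path)) {
        unlink($path);
    }
}

// Compress a file to <name>.zip, then delete the original.
function zipAndRemove($path) {
    if (!file_exists($path)) {
        return;
    }
    $zip = new ZipArchive();
    if ($zip->open($path . '.zip', ZipArchive::CREATE) === true) {
        $zip->addFile($path, basename($path));
        $zip->close();
        unlink($path); // keep only the compressed copy
    }
}

function cleanDataset($datasetDir, $isRadseq) {
    // 1. Raw reads and their listing.
    foreach (glob($datasetDir . '/*.fastq*') as $fastq) {
        unlink($fastq);
    }
    removeIfExists($datasetDir . '/datafile.txt');

    // 2, 4-6. Pileup and BAM intermediates.
    foreach (array('data.pileup',
                   'data.bam',
                   'data_sorted.bam', 'data_sorted.bam.bai',
                   'data_indelRealigned.bam', 'data_indelRealigned.bam.bai') as $name) {
        removeIfExists($datasetDir . '/' . $name);
    }

    // 3. SNP/CNV table: compress for later hapmap use, then delete the original.
    zipAndRemove($datasetDir . '/SNP_CNV_v1.txt');

    // 7. Putative SNPs: compress unless this is a RADseq dataset, where the
    //    uncompressed file may still be needed as a parent.
    if (!$isRadseq) {
        zipAndRemove($datasetDir . '/putative_SNPs_v4.txt');
    }

    // 8. Repetitiveness files.
    foreach (glob($datasetDir . '/*.repetitiveness.txt') as $rep) {
        unlink($rep);
    }
}
```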

No need to delete anything in hapmaps, since they don't take up much space.

vladimirg commented 8 years ago

One issue that popped up: when a dataset finishes, it doesn't update the usage stats. We could use JavaScript to update that. While we're at it, we can also make deleting a dataset not refresh the page, so as not to interfere with any uploads, while still updating the usage stats.
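
A minimal sketch of the server side of that idea, assuming the getUserUsageGB()/getUserQuotaGB() helpers from the earlier sketch: a hypothetical usage_stats.php endpoint returns the numbers as JSON, so client-side JavaScript can poll it and refresh the displayed usage without reloading the page. The endpoint name, include file, and session key are all assumptions for the example.

```php
<?php
// usage_stats.php: hypothetical endpoint returning the current usage for the
// logged-in user as JSON, so the page can update the numbers via JavaScript
// instead of a full reload.
require 'quota_functions.php';   // hypothetical file holding the earlier sketch
session_start();

header('Content-Type: application/json');
$userDir = $_SESSION['user_dir'];   // assumption: set at login
echo json_encode(array(
    'used_gb'  => round(getUserUsageGB($userDir), 2),
    'quota_gb' => getUserQuotaGB($userDir),
));
```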

vladimirg commented 7 years ago

As the quota system (the first task) is live and working well in production, this issue was closed and the two remaining tasks were split into separate issues: #54 and #55.