berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.

Implement a quota system #40

Closed vladimirg closed 7 years ago

vladimirg commented 8 years ago

As we've already hit the storage limit on lovelace, we should implement a quota system per user so that we can continue to grow.

Currently, a lot of intermediate files are kept - FASTQs, BAMs, pileups, etc. They are large, and they may not be needed. The files that are needed, besides the output images and the dataset configuration (useful for debugging), are the final SNP/CNV results, which can later be used to specify the dataset as a parent or to construct a hapmap. However, we do want to keep the original input files (whatever they are) in case of an error. Since the analysis is deterministic, an error that occurs in production should be reproducible in other environments, so the original input is all that's needed.

Estimating that a clean dataset will weigh no more than a few hundred MB, a quota of 25 GB per user can support at least 50 analyzed datasets. We can also add an option to download the dataset results if a user requires more projects (or increase the quota for that particular user).

Tasks:

After this task is complete, we should do a one-time sweep of lovelace and remove old dataset input and intermediate files.

ghost commented 8 years ago

I've pushed the local development branch Feature_Add-QuotaSystem. The feature seems to work fine, but I still need to check a bug where the cleaning stage is sometimes not performed when running WG with a parent. It doesn't happen every time; so far it has only occurred when running multiple datasets in parallel (and even then, not always).

Summary of changes:

1. Changed the URL in the "how to import" instructions to point to the project in the berman-lab GitHub.
2. Quota system: searches for complete.txt files to calculate each user's current usage (a rough sketch of this logic follows the list).
   - In each user folder, it looks for quota.txt to determine a user-specific quota.
   - A hardcoded default of 25 GB, located in constants.php, applies to all users who don't have a specific quota.
   - Made the entire site refresh after a deletion (to update the current-size calculation). Note: this should be reconsidered, since a refresh will interfere with an active upload.
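
For reference, here is a rough sketch of that quota logic. The function names, paths, and structure are illustrative only, not the actual YMAP code; it only assumes the complete.txt / quota.txt conventions and the 25 GB default described above.

```php
<?php
// Default quota in GB, analogous to the hardcoded value in constants.php.
define('DEFAULT_QUOTA_GB', 25);

// Sum the sizes of all completed datasets for a user, i.e. dataset folders
// that contain a complete.txt marker file.
function getUserUsageGB($userDir) {
    $bytes = 0;
    foreach (glob($userDir . '/*', GLOB_ONLYDIR) as $datasetDir) {
        if (!file_exists($datasetDir . '/complete.txt')) {
            continue; // skip datasets that are still being processed
        }
        $files = new RecursiveIteratorIterator(
            new RecursiveDirectoryIterator($datasetDir, FilesystemIterator::SKIP_DOTS)
        );
        foreach ($files as $file) {
            $bytes += $file->getSize();
        }
    }
    return $bytes / (1024 ** 3);
}

// Per-user quota read from quota.txt in the user folder, falling back to the
// hardcoded default when the file is absent or unreadable.
function getUserQuotaGB($userDir) {
    $quotaFile = $userDir . '/quota.txt';
    if (is_readable($quotaFile)) {
        $value = floatval(trim(file_get_contents($quotaFile)));
        if ($value > 0) {
            return $value;
        }
    }
    return DEFAULT_QUOTA_GB;
}

// Example use: refuse a new upload when the user is at or over quota.
$userDir   = '/home/ymap/users/example_user';   // hypothetical path
$overQuota = getUserUsageGB($userDir) >= getUserQuotaGB($userDir);
```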

These are the files in each dataset that are now deleted (or compressed) in order to save space (a cleanup sketch follows the list):

  1. datafiles (meaning fastq files) and datafile.txt
  2. data.pileup
  3. SNP_CNV_v1.txt zipped to SNP_CNV_v1.zip (for hapmap) and deleted
  4. data.bam
  5. data_sorted.bam and data_sorted.bam.bai
  6. data_indelRealigned.bam and data_indelRealigned.bam.bai
  7. putative_SNPs_v4.txt zipped to putative_SNPs_v4.zip (for hapmap use) and then deleted. For RADseq datasets it is not zipped, since it may be needed as a parent, so putative_SNPs_v4.txt is left in the genome folder.
  8. .repetitiveness.txt
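
For illustration, a cleanup step along these lines could look like the sketch below. This is hypothetical code mirroring the list above, not the actual pipeline scripts; the helper names and the $isRadseq flag are made up for the example.

```php
<?php
// Hypothetical cleanup for a finished dataset folder, mirroring the list above.
// Paths, names, and the $isRadseq flag are illustrative only.

function removeIfExists($path) {
    if (file_exists($path)) {
        unlink($path);
    }
}

// Compress a file to <name>.zip, then delete the original.
function zipAndRemove($path) {
    if (!file_exists($path)) {
        return;
    }
    $zip = new ZipArchive();
    if ($zip->open($path . '.zip', ZipArchive::CREATE) === true) {
        $zip->addFile($path, basename($path));
        $zip->close();
        unlink($path); // keep only the compressed copy
    }
}

function cleanDataset($datasetDir, $isRadseq) {
    // 1. Raw reads and their listing.
    foreach (glob($datasetDir . '/*.fastq*') as $fastq) {
        unlink($fastq);
    }
    removeIfExists($datasetDir . '/datafile.txt');

    // 2, 4-6. Pileup and BAM intermediates.
    foreach (array('data.pileup',
                   'data.bam',
                   'data_sorted.bam', 'data_sorted.bam.bai',
                   'data_indelRealigned.bam', 'data_indelRealigned.bam.bai') as $name) {
        removeIfExists($datasetDir . '/' . $name);
    }

    // 3. SNP/CNV table: compress for later hapmap use, then delete the original.
    zipAndRemove($datasetDir . '/SNP_CNV_v1.txt');

    // 7. Putative SNPs: compress unless this is a RADseq dataset, where the
    //    uncompressed file may still be needed as a parent.
    if (!$isRadseq) {
        zipAndRemove($datasetDir . '/putative_SNPs_v4.txt');
    }

    // 8. Repetitiveness files.
    foreach (glob($datasetDir . '/*.repetitiveness.txt') as $rep) {
        unlink($rep);
    }
}
```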

No need to delete anything in hapmaps, since they don't take up much space.

vladimirg commented 8 years ago

One issue that popped up: when a dataset finishes, it doesn't update the usage stats. We could use JavaScript to update that. While we're at it, we can also make deleting a dataset not refresh the page, so as not to interfere with any uploads, while still updating the usage stats.
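
A minimal sketch of the server side of that idea, assuming the getUserUsageGB()/getUserQuotaGB() helpers from the earlier sketch: a hypothetical usage_stats.php endpoint returns the numbers as JSON, so client-side JavaScript can poll it and refresh the displayed usage without reloading the page. The endpoint name, include file, and session key are all assumptions for the example.

```php
<?php
// usage_stats.php: hypothetical endpoint returning the current usage for the
// logged-in user as JSON, so the page can update the numbers via JavaScript
// instead of a full reload.
require 'quota_functions.php';   // hypothetical file holding the earlier sketch
session_start();

header('Content-Type: application/json');
$userDir = $_SESSION['user_dir'];   // assumption: set at login
echo json_encode(array(
    'used_gb'  => round(getUserUsageGB($userDir), 2),
    'quota_gb' => getUserQuotaGB($userDir),
));
```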

vladimirg commented 7 years ago

As the quota system (the first task) is live and working well in production, this issue was closed and the two remaining tasks were split into separate issues: #54 and #55.