arschat opened 7 months ago
Summary in spreadsheet:

| | Number of Projects | Size in TiB |
|---|---|---|
| all dirs & files in bucket | 449 | 197.65 |
| non-bionetwork list | 299 | 122.68 |
| non-bionetwork list & non-hca publication | 260 | 95.64 |
| backup projects | 12 | 8.87 |
| not in DCP (-submitted for next release) | 27 | 13.48 |
| not in ingest | 14 | 8.87 |
| has open submission | 40 | 13.46 |
We don't have a specific target for storage area reduction, but we'll do a first pass targeting a 30% reduction (about 59 TiB of the 197.65 TiB total). This way we can free up some space while minimising the time spent triaging the areas to be removed.
After we're done with this first pass I'll check in with Mary and Travis.
Candidates for the first pass
All areas need to be checked except for hca-publications.
Triage of areas:
Remove the non-bionetwork list & organ-of-known-bionetworks projects.
Below is the list of UUIDs that satisfy the following criteria (a sketch of the filter is included after the counts):
- projectTitle != FALSE
- hasOpenSubmission == FALSE
- notAtlas == TRUE
- nextRelease == FALSE
101 projects -> 50.35 TiB
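For illustration, here's a minimal pandas sketch of that filter, assuming the tracking spreadsheet is exported as CSV; the file name and the `uuid`/`sizeTiB` columns are assumptions, and only the four flag columns come from the criteria above.

```python
import pandas as pd

# Assumed CSV export of the tracking spreadsheet (hypothetical file name).
df = pd.read_csv("staging_area_projects.csv")

candidates = df[
    (df["projectTitle"] != False)         # has a real project title
    & (df["hasOpenSubmission"] == False)  # no open submission
    & (df["notAtlas"] == True)            # not part of an atlas
    & (df["nextRelease"] == False)        # not submitted for the next release
]

# Expect 101 projects / 50.35 TiB per the counts above.
print(len(candidates), "projects,", round(candidates["sizeTiB"].sum(), 2), "TiB")
print("\n".join(candidates["uuid"]))
```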
And here is the list of the projects that are backups or integration tests and are safe to remove:
- safe for deletion == yes
- contents == Integration Test

10 projects -> 6.58 TiB
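The same filter sketch above applies to this group, just swapping the conditions for `safe for deletion == yes` and `contents == Integration Test` (column names as they appear in the spreadsheet).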
There is also one project that has been permanently deleted from the DCP, which is probably safe to remove from here too:
dd7ada84-3f14-4765-b7ce-9b64642bb3dc
1 project -> 1.14 TiB
The sum of those three options is 112 projects with a total size of 58.07 TiB, which is 24.94% of all projects and 29.38% of the total size.
In DCP Demo today, there was interest in the dev staging area size and whether we can reduce it.

| Metric | Value |
|---|---|
| Sum (TiB) | 4.64 |
| Number of Projects | 810 |
| >1 TiB | 1 |
| >1 GiB | 19 |
| >1 MiB | 104 |
| >1 KiB | 685 |
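As a rough sketch (not the script actually used), the threshold rows can be derived from a per-project size mapping like this; `sizes_bytes` is a hypothetical input, e.g. parsed from `gsutil du -s` output, and the counts are cumulative, as the table suggests.

```python
# Hypothetical input: {project_uuid: size_in_bytes}, e.g. parsed
# from `gsutil du -s` output for each project directory.
TIB, GIB, MIB, KIB = 2**40, 2**30, 2**20, 2**10

def bucket_counts(sizes_bytes: dict[str, int]) -> dict[str, int]:
    # Cumulative counts: a 2 TiB project is included in every row.
    thresholds = [(">1 TiB", TIB), (">1 GiB", GIB),
                  (">1 MiB", MIB), (">1 KiB", KIB)]
    return {label: sum(1 for size in sizes_bytes.values() if size > limit)
            for label, limit in thresholds}
```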
ida to check if we've reduced the volume of data by enough
I've confirmed that the storage we've freed up is enough for now
Import team requested more free space. Re-opening to investigate options.
Did some more digging. From the list they provided, I created Sheet6 in the previous spreadsheet.
I wanted to investigate the number of projects for which we hold all the data on our AWS servers alongside the GCP staging area. Scripts used: aws_staging.txt gsutil_staging.txt
| AWS vs GCP file count | Number of projects |
|---|---|
| equal (==) | 206 |
| not equal (!=) | 136 |
| no info | 30 |
| no files in AWS | 126 |
Since in the GCP area we upload the spreadsheet as a supplementary file, I extracted the number of filenames matching the `*metadata*xlsx` pattern and subtracted it from the total number of files in GCP before comparing (a rough sketch of the comparison follows).
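The real logic is in aws_staging.txt and gsutil_staging.txt (not reproduced here); a rough Python equivalent of the comparison might look like the sketch below. The AWS bucket name and the per-project prefix layout are assumptions; the GCP bucket is the prod staging bucket mentioned below.

```python
import fnmatch
import boto3
from google.cloud import storage

def aws_file_count(bucket: str, prefix: str) -> int:
    # Count objects under the project's prefix in S3.
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    return sum(len(page.get("Contents", [])) for page in pages)

def gcp_file_count(bucket: str, prefix: str) -> int:
    # Count objects in the GCP staging area, excluding the supplementary
    # metadata spreadsheets we upload there.
    blobs = storage.Client().list_blobs(bucket, prefix=prefix)
    return sum(1 for b in blobs if not fnmatch.fnmatch(b.name, "*metadata*xlsx"))

def compare(project_uuid: str) -> str:
    # The AWS bucket name here is hypothetical.
    aws = aws_file_count("example-aws-staging-bucket", f"{project_uuid}/")
    gcp = gcp_file_count("broad-dsp-monster-hca-prod-ebi-storage", f"prod/{project_uuid}/")
    return "==" if aws == gcp else "!="
```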
Projects in the first group are potentially safer to delete from staging, since we hold all the data needed to re-export everything if an update is needed.
There was a request to reduce the size of the prod Google bucket staging area (i.e. `gs://broad-dsp-monster-hca-prod-ebi-storage/prod/`). If we remove a project from the bucket, we won't be able to do a partial update on that project.
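For context, a minimal sketch of how per-project usage in that bucket could be tallied (assuming one top-level directory per project UUID under `prod/`; this mirrors what `gsutil du -s` reports):

```python
from collections import defaultdict
from google.cloud import storage

client = storage.Client()
sizes: dict[str, int] = defaultdict(int)
for blob in client.list_blobs("broad-dsp-monster-hca-prod-ebi-storage", prefix="prod/"):
    parts = blob.name.split("/")  # prod/<project-uuid>/<file...>
    if len(parts) > 2:
        sizes[parts[1]] += blob.size or 0

# Largest projects first, in TiB.
for uuid, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{size / 2**40:8.2f} TiB  {uuid}")
```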
Action points: