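## R code for checking files in both dirs
## List every file and directory (as relative paths) under each copy
f1 <- list.files('/dcl01/lieber/ajaffe/lab/brainseq_phase2', recursive = TRUE, include.dirs = TRUE)
f2 <- list.files('/dcl01/ajaffe/data/lab/brainseq_phase2', recursive = TRUE, include.dirs = TRUE)
## Paths present in both locations
f3 <- intersect(f1, f2)
## Paths unique to each location
f1b <- f1[!f1 %in% f3]
f2b <- f2[!f2 %in% f3]
## Counts: all of lieber, lieber only, all of ajaffe, ajaffe only, shared
length(f1)
length(f1b)
length(f2)
length(f2b)
length(f3)
## Most common file extensions among the paths unique to each location
head(sort(table(gsub('.*\\.', '', f1b)), decreasing = TRUE), n = 30)
head(sort(table(gsub('.*\\.', '', f2b)), decreasing = TRUE), n = 30)
## Same breakdown, restricted to ajaffe-only paths under preprocessed_data
length(f2b[grep('preprocessed_data', f2b)])
head(sort(table(gsub('.*\\.', '', f2b[grep('preprocessed_data', f2b)])), decreasing = TRUE), n = 30)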
## Files in /dcl01/lieber
> length(f1)
[1] 131750
## Files in /dcl01/lieber not in /dcl01/ajaffe
> length(f1b)
[1] 18201
## Files in /dcl01/ajaffe
> length(f2)
[1] 118967
## Files in /dcl01/ajaffe not in /dcl01/lieber
> length(f2b)
[1] 5418
## Files in both locations
> length(f3)
[1] 113549
## Most common file extensions from files in /dcl01/lieber not in /dcl01/ajaffe
> head(sort(table(gsub('.*\\.', '', f1b)), decreasing = TRUE), n = 10)
 png  txt  tsv   gz html json counts summary  fo zip
5269 2985 1096 1012  809  606    406     406 404 404
## Most common file extensions from files in /dcl01/ajaffe not in /dcl01/lieber
> head(sort(table(gsub('.*\\.', '', f2b)), decreasing = TRUE), n = 10)
 png  bam  bw txt tsv  gz html json counts fo
1120 1032 589 588 442 308  160  120     80 80
## Most common file extensions from files in /dcl01/ajaffe not in /dcl01/lieber under the preprocessed_data dir
> head(sort(table(gsub('.*\\.', '', f2b[grep('preprocessed_data', f2b)])), decreasing = TRUE))
bam  bw  gz rda preprocessed_data/Hippo_Dropped/merged_fastq
992 526 108   3                                            1
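The last entry in the output above is a full path rather than an extension: gsub('.*\\.', '', x) returns its input unchanged when the path contains no dot (here probably a directory, since include.dirs = TRUE lists those too). A minimal sketch of an alternative, assuming it is acceptable to group such paths under one label; tabulate_ext and the "(none)" label are hypothetical names, not part of the code above:

```r
## Hypothetical helper (not in the original code): tabulate file extensions,
## grouping paths that have no extension under a single label.
tabulate_ext <- function(paths, n = 10, none_label = "(none)") {
    ext <- tools::file_ext(paths)   # "" when the file name has no extension
    ext[ext == ""] <- none_label
    head(sort(table(ext), decreasing = TRUE), n = n)
}

## Toy input; on the cluster this would be f1b, f2b, or a subset of them
toy <- c(
    "a/x.bam",
    "a/y.bam",
    "b/z.bw",
    "preprocessed_data/Hippo_Dropped/merged_fastq"
)
tabulate_ext(toy)
```

Note that tools::file_ext() only recognises alphanumeric extensions, so the counts could differ slightly from the gsub() version for unusual file names.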
Takeaways
/dcl01/ajaffe has older files than /dcl01/lieber.
/dcl01/ajaffe has way more files under the degradation directory than /dcl01/lieber.
/dcl01/ajaffe has way more files under the preprocessed_data directory: likely BAM and BigWig files we deleted in /dcl01/lieber already.
I imagine that we don't have any files in /dcl01/ajaffe that we want to keep and don't have at /dcl01/lieber. If so, we can simply delete /dcl01/ajaffe/data/lab/brainseq_phase2 and gain 39 TB there.
But I don't know if @andrewejaffe @emilyburke or anyone else deleted files in /dcl01/lieber/ajaffe/lab/brainseq_phase2 since 2017 knowing that there was a copy in /dcl01/ajaffe/data/lab/brainseq_phase2 that we'd want to keep. If so, we need to dig deeper into all the files. Or we could maybe do 2 rsyncs:
rsync from /dcl01/lieber to /dcl01/ajaffe (assumption: any duplicated file that is not equal would be newer in /dcl01/lieber)
then rsync from /dcl01/ajaffe to /dcl01/lieber to get all the files that were deleted in /dcl01/lieber that we might want to keep.
then delete again, from /dcl01/lieber, the files we really don't want anywhere at all
finally, delete /dcl01/ajaffe to save the 39+ TB
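The first rsync relies on the assumption that any file that exists in both places but differs is newer in /dcl01/lieber. Below is a rough sketch of how that could be checked in R before syncing anything. compare_common is a hypothetical helper, and comparing sizes is only a cheap proxy for "not equal" (tools::md5sum() would be stricter, but slow for over 100,000 files). The demo uses toy directories under tempdir(); on the cluster the inputs would be the two brainseq_phase2 roots and f3.

```r
## Hypothetical helper (not in the original code): for paths present under both
## roots, report whether the sizes match and whether the root1 copy is newer.
compare_common <- function(root1, root2, common) {
    i1 <- file.info(file.path(root1, common))
    i2 <- file.info(file.path(root2, common))
    out <- data.frame(
        file = common,
        same_size = i1$size == i2$size,
        newer_in_root1 = i1$mtime > i2$mtime,
        stringsAsFactors = FALSE
    )
    ## include.dirs = TRUE also lists directories; drop them here
    out[!(i1$isdir %in% TRUE), ]
}

## Toy demo: one shared file, with the root2 copy made one hour older
root1 <- file.path(tempdir(), "lieber_toy")
root2 <- file.path(tempdir(), "ajaffe_toy")
dir.create(root1, showWarnings = FALSE)
dir.create(root2, showWarnings = FALSE)
writeLines("new version", file.path(root1, "a.txt"))
writeLines("old", file.path(root2, "a.txt"))
Sys.setFileTime(file.path(root2, "a.txt"), Sys.time() - 3600)

common <- intersect(
    list.files(root1, recursive = TRUE, include.dirs = TRUE),
    list.files(root2, recursive = TRUE, include.dirs = TRUE)
)
cmp <- compare_common(root1, root2, common)

## Rows here would contradict the assumption: differing size, not newer in root1
violations <- cmp[!cmp$same_size & !cmp$newer_in_root1, ]
violations
```

If violations comes back empty on the real data, the two-rsync plan looks safer. Any rows it does return are files worth inspecting by hand first.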