LieberInstitute / brainseq_phase2

BrainSeq Phase II project led by LIBD for the BrainSeq Consortium
http://eqtl.brainseq.org/

BSP2 files in /dcl01/lieber and /dcl01/ajaffe #30

Closed lcolladotor closed 5 years ago

lcolladotor commented 5 years ago

/dcl01/lieber

$ du -sh --apparent-size /dcl01/lieber/ajaffe/lab/brainseq_phase2
date
27T        .

$ du -sh --apparent-size /dcl01/lieber/ajaffe/lab/brainseq_phase2/*
49K        BrainSeq_Phase2_phenotype_data_small_n900.csv
257        brainseq_phase2.Rproj
130M        brainspan
71K        browser
110M        caseControl
3.4K        caseControl_analysis_DLPFC.R
3.3K        caseControl_analysis_hippo.R
321M        caseControl_HIPPO_checks
2.5G        casecontrolint
454K        cellComp
7.6G        correlation
2.3G        count_data
88M        degradation
7.3M        degradation_strand_minus
7.6M        degradation_strand_positive
2.0M        demographics
22G        development
3.2G        eQTL_DNAm_mediation
20G        eQTL_full
63G        eQTL_full_GTEx
560M        eQTL_GWAS_riskSNPs
1.6K        eqtl_stats.R
8.7G        expr_cutoff
19G        genotype_data
6.6G        genotype_data_carrie_n410
6.4K        get_degradation_regions.R
1.6K        get_joint_degradation_regions.R
456G        gtex
784M        gtex_both
556G        gtex_dlpfc
209K        jxns
104K        LIBD_PhaseII_HIPPO_RiboZero_sample_list_01_28_2015.xlsx
33K        pdf
26T        preprocessed_data
60M        psychENCODE
3.3K        pull_genotype_data.R
142K        qcChecks
130        README.md
16G        region_specific
6.0K        samples_to_extract.txt
11K        subset_samples_dlpfc.R
7.9K        subset_samples_hippo.R
93G        wgcna
52G        wgcna_combined

$ du -sh --apparent-size /dcl01/lieber/ajaffe/lab/brainseq_phase2/preprocessed_data/*
109K        DLPFC_PhaseII_sample-selection_04_20_2015.csv
14T        DLPFC_RiboZero
558G        Hippo_Dropped
12T        Hippo_RiboZero
$ ls -lh /dcl01/ajaffe/data/lab/brainseq_phase2/preprocessed_data/
total 173K
-rw-rw-r-- 1 ajaffe lieber_jaffe 109K Apr 18 2017 DLPFC_PhaseII_sample-selection_04_20_2015.csv
drwxrwxr-x 15 ajaffe lieber_jaffe 41K Aug 24 2017 DLPFC_RiboZero
drwxrwx--- 10 ajaffe lieber_jaffe 33K Jun 22 2017 Hippo_Dropped
drwxrwxr-x 13 ajaffe lieber_jaffe 41K Jul 1 2017 Hippo_RiboZero

/dcl01/ajaffe

$ du -sh --apparent-size /dcl01/ajaffe/data/lab/brainseq_phase2
39T        .
$ du -sh --apparent-size /dcl01/ajaffe/data/lab/brainseq_phase2/*
2.4M        caseControl
1.9K        caseControl_analysis_hippo.R
3.3G        count_data
427G        degradation
330        eQTL_dlpfc.sh
338        eQTL_hippo.sh
344        eQTL_interaction.sh
5.6G        eqtl_tables
11G        genotype_data
5.4K        get_degradation_regions.R
39T        preprocessed_data
2.8K        pull_genotype_data.R
143K        qcChecks
1.4M        rdas
130        README.md
8.8K        run_eqtls_dlpfc.R
8.8K        run_eqtls_hippo.R
12K        run_eqtls_interaction.R
6.0K        samples_to_extract.txt
9.3K        subset_samples_dlpfc.R
8.5K        subset_samples_hippo.R
$ ls -lh
total 376K
drwxrws--- 2 ajaffe lieber_jaffe 33K Jul 15 2017 caseControl
-rwxrwx--- 1 ajaffe lieber_jaffe 1.9K Jul 15 2017 caseControl_analysis_hippo.R
drwxrws--- 2 ajaffe lieber_jaffe 33K Aug 24 2017 count_data
drwxrws--- 5 ajaffe lieber_jaffe 89K Jul 14 2017 degradation
-rw-rw-r-- 1 ajaffe lieber_jaffe 330 Aug 21 2017 eQTL_dlpfc.sh
-rw-rw-r-- 1 ajaffe lieber_jaffe 338 Aug 21 2017 eQTL_hippo.sh
-rw-rw-r-- 1 ajaffe lieber_jaffe 344 Sep 25 2017 eQTL_interaction.sh
drwxrwsr-x 4 ajaffe lieber_jaffe 33K Aug 26 2017 eqtl_tables
drwxrws--- 2 ajaffe lieber_jaffe 33K Jun 22 2017 genotype_data
-rwxrwx--- 1 ajaffe lieber_jaffe 5.4K Jul 17 2017 get_degradation_regions.R
drwxrwxr-x 5 ajaffe lieber_jaffe 33K Jul 1 2017 preprocessed_data
-rwxrwx--- 1 ajaffe lieber_jaffe 2.8K Aug 21 2017 pull_genotype_data.R
drwxrws--- 2 ajaffe lieber_jaffe 33K Jun 22 2017 qcChecks
drwxrwsr-x 2 ajaffe lieber_jaffe 33K Jul 27 2017 rdas
-rw-rw---- 1 ajaffe lieber_jaffe 130 Jan 11 2017 README.md
-rw-rw-r-- 1 ajaffe lieber_jaffe 8.8K Jul 24 2017 run_eqtls_dlpfc.R
-rw-rw-r-- 1 ajaffe lieber_jaffe 8.8K Jul 24 2017 run_eqtls_hippo.R
-rw-rw-r-- 1 ajaffe lieber_jaffe 12K Sep 25 2017 run_eqtls_interaction.R
-rwxrwx--- 1 ajaffe lieber_jaffe 6.0K Aug 21 2017 samples_to_extract.txt
-rwxrwx--- 1 ajaffe lieber_jaffe 9.3K Jun 22 2017 subset_samples_dlpfc.R
-rwxrwx--- 1 ajaffe lieber_jaffe 8.5K Jun 22 2017 subset_samples_hippo.R

$ du -sh --apparent-size /dcl01/ajaffe/data/lab/brainseq_phase2/preprocessed_data/*
109K        DLPFC_PhaseII_sample-selection_04_20_2015.csv
21T        DLPFC_RiboZero
1.2T        Hippo_Dropped
17T        Hippo_RiboZero
$ ls -lh /dcl01/ajaffe/data/lab/brainseq_phase2/preprocessed_data
total 173K
-rw-rw-r-- 1 ajaffe lieber_jaffe 109K Apr 18 2017 DLPFC_PhaseII_sample-selection_04_20_2015.csv
drwxrwxr-x 15 ajaffe lieber_jaffe 41K Aug 24 2017 DLPFC_RiboZero
drwxrwx--- 10 ajaffe lieber_jaffe 33K Jun 22 2017 Hippo_Dropped
drwxrwxr-x 13 ajaffe lieber_jaffe 41K Jul 1 2017 Hippo_RiboZero

Find all files

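## R code for checking files in both dirs
## List every path under each copy, relative to its root.
## include.dirs = TRUE means directories are counted along with files.
f1 <- list.files('/dcl01/lieber/ajaffe/lab/brainseq_phase2', recursive = TRUE, include.dirs = TRUE)
f2 <- list.files('/dcl01/ajaffe/data/lab/brainseq_phase2', recursive = TRUE, include.dirs = TRUE)

## Paths present in both copies, then the paths unique to each copy
f3 <- intersect(f1, f2)
f1b <- f1[!f1 %in% f3]
f2b <- f2[!f2 %in% f3]

## Number of paths in each set
length(f1)
length(f1b)
length(f2)
length(f2b)
length(f3)

## Most common file extensions among the paths unique to each copy.
## The regex strips everything up to the last '.', so a path without a '.' is kept whole.
head(sort(table(gsub('.*\\.', '', f1b)), decreasing = TRUE), n = 30)
head(sort(table(gsub('.*\\.', '', f2b)), decreasing = TRUE), n = 30)

## Same summary, restricted to preprocessed_data in /dcl01/ajaffe
length(f2b[grep('preprocessed_data', f2b)])
head(sort(table(gsub('.*\\.', '', f2b[grep('preprocessed_data', f2b)])), decreasing = TRUE), n = 30)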
## Files in /dcl01/lieber
> length(f1)
[1] 131750
## Files in /dcl01/lieber not in /dcl01/ajaffe
> length(f1b)
[1] 18201
## Files in /dcl01/ajaffe
> length(f2)
[1] 118967
## Files in /dcl01/ajaffe not in /dcl01/lieber
> length(f2b)
[1] 5418
## Files in both locations
> length(f3)
[1] 113549

## Most common file extensions from files in /dcl01/lieber not in /dcl01/ajaffe
> head(sort(table(gsub('.*\\.', '', f1b)), decreasing = TRUE), n = 10)

png  txt  tsv  gz   html json counts summary fo  zip
5269 2985 1096 1012 809  606  406    406     404 404
## Most common file extensions from files in /dcl01/ajaffe not in /dcl01/lieber
> head(sort(table(gsub('.*\\.', '', f2b)), decreasing = TRUE), n = 10)

png  bam  bw  txt tsv gz  html json counts fo
1120 1032 589 588 442 308 160  120  80     80
## Most common file extensions in /dcl01/ajaffe not in /dcl01/lieber under the preprocessed_data dir
> head(sort(table(gsub('.*\\.', '', f2b[grep('preprocessed_data', f2b)])), decreasing = TRUE))

bam bw  gz  rda preprocessed_data/Hippo_Dropped/merged_fastq
992 526 108 3   1
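
The tallies above count paths, not bytes. To gauge how much data exists only under /dcl01/ajaffe, a shell sketch along these lines could total the sizes of those files. The helper names `only_in_second` and `unique_bytes` are made up for illustration, and the approach assumes that no file name contains a newline.

```shell
#!/bin/sh
# Hypothetical helpers (not from the repo): compare two directory trees by relative path.

# Print regular files found under $2 but not under $1, as paths relative to the roots.
only_in_second() {
    list_a=$(mktemp)
    list_b=$(mktemp)
    # comm needs both inputs sorted in the same collation, hence LC_ALL=C throughout.
    (cd "$1" && find . -type f | LC_ALL=C sort) > "$list_a"
    (cd "$2" && find . -type f | LC_ALL=C sort) > "$list_b"
    # -13 hides lines unique to the first list and lines common to both.
    LC_ALL=C comm -13 "$list_a" "$list_b"
    rm -f "$list_a" "$list_b"
}

# Sum the sizes in bytes of the files that only_in_second reports.
unique_bytes() {
    only_in_second "$1" "$2" | while IFS= read -r rel; do
        wc -c < "$2/$rel"
    done | awk '{ total += $1 } END { printf "%.0f\n", total }'
}

# Example call with the two roots from this issue (left commented out):
# unique_bytes /dcl01/lieber/ajaffe/lab/brainseq_phase2 /dcl01/ajaffe/data/lab/brainseq_phase2
```

Unlike the R listing, this only considers regular files (`-type f`), so directory entries such as merged_fastq would not appear in the result.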

Takeaways

I suspect that /dcl01/ajaffe holds no files that we want to keep and that are missing from /dcl01/lieber. If so, we can simply delete /dcl01/ajaffe/data/lab/brainseq_phase2 and free up 39 TB there.

But I don't know whether @andrewejaffe, @emilyburke, or anyone else has deleted files from /dcl01/lieber/ajaffe/lab/brainseq_phase2 since 2017, knowing that there was a copy in /dcl01/ajaffe/data/lab/brainseq_phase2 that we'd want to keep. If so, we need to dig deeper into all the files. Or we could maybe do 2 rsyncs:
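
A hedged sketch of what such a pair could look like, run as dry runs first so that nothing is copied or deleted. This is an assumption about the intent, not the commands actually used, and `preview_sync` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper (not from the issue): show what rsync WOULD copy from $1 into $2.
# --dry-run means nothing is written; --itemize-changes prints one line per difference.
preview_sync() {
    rsync --archive --dry-run --itemize-changes "$1/" "$2/"
}

# Possible use with the two copies from this issue, one call per direction
# (left commented out):
# preview_sync /dcl01/ajaffe/data/lab/brainseq_phase2 /dcl01/lieber/ajaffe/lab/brainseq_phase2
# preview_sync /dcl01/lieber/ajaffe/lab/brainseq_phase2 /dcl01/ajaffe/data/lab/brainseq_phase2
```

By default rsync decides that a file differs from its size and modification time. Adding `--checksum` compares contents instead, at the cost of reading every file on both sides, which would be slow for tens of terabytes.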

lcolladotor commented 5 years ago

drop this for now