AlexsLemonade / training-modules

A collection of modules that are combined into 1-5 day workshops on computational topics for the childhood cancer research community.

Discussion: Directory structure for modules on server #246

Open jashapiro opened 4 years ago

jashapiro commented 4 years ago

The current directory structure for the scRNA module set looks something like this:

training-modules/scRNA-seq$ tree
.
├── 00-scRNA-seq_introduction.md
├── 01-normalizing_scRNA-seq.Rmd
├── 01-normalizing_scRNA-seq.nb.html
├── 02-tag-based_pre-processing_scRNA-seq.html
├── 02-tag-based_pre-processing_scRNA-seq.md
├── 03-dimension_reduction_scRNA-seq.Rmd
├── 03-dimension_reduction_scRNA-seq.nb.html
├── 04-scrnaseq_exercise.Rmd
├── README.md
├── data
│   ├── glioblastoma
│   │   └── raw
│   │       ├── unfiltered_darmanis_counts.tsv
│   │       └── unfiltered_darmanis_metadata.tsv
│   └── tabula_muris
│       ├── fastq
│       │   ├── tab_mur_10X_P4_3_L001_R1_subset.fastq.gz
│       │   └── tab_mur_10X_P4_3_L001_R2_subset.fastq.gz
│       ├── normalized
│       │   ├── scran_norm_tab_mur.tsv
│       │   └── tab_mur_metadata.tsv
│       └── qc_reports
│           ├── 10X_P4_3_qc_report.html
│           └── Bad_Example_10X_P4_2_qc_report.html
├── diagrams
├── figures
│   ├── gbm_figure.jpg
│   └── sce_structure.png
├── index
├── scripts
│   ├── gene_matrix_filter.R
│   └── read_alevin.R
└── setup
    ├── README.md
    ├── glioblastoma
    └── tabula-muris

There isn't anything inherently wrong with this, but now that we are using the RStudio server, we plan to store the raw data in /shared/data/ to avoid unnecessary duplication.

I see two options. The first is to leave things mostly as they are, but remove the raw data and put the paths to the shared files directly in the notebooks. That would look something like this, using the current paths in /shared/data:

data_dir <- file.path("~", "shared-data", "training-data", "darmanis")
gene_matrix_file <- file.path(data_dir, "tximport", "count_matrix.tsv")

Note that this uses ~/shared-data/ because that symlink, which we set up for every user, is visible in the RStudio file browser without having to use "Go to folder".

The other option is to add moar symlinks within the module, so the new directory structure might look like this:

├── 00-scRNA-seq_introduction.md
├── 01-normalizing_scRNA-seq.Rmd
├── 01-normalizing_scRNA-seq.nb.html
├── 02-tag-based_pre-processing_scRNA-seq.html
├── 02-tag-based_pre-processing_scRNA-seq.md
├── 03-dimension_reduction_scRNA-seq.Rmd
├── 03-dimension_reduction_scRNA-seq.nb.html
├── 04-scrnaseq_exercise.Rmd
├── README.md
├── analysis
│   └── glioblastoma
│       └── normalized
├── data
│   ├── glioblastoma -> ~/shared-data/training-data/darmanis/
│   └── tabula_muris -> ~/shared-data/training-data/tabula_muris/

This would more closely mirror a real project, as the data would appear to be in the module folder and could be referred to that way. It can also encourage the practice that a data folder is for reading only, and anything you write goes in a separate location (though I am not too strict about this myself, as long as there is a raw subfolder of some kind within data).
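A minimal sketch of how the second option could be set up, assuming directory-level symlinks created with `ln -s`. The temporary directory and the fake `count_matrix.tsv` are stand-ins so the sketch runs anywhere; on the server the shared location would be something like ~/shared-data/training-data, and the exact commands used there may differ.

```shell
set -eu

# Stand-ins for the real locations (assumed, not the actual server paths).
demo_root="$(mktemp -d)"
shared_data="$demo_root/shared-data/training-data"
module_dir="$demo_root/training-modules/scRNA-seq"

# Fake shared data so the sketch is self-contained.
mkdir -p "$shared_data/darmanis/tximport" "$shared_data/tabula_muris"
touch "$shared_data/darmanis/tximport/count_matrix.tsv"

# data/ holds inputs that are only read (as symlinks);
# analysis/ holds anything the notebooks write.
mkdir -p "$module_dir/data" "$module_dir/analysis/glioblastoma/normalized"

# Link whole directories into the module's data folder.
ln -s "$shared_data/darmanis" "$module_dir/data/glioblastoma"
ln -s "$shared_data/tabula_muris" "$module_dir/data/tabula_muris"

# The shared file is now reachable through a module-relative path.
ls "$module_dir/data/glioblastoma/tximport"
```

With links like these, a notebook could refer to `data/glioblastoma/tximport/count_matrix.tsv` relative to the module, without knowing where the shared copy lives.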

Thoughts on which of these two options is preferred, or modifications to the proposed structures?

jaclyn-taroni commented 4 years ago

Here's an abbreviated version of how we handle file paths for RNA-seq:

.
├── 00a-reproducibility_cmdline.md
├── 00b-additional_approaches.md
├── 01-qc_trim_quant.html
├── 01-qc_trim_quant.md
├── 02-gastric_cancer_tximport-live.Rmd
├── 02-gastric_cancer_tximport-live.nb.html
├── 03-gastric_cancer_exploratory-live.Rmd
├── 03-gastric_cancer_exploratory-live.nb.html
├── 03b-exploratory_data_analysis_exercise.Rmd
├── 04-nb_cell_line_tximport.md
├── 05-nb_cell_line_DESeq2-live.Rmd
├── 05-nb_cell_line_DESeq2-live.nb.html
├── 06-bulk_rnaseq_exercise.Rmd
├── RNA-seq.Rproj
├── data
│   ├── fastq
│   │   └── gastric_cancer
│   │       └── SRR585570
│   │           ├── SRR585570_1.fastq.gz -> /shared/data/training-modules/RNA-seq/data/fastq/gastric_cancer/SRR585570/SRR585570_1.fastq.gz
│   │           ├── SRR585570_2.fastq.gz -> /shared/data/training-modules/RNA-seq/data/fastq/gastric_cancer/SRR585570/SRR585570_2.fastq.gz
│   │           ├── SRR585570_fastp_1.fastq.gz
│   │           └── SRR585570_fastp_2.fastq.gz
├── index
│   └── Homo_sapiens
│       ├── Homo_sapiens.GRCh38.95_tx2gene.tsv -> /shared/data/reference/tx2gene/Homo_sapiens.GRCh38.95_tx2gene.tsv
│       └── short_index
│           ├── complete_ref_lens.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//complete_ref_lens.bin
│           ├── ctable.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//ctable.bin
│           ├── ctg_offsets.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//ctg_offsets.bin
│           ├── duplicate_clusters.tsv -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//duplicate_clusters.tsv
│           ├── info.json -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//info.json
│           ├── mphf.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//mphf.bin
│           ├── pos.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//pos.bin
│           ├── pre_indexing.log -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//pre_indexing.log
│           ├── rank.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//rank.bin
│           ├── refAccumLengths.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//refAccumLengths.bin
│           ├── ref_indexing.log -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//ref_indexing.log
│           ├── reflengths.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//reflengths.bin
│           ├── refseq.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//refseq.bin
│           ├── seq.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//seq.bin
│           └── versionInfo.json -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//versionInfo.json
├── plots
│   └── NB_cell_line_heatmap.png
├── results
│   ├── NB_cell_line_DESeq_amplified_v_nonamplified.RDS
│   └── NB_cell_line_DESeq_amplified_v_nonamplified_results.tsv
└── scripts
    └── run_SRR585570.sh

which is more similar to the second option you describe. This was set up, in part, to reduce the modifications to paths we would have to make for virtual vs. in-person workshops that were run on participants' own laptops. That fact is not a vote to set things up this way, but I do like the idea of keeping things similar to what a real project would be like. Setting up symlinks does seem like it may take more time/effort, though.

jashapiro commented 4 years ago

I think that was the way I was leaning, with the difference that I will link at the directory level, rather than separately to each file within the index, for example.
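To illustrate the difference between the two linking styles, here is a small sketch that builds both against a fake index directory. The paths and file names are placeholders chosen for the demo, not the real reference files.

```shell
set -eu

demo_root="$(mktemp -d)"
src="$demo_root/reference/short_index"
mkdir -p "$src" "$demo_root/per-file/short_index" "$demo_root/per-dir"
touch "$src/ctable.bin" "$src/info.json" "$src/pos.bin"

# Per-file linking: one symlink for each file in the source directory.
for f in "$src"/*; do
  ln -s "$f" "$demo_root/per-file/short_index/$(basename "$f")"
done

# Directory-level linking: a single symlink to the directory itself.
ln -s "$src" "$demo_root/per-dir/short_index"

# A file added to the source after the links were made...
touch "$src/seq.bin"

# ...shows up through the directory link but not among the per-file links.
per_file_count="$(ls "$demo_root/per-file/short_index/" | wc -l | tr -d ' ')"
per_dir_count="$(ls "$demo_root/per-dir/short_index/" | wc -l | tr -d ' ')"
echo "per-file links: $per_file_count files; directory link: $per_dir_count files"
```

The trade-off is that a directory-level link cannot share a folder with locally written files, which is where keeping the linked data in its own subfolder helps.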

jaclyn-taroni commented 4 years ago

> I think that was the way I was leaning, with the difference that I will link at the directory level, rather than separately to each file within the index, for example.

And that's what the raw subfolder helps with!

jashapiro commented 4 years ago

I wanted to document the current 2020-july directory structure as we head into training here:

$ tree /shared/data/training-modules/2020-july/
/shared/data/training-modules/2020-july/
├── machine-learning
│   ├── 01-openpbta_heatmap-live.Rmd
│   ├── 02-openpbta_consensus_clustering-live.Rmd
│   ├── 03-openpbta_PLIER-live.Rmd
│   ├── 04-openpbta_plot_LV-live.Rmd
│   ├── 05-machine_learning_exercise.Rmd
│   ├── data
│   │   ├── GSE116436 -> /shared/data/training-data/machine-learning/data/GSE116436/
│   │   └── open-pbta
│   │       ├── download -> /shared/data/training-data/machine-learning/data/open-pbta/download
│   │       └── processed
│   │           ├── pbta-histologies-stranded-rnaseq.tsv -> /shared/data/training-data/machine-learning/data/open-pbta/processed/pbta-histologies-stranded-rnaseq.tsv
│   │           └── pbta-vst-stranded.tsv -> /shared/data/training-data/machine-learning/data/open-pbta/processed/pbta-vst-stranded.tsv
│   ├── diagrams
│   │   ├── mao_nature_methods_fig1.png
│   │   ├── monti_gaussian3_cdf_delta.png
│   │   ├── monti_gaussian3_consensus_matrix.png
│   │   └── monti_gaussian5.png
│   ├── machine-learning.Rproj
│   ├── models
│   │   ├── NCI60_PLIER_model.RDS -> /shared/data/training-data/machine-learning/models/NCI60_PLIER_model.RDS
│   │   └── NCI60_prcomp_results.RDS -> /shared/data/training-data/machine-learning/models/NCI60_prcomp_results.RDS
│   └── setup
│       ├── 00-data-download.sh
│       ├── 01-transform-rnaseq.Rmd
│       ├── 01-transform-rnaseq.nb.html
│       └── README.md
├── pathway-analysis
│   ├── 01-overrepresentation_analysis-live.Rmd
│   ├── 02-gene_set_enrichment_analysis-live.Rmd
│   ├── 03-gene_set_variation_analysis-live.Rmd
│   ├── 04-pathway_analysis_exercise.Rmd
│   ├── data -> /shared/data/training-data/pathway-analysis/data/
│   ├── diagrams
│   │   ├── hanzelmann_fig1.jpg
│   │   └── subramanian_fig1.jpg
│   ├── pathway-analysis.Rproj
│   ├── results
│   │   └── gene-metrics
│   │       ├── nb_cell_line_mycn_amplified_v_nonamplified.tsv
│   │       └── pdx_medulloblastoma_treatment_dge.tsv
│   └── setup
│       ├── 01-prepare_NB_cell_line.Rmd
│       ├── 01-prepare_NB_cell_line.nb.html
│       ├── 02-prepare_openpbta_MB_data.Rmd
│       ├── 02-prepare_openpbta_MB_data.nb.html
│       └── README.md
└── scRNA-seq
    ├── 00-scRNA-seq_introduction.html
    ├── 00-scRNA-seq_introduction.md
    ├── 01-filtering_scRNA-seq-live.Rmd
    ├── 02-normalizing_scRNA-seq-live.Rmd
    ├── 03-scrnaseq_day1_exercise.Rmd
    ├── 04-tag-based_scRNA-seq_processing-live.Rmd
    ├── 05-dimension_reduction_scRNA-seq-live.Rmd
    ├── 06-scrnaseq_day2_exercise.Rmd
    ├── README.md
    ├── data
    │   ├── glioblastoma
    │   │   ├── hs_mitochondrial_genes.tsv
    │   │   └── preprocessed -> /shared/data/training-data/darmanis
    │   └── tabula-muris
    │       ├── TM_droplet_metadata.csv
    │       ├── fastq-raw -> /shared/data/training-data/tabula-muris/fastq
    │       ├── mm_ensdb95_tx2gene.tsv
    │       ├── mm_mitochondrial_genes.tsv
    │       └── normalized
    │           └── TM_normalized.rds -> /shared/data/training-data/tabula-muris/normalized/TM_normalized.rds
    ├── diagrams
    │   ├── full-length_1.png
    │   ├── full-length_2.png
    │   ├── full-length_3.png
    │   ├── glioblastoma_dir_structure.png
    │   ├── overview_workflow.png
    │   ├── tag-based_1.png
    │   ├── tag-based_2.png
    │   └── tag-based_3.png
    ├── figures
    │   ├── gbm_figure.jpg
    │   ├── pca_tabula_muris.png
    │   └── sce_structure.png
    ├── pca_tabula_muris.png
    ├── qc-reports
    │   └── Bad_Example_10X_P4_2_alevinqc.html
    └── scRNA-seq.Rproj

25 directories, 60 files
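
Because this layout relies on many symlinks into /shared/data/training-data/, it may be worth checking for links whose targets have moved or been removed. The following is one possible check, not something taken from the repository; `broken_links` is a hypothetical helper, and the demo files are placeholders.

```shell
set -eu

# Print every symlink under directory $1 whose target does not exist.
# `test -e` follows the link, so it fails for dangling links.
broken_links() {
  find "$1" -type l ! -exec test -e {} \; -print
}

# Self-contained demo: one valid link and one dangling link.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/shared/models" "$demo_root/module/models"
touch "$demo_root/shared/models/good.RDS"
ln -s "$demo_root/shared/models/good.RDS" "$demo_root/module/models/good.RDS"
ln -s "$demo_root/shared/models/missing.RDS" "$demo_root/module/models/missing.RDS"

broken="$(broken_links "$demo_root/module")"
echo "$broken"
```

On the server, the same function could be pointed at a directory such as /shared/data/training-modules/2020-july/ before a workshop starts.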

The magic directory is where "cooking show magic" files and backups for processed data live.

$ tree /shared/data/training-modules/magic/
/shared/data/training-modules/magic/
├── 2020-july
│   ├── machine-learning
│   │   └── models
│   │       └── pbta-medulloblastoma-plier.RDS
│   └── scRNA-seq
│       ├── analysis
│       │   └── glioblastoma
│       │       └── markers
│       │           ├── Astocyte_markers.tsv
│       │           ├── Immune cell_markers.tsv
│       │           ├── Neoplastic_markers.tsv
│       │           ├── Neuron_markers.tsv
│       │           ├── OPC_markers.tsv
│       │           ├── Oligodendrocyte_markers.tsv
│       │           └── Vascular_markers.tsv
│       ├── data
│       │   ├── glioblastoma
│       │   │   ├── filtered
│       │   │   │   └── filtered_count_matrix.tsv
│       │   │   ├── hs_mitochondrial_genes.tsv
│       │   │   └── normalized
│       │   │       ├── glioblastoma_sce.RDS
│       │   │       └── scran_norm_gene_matrix.tsv
│       │   └── tabula-muris
│       │       ├── TM_droplet_metadata.csv
│       │       ├── alevin-quant
│       │       │   ├── 10X_P4_3
│       │       │   │   ├── alevin
│       │       │   │   │   ├── alevin.log
│       │       │   │   │   ├── featureDump.txt
│       │       │   │   │   ├── predictions.txt
│       │       │   │   │   ├── quants_mat.gz
│       │       │   │   │   ├── quants_mat_cols.txt
│       │       │   │   │   ├── quants_mat_rows.txt
│       │       │   │   │   ├── quants_tier_mat.gz
│       │       │   │   │   ├── raw_cb_frequency.txt
│       │       │   │   │   └── whitelist.txt
│       │       │   │   ├── aux_info
│       │       │   │   │   ├── alevin_meta_info.json
│       │       │   │   │   ├── ambig_info.tsv
│       │       │   │   │   ├── expected_bias.gz
│       │       │   │   │   ├── fld.gz
│       │       │   │   │   ├── meta_info.json
│       │       │   │   │   ├── observed_bias.gz
│       │       │   │   │   └── observed_bias_3p.gz
│       │       │   │   ├── cmd_info.json
│       │       │   │   ├── libParams
│       │       │   │   │   └── flenDist.txt
│       │       │   │   ├── lib_format_counts.json
│       │       │   │   └── logs
│       │       │   │       └── salmon_quant.log
│       │       │   └── 10X_P7_0
│       │       │       ├── alevin
│       │       │       │   ├── alevin.log
│       │       │       │   ├── featureDump.txt
│       │       │       │   ├── predictions.txt
│       │       │       │   ├── quants_mat.gz
│       │       │       │   ├── quants_mat_cols.txt
│       │       │       │   ├── quants_mat_rows.txt
│       │       │       │   ├── quants_tier_mat.gz
│       │       │       │   ├── raw_cb_frequency.txt
│       │       │       │   └── whitelist.txt
│       │       │       ├── aux_info
│       │       │       │   ├── alevin_meta_info.json
│       │       │       │   ├── ambig_info.tsv
│       │       │       │   ├── expected_bias.gz
│       │       │       │   ├── fld.gz
│       │       │       │   ├── meta_info.json
│       │       │       │   ├── observed_bias.gz
│       │       │       │   └── observed_bias_3p.gz
│       │       │       ├── cmd_info.json
│       │       │       ├── libParams
│       │       │       │   └── flenDist.txt
│       │       │       ├── lib_format_counts.json
│       │       │       └── logs
│       │       │           └── salmon_quant.log
│       │       ├── mm_ensdb95_tx2gene.tsv
│       │       └── mm_mitochondrial_genes.tsv
│       └── qc-reports
│           ├── 10X_P4_3_qc_report.html
│           └── Bad_Example_10X_P4_2_alevinqc.html
└── 2020-june
    └── intro-to-R-tidyverse
        ├── GSE19578.tsv
        ├── GSE44971.tsv
        ├── cleaned_metadata_GSE44971.tsv
        ├── gene_results_GSE44971.tsv
        ├── metadata_GSE19578.tsv
        └── metadata_GSE44971.tsv

26 directories, 63 files

The training-data directory is where shared data files for modules live.

$ tree -L 2 /shared/data/training-data/
/shared/data/training-data/
├── NB_cell_line_tximport.RDS
├── SRR585570
│   ├── aux_info
│   ├── cmd_info.json
│   ├── libParams
│   ├── lib_format_counts.json
│   ├── logs
│   └── quant.sf
├── darmanis
│   ├── darmanis_metadata.tsv
│   ├── qc_reports
│   ├── salmon_quant
│   ├── salmon_quant_untrimmed
│   ├── sample_list.csv
│   ├── tximport
│   └── tximport_untrimmed
├── gastric_cancer_tximport.RDS
├── machine-learning
│   ├── data
│   └── models
├── pathway-analysis
│   └── data
└── tabula-muris
    ├── TM_droplet_metadata.csv
    ├── alevin
    ├── bam
    ├── fastq
    ├── normalized
    └── qc-reports

21 directories, 8 files

jashapiro commented 3 years ago

Most of the content of this is covered in #327 & #339, though there may be later changes as well.