Open jashapiro opened 4 years ago
Here's an abbreviated version of how we handle file paths for RNA-seq:
.
├── 00a-reproducibility_cmdline.md
├── 00b-additional_approaches.md
├── 01-qc_trim_quant.html
├── 01-qc_trim_quant.md
├── 02-gastric_cancer_tximport-live.Rmd
├── 02-gastric_cancer_tximport-live.nb.html
├── 03-gastric_cancer_exploratory-live.Rmd
├── 03-gastric_cancer_exploratory-live.nb.html
├── 03b-exploratory_data_analysis_exercise.Rmd
├── 04-nb_cell_line_tximport.md
├── 05-nb_cell_line_DESeq2-live.Rmd
├── 05-nb_cell_line_DESeq2-live.nb.html
├── 06-bulk_rnaseq_exercise.Rmd
├── RNA-seq.Rproj
├── data
│   ├── fastq
│   │   └── gastric_cancer
│   │       └── SRR585570
│   │           ├── SRR585570_1.fastq.gz -> /shared/data/training-modules/RNA-seq/data/fastq/gastric_cancer/SRR585570/SRR585570_1.fastq.gz
│   │           ├── SRR585570_2.fastq.gz -> /shared/data/training-modules/RNA-seq/data/fastq/gastric_cancer/SRR585570/SRR585570_2.fastq.gz
│   │           ├── SRR585570_fastp_1.fastq.gz
│   │           └── SRR585570_fastp_2.fastq.gz
├── index
│   └── Homo_sapiens
│       ├── Homo_sapiens.GRCh38.95_tx2gene.tsv -> /shared/data/reference/tx2gene/Homo_sapiens.GRCh38.95_tx2gene.tsv
│       └── short_index
│           ├── complete_ref_lens.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//complete_ref_lens.bin
│           ├── ctable.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//ctable.bin
│           ├── ctg_offsets.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//ctg_offsets.bin
│           ├── duplicate_clusters.tsv -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//duplicate_clusters.tsv
│           ├── info.json -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//info.json
│           ├── mphf.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//mphf.bin
│           ├── pos.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//pos.bin
│           ├── pre_indexing.log -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//pre_indexing.log
│           ├── rank.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//rank.bin
│           ├── refAccumLengths.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//refAccumLengths.bin
│           ├── ref_indexing.log -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//ref_indexing.log
│           ├── reflengths.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//reflengths.bin
│           ├── refseq.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//refseq.bin
│           ├── seq.bin -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//seq.bin
│           └── versionInfo.json -> /shared/data/reference/refgenie/hg38_cdna/salmon_index/short/short//versionInfo.json
├── plots
│   └── NB_cell_line_heatmap.png
├── results
│   ├── NB_cell_line_DESeq_amplified_v_nonamplified.RDS
│   └── NB_cell_line_DESeq_amplified_v_nonamplified_results.tsv
└── scripts
    └── run_SRR585570.sh
which is more similar to the second option you describe. This was set up, in part, to reduce the modifications to paths we would have to make for virtual vs. in-person workshops that were run on participants' own laptops. That fact is not a vote to set things up this way, but I do like the idea of keeping things similar to what a real project would be like. Setting up symlinks does seem like it may take more time/effort, though.
I think that was the way I was leaning, with the difference that I will link at the directory level, rather than separately to each file within the index, for example.
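To make the difference concrete, here is a minimal sketch contrasting per-file links with a single directory-level link. It runs in a temporary sandbox: the `shared_index` and `module_index` locations and the two placeholder index files are assumptions standing in for the real `/shared/data` paths, not the actual setup commands.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sandbox standing in for the real shared and module directories (hypothetical paths).
sandbox="$(mktemp -d)"
shared_index="$sandbox/shared/salmon_index/short"
module_index="$sandbox/module/index/Homo_sapiens"
mkdir -p "$shared_index" "$module_index"
touch "$shared_index/ctable.bin" "$shared_index/info.json"  # placeholder index files

# Per-file approach: a real directory containing one symlink per index file.
mkdir -p "$module_index/per_file_index"
for f in "$shared_index"/*; do
  ln -s "$f" "$module_index/per_file_index/$(basename "$f")"
done

# Directory-level approach: a single symlink covers the whole index.
ln -s "$shared_index" "$module_index/short_index"

ls -l "$module_index"
```

With the directory-level link, files added to the shared index later show up in the module without any further linking, at the cost of not being able to mix linked and local files in that directory.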
And that's what the raw subfolder helps with!
I wanted to document the current 2020-july directory structure as we head into training here:
$ tree /shared/data/training-modules/2020-july/
/shared/data/training-modules/2020-july/
├── machine-learning
│   ├── 01-openpbta_heatmap-live.Rmd
│   ├── 02-openpbta_consensus_clustering-live.Rmd
│   ├── 03-openpbta_PLIER-live.Rmd
│   ├── 04-openpbta_plot_LV-live.Rmd
│   ├── 05-machine_learning_exercise.Rmd
│   ├── data
│   │   ├── GSE116436 -> /shared/data/training-data/machine-learning/data/GSE116436/
│   │   └── open-pbta
│   │       ├── download -> /shared/data/training-data/machine-learning/data/open-pbta/download
│   │       └── processed
│   │           ├── pbta-histologies-stranded-rnaseq.tsv -> /shared/data/training-data/machine-learning/data/open-pbta/processed/pbta-histologies-stranded-rnaseq.tsv
│   │           └── pbta-vst-stranded.tsv -> /shared/data/training-data/machine-learning/data/open-pbta/processed/pbta-vst-stranded.tsv
│   ├── diagrams
│   │   ├── mao_nature_methods_fig1.png
│   │   ├── monti_gaussian3_cdf_delta.png
│   │   ├── monti_gaussian3_consensus_matrix.png
│   │   └── monti_gaussian5.png
│   ├── machine-learning.Rproj
│   ├── models
│   │   ├── NCI60_PLIER_model.RDS -> /shared/data/training-data/machine-learning/models/NCI60_PLIER_model.RDS
│   │   └── NCI60_prcomp_results.RDS -> /shared/data/training-data/machine-learning/models/NCI60_prcomp_results.RDS
│   └── setup
│       ├── 00-data-download.sh
│       ├── 01-transform-rnaseq.Rmd
│       ├── 01-transform-rnaseq.nb.html
│       └── README.md
├── pathway-analysis
│   ├── 01-overrepresentation_analysis-live.Rmd
│   ├── 02-gene_set_enrichment_analysis-live.Rmd
│   ├── 03-gene_set_variation_analysis-live.Rmd
│   ├── 04-pathway_analysis_exercise.Rmd
│   ├── data -> /shared/data/training-data/pathway-analysis/data/
│   ├── diagrams
│   │   ├── hanzelmann_fig1.jpg
│   │   └── subramanian_fig1.jpg
│   ├── pathway-analysis.Rproj
│   ├── results
│   │   └── gene-metrics
│   │       ├── nb_cell_line_mycn_amplified_v_nonamplified.tsv
│   │       └── pdx_medulloblastoma_treatment_dge.tsv
│   └── setup
│       ├── 01-prepare_NB_cell_line.Rmd
│       ├── 01-prepare_NB_cell_line.nb.html
│       ├── 02-prepare_openpbta_MB_data.Rmd
│       ├── 02-prepare_openpbta_MB_data.nb.html
│       └── README.md
└── scRNA-seq
    ├── 00-scRNA-seq_introduction.html
    ├── 00-scRNA-seq_introduction.md
    ├── 01-filtering_scRNA-seq-live.Rmd
    ├── 02-normalizing_scRNA-seq-live.Rmd
    ├── 03-scrnaseq_day1_exercise.Rmd
    ├── 04-tag-based_scRNA-seq_processing-live.Rmd
    ├── 05-dimension_reduction_scRNA-seq-live.Rmd
    ├── 06-scrnaseq_day2_exercise.Rmd
    ├── README.md
    ├── data
    │   ├── glioblastoma
    │   │   ├── hs_mitochondrial_genes.tsv
    │   │   └── preprocessed -> /shared/data/training-data/darmanis
    │   └── tabula-muris
    │       ├── TM_droplet_metadata.csv
    │       ├── fastq-raw -> /shared/data/training-data/tabula-muris/fastq
    │       ├── mm_ensdb95_tx2gene.tsv
    │       ├── mm_mitochondrial_genes.tsv
    │       └── normalized
    │           └── TM_normalized.rds -> /shared/data/training-data/tabula-muris/normalized/TM_normalized.rds
    ├── diagrams
    │   ├── full-length_1.png
    │   ├── full-length_2.png
    │   ├── full-length_3.png
    │   ├── glioblastoma_dir_structure.png
    │   ├── overview_workflow.png
    │   ├── tag-based_1.png
    │   ├── tag-based_2.png
    │   └── tag-based_3.png
    ├── figures
    │   ├── gbm_figure.jpg
    │   ├── pca_tabula_muris.png
    │   └── sce_structure.png
    ├── pca_tabula_muris.png
    ├── qc-reports
    │   └── Bad_Example_10X_P4_2_alevinqc.html
    └── scRNA-seq.Rproj
25 directories, 60 files
The magic directory, where "cooking show magic" files and backups for processed data live:
$ tree /shared/data/training-modules/magic/
/shared/data/training-modules/magic/
├── 2020-july
│   ├── machine-learning
│   │   └── models
│   │       └── pbta-medulloblastoma-plier.RDS
│   └── scRNA-seq
│       ├── analysis
│       │   └── glioblastoma
│       │       └── markers
│       │           ├── Astocyte_markers.tsv
│       │           ├── Immune cell_markers.tsv
│       │           ├── Neoplastic_markers.tsv
│       │           ├── Neuron_markers.tsv
│       │           ├── OPC_markers.tsv
│       │           ├── Oligodendrocyte_markers.tsv
│       │           └── Vascular_markers.tsv
│       ├── data
│       │   ├── glioblastoma
│       │   │   ├── filtered
│       │   │   │   └── filtered_count_matrix.tsv
│       │   │   ├── hs_mitochondrial_genes.tsv
│       │   │   └── normalized
│       │   │       ├── glioblastoma_sce.RDS
│       │   │       └── scran_norm_gene_matrix.tsv
│       │   └── tabula-muris
│       │       ├── TM_droplet_metadata.csv
│       │       ├── alevin-quant
│       │       │   ├── 10X_P4_3
│       │       │   │   ├── alevin
│       │       │   │   │   ├── alevin.log
│       │       │   │   │   ├── featureDump.txt
│       │       │   │   │   ├── predictions.txt
│       │       │   │   │   ├── quants_mat.gz
│       │       │   │   │   ├── quants_mat_cols.txt
│       │       │   │   │   ├── quants_mat_rows.txt
│       │       │   │   │   ├── quants_tier_mat.gz
│       │       │   │   │   ├── raw_cb_frequency.txt
│       │       │   │   │   └── whitelist.txt
│       │       │   │   ├── aux_info
│       │       │   │   │   ├── alevin_meta_info.json
│       │       │   │   │   ├── ambig_info.tsv
│       │       │   │   │   ├── expected_bias.gz
│       │       │   │   │   ├── fld.gz
│       │       │   │   │   ├── meta_info.json
│       │       │   │   │   ├── observed_bias.gz
│       │       │   │   │   └── observed_bias_3p.gz
│       │       │   │   ├── cmd_info.json
│       │       │   │   ├── libParams
│       │       │   │   │   └── flenDist.txt
│       │       │   │   ├── lib_format_counts.json
│       │       │   │   └── logs
│       │       │   │       └── salmon_quant.log
│       │       │   └── 10X_P7_0
│       │       │       ├── alevin
│       │       │       │   ├── alevin.log
│       │       │       │   ├── featureDump.txt
│       │       │       │   ├── predictions.txt
│       │       │       │   ├── quants_mat.gz
│       │       │       │   ├── quants_mat_cols.txt
│       │       │       │   ├── quants_mat_rows.txt
│       │       │       │   ├── quants_tier_mat.gz
│       │       │       │   ├── raw_cb_frequency.txt
│       │       │       │   └── whitelist.txt
│       │       │       ├── aux_info
│       │       │       │   ├── alevin_meta_info.json
│       │       │       │   ├── ambig_info.tsv
│       │       │       │   ├── expected_bias.gz
│       │       │       │   ├── fld.gz
│       │       │       │   ├── meta_info.json
│       │       │       │   ├── observed_bias.gz
│       │       │       │   └── observed_bias_3p.gz
│       │       │       ├── cmd_info.json
│       │       │       ├── libParams
│       │       │       │   └── flenDist.txt
│       │       │       ├── lib_format_counts.json
│       │       │       └── logs
│       │       │           └── salmon_quant.log
│       │       ├── mm_ensdb95_tx2gene.tsv
│       │       └── mm_mitochondrial_genes.tsv
│       └── qc-reports
│           ├── 10X_P4_3_qc_report.html
│           └── Bad_Example_10X_P4_2_alevinqc.html
└── 2020-june
    └── intro-to-R-tidyverse
        ├── GSE19578.tsv
        ├── GSE44971.tsv
        ├── cleaned_metadata_GSE44971.tsv
        ├── gene_results_GSE44971.tsv
        ├── metadata_GSE19578.tsv
        └── metadata_GSE44971.tsv
26 directories, 63 files
The training-data directory. This is where shared data files for modules live.
$ tree -L 2 /shared/data/training-data/
/shared/data/training-data/
├── NB_cell_line_tximport.RDS
├── SRR585570
│   ├── aux_info
│   ├── cmd_info.json
│   ├── libParams
│   ├── lib_format_counts.json
│   ├── logs
│   └── quant.sf
├── darmanis
│   ├── darmanis_metadata.tsv
│   ├── qc_reports
│   ├── salmon_quant
│   ├── salmon_quant_untrimmed
│   ├── sample_list.csv
│   ├── tximport
│   └── tximport_untrimmed
├── gastric_cancer_tximport.RDS
├── machine-learning
│   ├── data
│   └── models
├── pathway-analysis
│   └── data
└── tabula-muris
    ├── TM_droplet_metadata.csv
    ├── alevin
    ├── bam
    ├── fastq
    ├── normalized
    └── qc-reports
21 directories, 8 files
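Since the module directories rely on symlinks into training-data, it may be worth checking for dangling links after anything is moved. The sketch below is one possible check, not part of the documented setup; it uses a temporary sandbox with made-up file names (`present.tsv`, `missing.tsv`) rather than the real paths.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sandbox with one valid and one dangling symlink (all names are hypothetical).
sandbox="$(mktemp -d)"
mkdir -p "$sandbox/training-data" "$sandbox/module/data"
touch "$sandbox/training-data/present.tsv"
ln -s "$sandbox/training-data/present.tsv" "$sandbox/module/data/present.tsv"
ln -s "$sandbox/training-data/missing.tsv" "$sandbox/module/data/missing.tsv"

# Print every symlink whose target does not exist.
# `test -e` follows the link, so it fails only for dangling links.
broken_links="$(find "$sandbox/module" -type l ! -exec test -e {} \; -print)"
echo "$broken_links"
```

Pointing the `find` command at a real module directory instead of the sandbox would list any links that no longer resolve.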
Most of the content of this is covered in #327 & #339, though there may be later changes as well.
The current directory structure for the scRNA module set looks something like this:
There isn't anything inherently wrong with this, but now that we are using the RStudio server, we plan to store the raw data in /shared/data/ to avoid unnecessary duplication.
There are two options that I see. The first is to leave things mostly as they are, but remove the raw data and put the paths to the files directly in the notebooks. This would look something like this, using the current paths in /shared/data:
Note that this uses ~/shared-data/ because this symlink, which we set up for every user, is visible in the RStudio file browser without having to use "Go to folder".
The other option is to add moar symlinks within the module, so the new directory structure might look like this:
This would more closely mirror a real project, as the data would appear to be in the module folder and could be referred to that way. It can also encourage the practice that a data folder is for reading only, and anything you write goes in a separate location (though I am not too strict about this myself, as long as there is a raw subfolder of some kind within data).
Thoughts on which of these two options is preferred, or modifications to the proposed structures?
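For anyone weighing the symlink option, here is a minimal sketch of how it could work in practice. It is not the proposed structure itself: the sandbox layout, the `counts.tsv` placeholder, and the `results/line_count.txt` output are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sandbox standing in for shared storage and one module folder (hypothetical layout).
sandbox="$(mktemp -d)"
shared_raw="$sandbox/shared/data/raw"
module="$sandbox/module"
mkdir -p "$shared_raw" "$module/data" "$module/results"
printf 'gene\tcount\n' > "$shared_raw/counts.tsv"  # placeholder raw data file

# The module's data/raw is a symlink, so notebooks can keep using relative paths.
ln -s "$shared_raw" "$module/data/raw"

cd "$module"
# Read from data/raw (treated as read-only) and write only to results/.
wc -l < data/raw/counts.tsv > results/line_count.txt

# Show where the link actually points.
readlink data/raw
```

Code run from the module folder refers to `data/raw/...` exactly as it would in a standalone project, while the files themselves live in one shared location.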