d3b-center / hope-cohort-analysis

Analysis for HOPE cohort

Update merge files code #105

Open komalsrathi opened 2 weeks ago


Quick update:

I first updated the histologies base file to add the two missing samples and soft-linked it under data/. I then re-ran the modified scripts (changes: removed title case from the CNV file, updated the path to gencode v39) to generate the merged files for the v3 release.

In addition to the updated histologies file, I have updated the following merged files and uploaded them to s3 (v3 folder):

results
├── Hope-cnv-controlfreec-tumor-only.rds
├── Hope-cnv-controlfreec.rds
├── Hope-fusion-putative-oncogenic.rds
├── Hope-gene-counts-rsem-expected_count-collapsed.rds
├── Hope-gene-counts-rsem-expected_count.rds
├── Hope-gene-expression-rsem-tpm-collapsed.rds
├── Hope-gene-expression-rsem-tpm.rds
├── Hope-snv-consensus-plus-hotspots.maf.tsv.gz
├── Hope-tumor-only-snv-mutect2.maf.tsv.gz
└── md5sum.txt

In md5sum.txt, I have only updated the md5sums for the files listed above (those generated by my merge script).
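
For anyone who wants to confirm the checksums after downloading, here is a minimal sketch in R. It assumes md5sum.txt uses the standard `md5sum` output format (one `<hash>  <filename>` pair per line) and sits in the same folder as the files; `verify_md5` is a hypothetical helper, not part of the merge script.

```r
# Hypothetical helper: compare files in a folder against md5sum.txt.
# Assumes each manifest line is "<32-char md5>  <filename>" (md5sum format).
verify_md5 <- function(results_dir,
                       manifest = file.path(results_dir, "md5sum.txt")) {
  lines <- trimws(readLines(manifest))
  lines <- lines[nzchar(lines)]

  # First 32 characters are the hash; the rest (after whitespace) is the name
  expected <- tolower(substr(lines, 1, 32))
  files <- basename(sub("^[0-9a-fA-F]{32}\\s+\\*?", "", lines))

  # tools::md5sum() returns NA for files that cannot be read
  observed <- unname(tools::md5sum(file.path(results_dir, files)))

  data.frame(
    file = files,
    ok = !is.na(observed) & expected == observed,
    stringsAsFactors = FALSE
  )
}

# Example (path taken from the comparison in this comment):
# verify_md5("analyses/merge-files/results")
```

Entries for files that were not regenerated, or that are not present locally, would show up with `ok = FALSE` here, so the result needs to be read with that in mind.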

Here is a comparison of sample counts between v2 and the merged files above (v3). Each file's sample count has increased by 2:

> # Counts
> counts_file = readRDS("data/Hope-gene-counts-rsem-expected_count-collapsed.rds")
> length(colnames(counts_file))
[1] 85

> counts_file = readRDS("analyses/merge-files/results/Hope-gene-counts-rsem-expected_count-collapsed.rds")
> length(colnames(counts_file))
[1] 87

> # TPM
> tpm_file = readRDS("data/Hope-gene-expression-rsem-tpm-collapsed.rds")
> length(colnames(tpm_file))
[1] 85

> tpm_file = readRDS("analyses/merge-files/results/Hope-gene-expression-rsem-tpm-collapsed.rds")
> length(colnames(tpm_file))
[1] 87

> # SNV
> snv_file <- data.table::fread("data/Hope-snv-consensus-plus-hotspots.maf.tsv.gz")
> length(unique(snv_file$Tumor_Sample_Barcode))
[1] 71

> snv_file <- data.table::fread("analyses/merge-files/results/Hope-snv-consensus-plus-hotspots.maf.tsv.gz")
> length(unique(snv_file$Tumor_Sample_Barcode))
[1] 73

> # SNV tumor-only
> snv_tumor_only_file <- data.table::fread("data/Hope-tumor-only-snv-mutect2.maf.tsv.gz")
> length(unique(snv_tumor_only_file$Tumor_Sample_Barcode))
[1] 88

> snv_tumor_only_file <- data.table::fread("analyses/merge-files/results/Hope-tumor-only-snv-mutect2.maf.tsv.gz")
> length(unique(snv_tumor_only_file$Tumor_Sample_Barcode))
[1] 90

> # CNV
> cnv_file <- readRDS("data/Hope-cnv-controlfreec.rds")
> length(unique(cnv_file$Kids_First_Biospecimen_ID))
[1] 71

> cnv_file <- readRDS("analyses/merge-files/results/Hope-cnv-controlfreec.rds")
> length(unique(cnv_file$Kids_First_Biospecimen_ID))
[1] 73

> # CNV tumor-only
> cnv_tumor_only_file <- readRDS("data/Hope-cnv-controlfreec-tumor-only.rds")
> length(unique(cnv_tumor_only_file$Kids_First_Biospecimen_ID))
[1] 88

> cnv_tumor_only_file <- readRDS("analyses/merge-files/results/Hope-cnv-controlfreec-tumor-only.rds")
> length(unique(cnv_tumor_only_file$Kids_First_Biospecimen_ID))
[1] 90

> # Fusions
> fusion_file <- readRDS("data/Hope-fusion-putative-oncogenic.rds")
> length(unique(fusion_file$Sample))
[1] 85

> fusion_file <- readRDS("analyses/merge-files/results/Hope-fusion-putative-oncogenic.rds")
> length(unique(fusion_file$Sample))
[1] 87
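
The per-file checks above all follow the same pattern, so they could be wrapped in a small helper that also reports which sample IDs were added. This is only a sketch: `sample_ids` and `compare_release` are hypothetical names, and it assumes that every column of the expression matrices is a sample, as in the `colnames()` checks above.

```r
# Hypothetical helper: extract sample IDs from a merged file.
# id_col = NULL: samples are the columns (counts / TPM matrices).
# Otherwise: samples are the unique values of the named column.
sample_ids <- function(x, id_col = NULL) {
  if (is.null(id_col)) colnames(x) else unique(x[[id_col]])
}

# Compare the v2 and v3 versions of one merged file
compare_release <- function(old, new, id_col = NULL) {
  old_ids <- sample_ids(old, id_col)
  new_ids <- sample_ids(new, id_col)
  list(
    v2 = length(old_ids),
    v3 = length(new_ids),
    added = setdiff(new_ids, old_ids)  # IDs present only in v3
  )
}

# Example, using paths and column names from the checks above:
# compare_release(
#   readRDS("data/Hope-cnv-controlfreec.rds"),
#   readRDS("analyses/merge-files/results/Hope-cnv-controlfreec.rds"),
#   id_col = "Kids_First_Biospecimen_ID"
# )
```

Listing the `added` IDs makes it easy to confirm that the two extra samples in each file are the ones added to the histologies file.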