Open ghost opened 2 years ago
hi, even if I am not the author of this package, I think this is a more general question. You have to download the sample attribute manifest file, find which sample ids belong to the brain tissues, and then extract those samples from your gene expression file.
Hi, we moved the software to netZooR, we will be happy to help you here: https://github.com/netZoo/netZooR
I don't know if you've looked into the netZooR package that @marouenbg mentioned (I didn't know that support had transitioned to netZooR until I read this comment today), but yes @Xenophong is correct that you need the sample attribute data. For version 8 it would be: https://storage.googleapis.com/adult-gtex/annotations/v8/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt
You can then look at column SMTSD and grepl for "Brain" (at least that is what I did), and then match the SAMPIDs in the phenodata to the column names of the gene reads data.
I just downloaded the files from GTEx to local and read them in that way. Most of this is directly from yarn::downloadGTEx but I had to modify some things.
# Reading in Sample Phenotype Data
# Data should contain:
#' SAMPID
#' SMTSD
pd <- read.delim(file.path(data.dir, samp_pheno))
message("Sample Phenotype Found")
# Set as matrix and get row names
pd <- as.matrix(pd)
rownames(pd) <- pd[, "SAMPID"]
# Subset if tissue subset given
tissue_subset <- "Brain"
# Look for any matches on the tissue subset
subset_rows <- which(grepl(paste0(tolower(tissue_subset), collapse = "|"), tolower(pd[, 'SMTSD'])))
pd <- pd[subset_rows, ]
options <- unique(pd[, 'SMTSD'])
message("Unique Tissues:")
message(paste0(options, collapse = '\n'))
# Get IDs
ids <- sapply(strsplit(pd[, "SAMPID"], "-"), function(i) paste(i[1:2],
collapse = "-"))
# Read in the counts data
#' Assuming data is in .gct format
message("Reading in count data")
cnts_raw <- data.table::fread(file = file.path(data.dir, gene_file))
print(dim(cnts_raw))
# Get unique samples
cnt_names <- unique(names(cnts_raw))
# Join samples with sample phenotype data
all <- data.frame(SAMPID = cnt_names) %>%
left_join(
as.data.frame(pd[, c("SAMPID","SMTSD")]), by = 'SAMPID'
)
# Subset to tissue subset if provided
subset_samples <- all %>%
filter(.,grepl(paste0(tolower(tissue_subset), collapse = '|'), tolower(SMTSD))) %>%
.$SAMPID
# Need to change dashes to "." for comparing against pd data
subset_samples <- gsub("\\-", ".", subset_samples)
subset_samples <- subset_samples[grepl("GTEX", subset_samples)]
# Subset to given subset
names(cnts_raw) <- gsub("\\-", "\\.", names(cnts_raw))
cnts <- cnts_raw[,c("Name", "Description", subset_samples), with = FALSE]
names(cnts) <- gsub("\\.", "\\-", names(cnts))
# Get unique genes
genes <- unlist(cnts[, 1])
geneNames <- unlist(cnts[, 2])
# Get the full names for those genes
counts <- cnts[, -c(1:3)]
counts <- as.matrix(counts)
mode(counts) <- "integer"
# Set the names of the counts to those genes
rownames(counts) <- genes
Hi @gmillerscripps , Apologies for the confusion, yarn is not maintained in netZooR but rather just called as a dependency. I will have to check which repo is the source code for bioconductor ad will let you know asap. Best, Marouen
Hi @gmillerscripps and al, This took too long but we now moved YARN in netZooR to maintain it long term https://github.com/netZoo/netZooR/pull/314 Please let me know if you want anything changed/updated.
Marouen
Hello, I have downloaded GTEx rna-seq data from this site: wget https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz I want to extract only brain tissue gene expression data from this file. I checked yarn package and does it has any function that can perform this function? checkTissuesToMerge(obj, majorGroups, minorGroups, filterFun = NULL,plotFlag = TRUE, ...) But this function will merge tissue based on gene expression file. I just specifically want to extract brain tissue gene expression data only. Can you please provide me some guidance on how to proceed with this? Thank you