QuackenbushLab / yarn

YARN: Robust Multi-Tissue RNA-Seq Preprocessing and Normalization
13 stars 2 forks source link

extract brain tissue from gtex gene count #5

Open ghost opened 2 years ago

ghost commented 2 years ago

Hello, I have downloaded GTEx rna-seq data from this site: wget https://storage.googleapis.com/gtex_analysis_v8/rna_seq_data/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct.gz I want to extract only brain tissue gene expression data from this file. I checked yarn package and does it has any function that can perform this function? checkTissuesToMerge(obj, majorGroups, minorGroups, filterFun = NULL,plotFlag = TRUE, ...) But this function will merge tissue based on gene expression file. I just specifically want to extract brain tissue gene expression data only. Can you please provide me some guidance on how to proceed with this? Thank you

Xenophong commented 2 years ago

hi, even if I am not the author of this package, I think this is a more general question. You have to download the sample attribute manifest file, find which sample ids belong to the brain tissues, and then extract those samples from your gene expression file.

marouenbg commented 11 months ago

Hi, we moved the software to netZooR, we will be happy to help you here: https://github.com/netZoo/netZooR

gmillerscripps commented 11 months ago

I don't know if you've looked into the netZooR package that @marouenbg mentioned (I didn't know that support had transitioned to netZooR until I read this comment today), but yes @Xenophong is correct that you need the sample attribute data. For version 8 it would be: https://storage.googleapis.com/adult-gtex/annotations/v8/GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt

You can then look at column SMTSD and grepl for "Brain" (at least that is what I did), and then match the SAMPIDs in the phenodata to the column names of the gene reads data.

I just downloaded the files from GTEx to local and read them in that way. Most of this is directly from yarn::downloadGTEx but I had to modify some things.



  # Reading in Sample Phenotype Data
  # Data should contain:
  #' SAMPID
  #' SMTSD
  pd <- read.delim(file.path(data.dir, samp_pheno))

  message("Sample Phenotype Found")

  # Set as matrix and get row names
  pd <- as.matrix(pd)
  rownames(pd) <- pd[, "SAMPID"]

  # Subset if tissue subset given
tissue_subset <- "Brain"

    # Look for any matches on the tissue subset
    subset_rows <- which(grepl(paste0(tolower(tissue_subset), collapse  = "|"), tolower(pd[, 'SMTSD'])))
    pd <- pd[subset_rows, ]

  options <- unique(pd[, 'SMTSD']) 
  message("Unique Tissues:")
  message(paste0(options, collapse = '\n'))

  # Get IDs
  ids <- sapply(strsplit(pd[, "SAMPID"], "-"), function(i) paste(i[1:2], 
                                                                 collapse = "-"))

  # Read in the counts data
  #' Assuming data is in .gct format
  message("Reading in count data") 

  cnts_raw <- data.table::fread(file = file.path(data.dir, gene_file))
  print(dim(cnts_raw))
  # Get unique samples
  cnt_names <- unique(names(cnts_raw))

  # Join samples with sample phenotype data
  all <- data.frame(SAMPID = cnt_names) %>%
    left_join(
      as.data.frame(pd[, c("SAMPID","SMTSD")]), by = 'SAMPID'
    )

  # Subset to tissue subset if provided
  subset_samples <- all %>%
      filter(.,grepl(paste0(tolower(tissue_subset), collapse = '|'), tolower(SMTSD)))   %>%
    .$SAMPID

  # Need to change dashes to "." for comparing against pd data
  subset_samples <- gsub("\\-", ".", subset_samples)
  subset_samples <- subset_samples[grepl("GTEX", subset_samples)]

  # Subset to given subset    
  names(cnts_raw) <- gsub("\\-", "\\.", names(cnts_raw))
  cnts <- cnts_raw[,c("Name", "Description", subset_samples), with = FALSE]

  names(cnts) <- gsub("\\.", "\\-", names(cnts))

  # Get unique genes
  genes <- unlist(cnts[, 1])
  geneNames <- unlist(cnts[, 2])

  # Get the full names for those genes
  counts <- cnts[, -c(1:3)]

  counts <- as.matrix(counts)
  mode(counts) <- "integer"

  # Set the names of the counts to those genes
  rownames(counts) <- genes
marouenbg commented 11 months ago

Hi @gmillerscripps , Apologies for the confusion, yarn is not maintained in netZooR but rather just called as a dependency. I will have to check which repo is the source code for bioconductor ad will let you know asap. Best, Marouen

marouenbg commented 7 months ago

Hi @gmillerscripps and al, This took too long but we now moved YARN in netZooR to maintain it long term https://github.com/netZoo/netZooR/pull/314 Please let me know if you want anything changed/updated.

Marouen