TCP-Lab / SeqLoader

Constructors and methods for `xSeries` and `xModel` S3 classes
0 stars 0 forks source link

SeqLoader Package

Constructors and methods for xSeries and xModel S3 classes

Motivation

The S3 xSeries class attempts to provide a convenient R representation of a typical RNA-Seq output consisting of a series of sequencing runs, usually each of them referring to a different RNA sample, from a different biological source. Building on this, the S3 xModel class addresses the need to group several independent scientific studies for the purpose of meta-analysis or reanalysis. To describe this hierarchical data structure, we will use a terminology primarily modeled after the one adopted by the partners of the International Nucleotide Sequence Database Collaboration (INSDC) in the context of the Sequence Read Archive (SRA). See here for a friendly explanation of the INSDC and the related SRA data model.

Hierarchy and Terminology

  1. Run: each sequencing run directly generating a single FASTQ file (or file pair in the case of non-interleaved PE reads). Within the INSDC framework, Run metadata objects are referred to by an accession following the pattern (E|D|S)RR[0-9]{6,}.
  2. Sample: the set of all Runs from the same biological RNA sample. The corresponding INSDC metadata objects are Samples (accession (E|D|S)RS[0-9]{6,}) or, equivalently, BioSamples (accession SAM(E|D|N)[A-Z]?[0-9]+). Even GEO Sample IDs (accession GSM[0-9]+) can be found if data were originally brokered to INSDC’s SRA by GEO.
  3. Series: the set of all Samples/Runs pertaining to a given biological research project or experimental design, across all conditions of interest to that particular study. This is roughly what INSDC refers to as Study (accession (E|D|S)RP[0-9]{6,}) or, equivalently, BioProject (accession PRJ(E|D|N)[A-Z][0-9]+). If data were originally brokered by GEO or ArrayExpress, GEO Series IDs (accession GSE[0-9]+) or AE Experiment IDs (accession E-[A-Z]{4}-[0-9]+) will be also present in SRA DBs and can be used as Study alias accessions, respectively.
  4. Model: a collection of Series dealing with the same biological model. Since this object is characteristic of the meta-analytic level, which by definition involves multiple independent studies, it has no counterpart in INSDC.

[!NOTE] Most of the times, Runs and Samples are the same thing, however it could happen that a single RNA Sample is sequenced through multiple Runs (also referred to as 'technical replicates'). Thus, the only ID that is ensured to be unique on a per-file basis (even in the case of technical replicates), is the Run accession. For this reason, it will be used here as the base reference for the construction of xSeries and xModel data structures.

SeqLoader Package

General

SeqLoader implements S3 xSeries and xModel classes to provide an integrated representation of RNA-Seq data (i.e., read counts), gene annotation, and Sample/Run metadata for Series and Models, respectively.

Both new_xSeries and new_xModel SeqLoader constructors assume that RNA-Seq low-level analysis (i.e., read alignment and transcript abundance quantification) has already been performed and that the starting point is a matrix of raw or normalized counts and their related metadata. This means that xSeries and xModel objects can only be constructed from files containing such information and not by direct data input.

With the exclusion of a few methods designed to filter and subset structures ([...], pruneRuns, keepRuns), all the other methods of the class access the data structures in read-only mode, leaving the object completely unchanged. SeqLoader is indeed designed to facilitate high-level analysis operations (i.e., downstream of the count matrix) such as data representation or descriptive or inferential statistical analysis, in both absolute and differential expression contexts.

xSeries objects

These S3 objects are constructed from a gene expression matrix and a metadata table related to a single study. Within the expression matrix, each row represents a gene (or transcript), while each column refers to a different sequencing Run, with the possible addition of one or more columns for gene annotations (e.g., gene symbol, gene name, gene type, ...). In contrast, the metadata table is assumed to dedicate one row for each Run and one column for each metadata of interest, among which a mandatory field for Run accessions. So, xSeries objects combine information from a typical gene expression matrix (possibly including gene annotations) with metadata about individual Runs.

More technically, an xSeries object is a named list containing one element for each Run of the study in question, and an additional element for gene annotations (which are common to all Runs). Each Run element is itself a list containing relevant metadata for the specific Run, and a gene data frame with raw or normalized counts for each gene. The annotation element, on the contrary, is a simple data frame (see tree graph below). Notice that, Samples are not explicitly represented in xSeries objects. However, a method will be implemented to collapse Runs to the Samples they belong in the case of technical replicates (see issue #4).

[!IMPORTANT] For the purpose of a data re-analysis or meta-analysis, one may not necessarily be interested in all the samples or experimental conditions included in the the original study. Rather, just a selection of them is usually considered. For this reason, recomputed count matrices could feature less columns (i.e., Runs) compared to the number of rows within the complete metadata table. By construction, xSeries objects are made up of all Runs for which there exist a metadata entry (i.e., usually all the Runs from the original study), but only the ones included in the loaded count matrices will also feature a counts column in the genes data frame. In any case, one can always use the pruneRuns method to easily remove all the Runs with no count data from xSeries or xModel objects, reducing them to the sole columns actually present in the count matrix used for object construction.

xModel objects

Structurally speaking, these S3 objects are just collections (lists) of many xSeries related to the same biological model. Rather, the focus here is on functions. Methods implementing generics for this class are primarily meant to aid in meta-analytic data synthesis (see geneStats in particular).

Here is a graph of a generic xModel made up of n xSeries. The n-th Series consists of m Runs and the gene annotation element. Each Run contains a gene dataframe for the counts and an arbitrary number of metadata, including the mandatory ena_run ID.

Model    : xModel
 │
 ├──$ Series_1    : xSeries
 ├──$ Series_2    : xSeries
 ├──$ ...
 └──$ Series_n    : xSeries
    │
    ├──$ Run_1       : list
    ├──$ Run_2       : list
    ├──$ Run_3       : list
    ├──$ ...
    ├──$ Run_m       : list
    │  │
    │  ├──$ ena_sample_title  : chr
    │  ├──$ geo_series        : chr
    │  ├──$ geo_sample        : chr
    │  ├──$ ena_project       : chr
    │  ├──$ ena_sample        : chr
    │  ├──$ ena_run           : chr
    │  ├──$ read_count        : int
    │  ├──$ library_layout    : chr
    │  ├──$ extra             : int
    │  ├──$ ...
    │  └──$ genes             : data.frame
    │     ├──$ IDs               : chr
    │     └──$ counts            : num
    │
    └──$ annotation  : data.frame
       ├──$ IDs         : chr
       ├──$ SYMBOL      : chr
       ├──$ GENENAME    : chr
       ├──$ GENETYPE    : chr 
       └──$ ...

Formal Requirements

A number of (hopefully reasonable) assumptions upon file and data organization are made for object construction to be successful.

  1. Each Series (or Study) is represented by two CSV or TSV files, containing respectively the read counts (along with possible gene annotation) and metadata for all the Runs making up the Series.
  2. All Series related to the same Model are stored in the same directory, here referred to as the target directory for xModel construction. Consistently, each Model is expected to live in a separate target directory.
  3. Both data and metadata files have names starting with the same Series accession ID (any of the INSDC BioProject, ENA Study, or GEO Series format is fine), followed by an underscore (_) and either the distinctive string CountMatrix or meta (ignoring case) if they are counts or metadata, respectively. Any additional characters after this pattern are allowed, as long as the terminal filename extension is either CSV or TSV (again ignoring case). Examples of well-formed filenames are:
    GSE205739_CountMatrix_genes_TPM.tsv
    PRJNA847413_countMATRIX.CSV
    PRJNA847413_meta.csv
    SRP379268_MeTaDaTa.TSV
  4. Metadata tables have proper column names as header, among which only ena_run is mandatory, as the only ID that is certainly unique to each FASTQ file in the original study (i.e., Series).
  5. Count tables have proper column names as header, including a mandatory gene ID column matching the regex gene.*id|transcript.*id|ENSEMBL|ENSEMBLTRANS (usually being gene_id or transcript_id).
  6. Within the count table, each count-containing column is named using the corresponding ena_run ID in the header (along with other possible text strings).
  7. Input counts from CountMatrix file can be either raw or normalized counts, but in any case they are supposed to be in linear scale (not log-transformed)

[!TIP] Most of these requirements are automatically met if the low-level RNA-Seq data analysis is performed using x.FASTQ.