xSeries
and xModel
S3 classesThe S3 xSeries
class attempts to provide a convenient R representation of a
typical RNA-Seq output consisting of a series of sequencing runs, usually each
of them referring to a different RNA sample, from a different biological source.
Building on this, the S3 xModel
class addresses the need to group several
independent scientific studies for the purpose of meta-analysis or reanalysis.
To describe this hierarchical data structure, we will use a terminology
primarily modeled after the one adopted by the partners of the International
Nucleotide Sequence Database Collaboration (INSDC) in the context of the
Sequence Read Archive (SRA).
See here for a
friendly explanation of the INSDC and the related SRA data model.
(E|D|S)RR[0-9]{6,}
.(E|D|S)RS[0-9]{6,}
) or, equivalently, BioSamples (accession
SAM(E|D|N)[A-Z]?[0-9]+
). Even GEO Sample IDs (accession GSM[0-9]+
) can
be found if data were originally brokered to INSDC’s SRA by GEO.(E|D|S)RP[0-9]{6,}
) or, equivalently, BioProject
(accession PRJ(E|D|N)[A-Z][0-9]+
). If data were originally brokered by GEO
or ArrayExpress, GEO Series IDs (accession GSE[0-9]+
) or AE Experiment
IDs (accession E-[A-Z]{4}-[0-9]+
) will be also present in SRA DBs and can
be used as Study alias accessions, respectively.[!NOTE] Most of the times, Runs and Samples are the same thing, however it could happen that a single RNA Sample is sequenced through multiple Runs (also referred to as 'technical replicates'). Thus, the only ID that is ensured to be unique on a per-file basis (even in the case of technical replicates), is the Run accession. For this reason, it will be used here as the base reference for the construction of xSeries and xModel data structures.
SeqLoader implements S3 xSeries
and xModel
classes to provide an
integrated representation of RNA-Seq data (i.e., read counts), gene annotation,
and Sample/Run metadata for Series and Models, respectively.
Both new_xSeries
and new_xModel
SeqLoader constructors assume that
RNA-Seq low-level analysis (i.e., read alignment and transcript abundance
quantification) has already been performed and that the starting point is a
matrix of raw or normalized counts and their related metadata. This means that
xSeries
and xModel
objects can only be constructed from files containing
such information and not by direct data input.
With the exclusion of a few methods designed to filter and subset structures
([...]
, pruneRuns
, keepRuns
), all the other methods of the class access
the data structures in read-only mode, leaving the object completely unchanged.
SeqLoader is indeed designed to facilitate high-level analysis operations
(i.e., downstream of the count matrix) such as data representation or
descriptive or inferential statistical analysis, in both absolute and
differential expression contexts.
xSeries
objectsThese S3 objects are constructed from a gene expression matrix and a metadata
table related to a single study. Within the expression matrix, each row
represents a gene (or transcript), while each column refers to a different
sequencing Run, with the possible addition of one or more columns for gene
annotations (e.g., gene symbol, gene name, gene type, ...). In contrast, the
metadata table is assumed to dedicate one row for each Run and one column for
each metadata of interest, among which a mandatory field for Run accessions.
So, xSeries
objects combine information from a typical gene expression matrix
(possibly including gene annotations) with metadata about individual Runs.
More technically, an xSeries
object is a named list containing one element for
each Run of the study in question, and an additional element for gene
annotations (which are common to all Runs). Each Run element is itself a
list containing relevant metadata for the specific Run, and a gene
data
frame with raw or normalized counts for each gene. The annotation
element, on
the contrary, is a simple data frame (see tree graph below). Notice that,
Samples are not explicitly represented in xSeries
objects. However, a method
will be implemented to collapse Runs to the Samples they belong in the case
of technical replicates (see issue #4).
[!IMPORTANT] For the purpose of a data re-analysis or meta-analysis, one may not necessarily be interested in all the samples or experimental conditions included in the the original study. Rather, just a selection of them is usually considered. For this reason, recomputed count matrices could feature less columns (i.e., Runs) compared to the number of rows within the complete metadata table. By construction,
xSeries
objects are made up of all Runs for which there exist a metadata entry (i.e., usually all the Runs from the original study), but only the ones included in the loaded count matrices will also feature acounts
column in thegenes
data frame. In any case, one can always use thepruneRuns
method to easily remove all the Runs with no count data from xSeries or xModel objects, reducing them to the sole columns actually present in the count matrix used for object construction.
xModel
objectsStructurally speaking, these S3 objects are just collections (lists) of many
xSeries
related to the same biological model. Rather, the focus here is on
functions. Methods implementing generics for this class are primarily meant to
aid in meta-analytic data synthesis (see geneStats
in particular).
Here is a graph of a generic xModel made up of n xSeries. The n-th
Series consists of m Runs and the gene annotation
element. Each Run
contains a gene
dataframe for the counts
and an arbitrary number of
metadata, including the mandatory ena_run
ID.
Model : xModel
│
├──$ Series_1 : xSeries
├──$ Series_2 : xSeries
├──$ ...
└──$ Series_n : xSeries
│
├──$ Run_1 : list
├──$ Run_2 : list
├──$ Run_3 : list
├──$ ...
├──$ Run_m : list
│ │
│ ├──$ ena_sample_title : chr
│ ├──$ geo_series : chr
│ ├──$ geo_sample : chr
│ ├──$ ena_project : chr
│ ├──$ ena_sample : chr
│ ├──$ ena_run : chr
│ ├──$ read_count : int
│ ├──$ library_layout : chr
│ ├──$ extra : int
│ ├──$ ...
│ └──$ genes : data.frame
│ ├──$ IDs : chr
│ └──$ counts : num
│
└──$ annotation : data.frame
├──$ IDs : chr
├──$ SYMBOL : chr
├──$ GENENAME : chr
├──$ GENETYPE : chr
└──$ ...
A number of (hopefully reasonable) assumptions upon file and data organization are made for object construction to be successful.
CSV
or TSV
files,
containing respectively the read counts (along with possible gene
annotation) and metadata for all the Runs making up the Series._
) and either the distinctive string
CountMatrix
or meta
(ignoring case) if they are counts or metadata,
respectively. Any additional characters after this pattern are allowed, as
long as the terminal filename extension is either CSV
or TSV
(again
ignoring case). Examples of well-formed filenames are:
GSE205739_CountMatrix_genes_TPM.tsv
PRJNA847413_countMATRIX.CSV
PRJNA847413_meta.csv
SRP379268_MeTaDaTa.TSV
ena_run
is mandatory, as the only ID that is certainly unique to each FASTQ
file in the original study (i.e., Series).gene.*id|transcript.*id|ENSEMBL|ENSEMBLTRANS
(usually being gene_id
or transcript_id
).ena_run
ID in the header (along with other possible text
strings).CountMatrix
file can be either raw or normalized counts,
but in any case they are supposed to be in linear scale (not log-transformed)[!TIP] Most of these requirements are automatically met if the low-level RNA-Seq data analysis is performed using x.FASTQ.