LTLA opened this issue 6 years ago
Interesting idea; I could have a think about how to implement it. The core of scPipe was originally pure C++, which is why everything tended to be file-based. Fundamentally, what we get out of the annotation is an std::unordered_map from chromosome name to std::vector<Gene>, where Gene is a class based on std::vector<Exon>, and the exons are anonymous and flattened to be non-overlapping. So I imagine I'd just have to figure out how to transfer the data into C++ and parse it into the desired format.

The lazier solution could be to write a wrapper using tempfiles that does exactly what you described, but in the background and automatically cleaned up. I'd imagine it's much slower and has a bit more disk usage overhead, but that's potentially negligible in the context of the remainder of the pipeline.
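For concreteness, a minimal sketch of that tempfile wrapper. Everything here is illustrative: the wrapper name is made up, and only the annofn= argument of the existing file-based sc_exon_mapping() is assumed.

library(rtracklayer)

# Minimal sketch of the tempfile wrapper idea: write the in-memory GRanges to a
# temporary GFF3 file, hand the path to the existing file-based function, and
# clean up automatically afterwards.
map_with_granges <- function(anno_gr, ...) {
    annofn <- tempfile(fileext = ".gff3")
    on.exit(unlink(annofn), add = TRUE)                   # automatic clean-up
    rtracklayer::export(anno_gr, annofn, format = "gff3")
    sc_exon_mapping(annofn = annofn, ...)                 # existing file-based entry point
}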
I imagine that this would actually be fairly easy, if you're willing to take I/O out of C++. Specifically, you can rtracklayer::import a file into a GRanges, and then provide the necessary details to the C++ code (e.g., start, end, gene id) as fully fledged arguments. This has several advantages:

- scPipe no longer needs to guess the file format; that is settled when the user (or rtracklayer) creates the GRanges object. This guessing behaviour was quite frustrating when supplying a manually constructed annotation file.
- It is true that reading the annotation file into R will increase memory usage; but you already have to load the entire file into memory anyway to create your equivalent C++ structure, and any memory overhead from the annotation is likely to be of fixed size and negligible (compared to the sequencing data itself).
In my packages, I use R to do all of my I/O except for things related to heavy lifting from the BAM file. This is one considerable advantage in writing R packages versus writing standalone C++ binaries.
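A rough sketch of that division of labour, assuming a GTF whose exon rows carry gene_id; "annotation.gtf" is a placeholder path and "cpp_build_annotation" is a made-up name for the C++ entry point, not anything in scPipe.

library(rtracklayer)
library(GenomicRanges)

# rtracklayer does the parsing, R extracts plain vectors, and only those
# vectors cross into C++.
anno  <- rtracklayer::import("annotation.gtf")   # format detected from the extension
exons <- anno[anno$type == "exon"]

ans <- .Call(
    "cpp_build_annotation",            # hypothetical registered C++ routine
    as.character(seqnames(exons)),     # chromosome names
    start(exons),                      # exon starts
    end(exons),                        # exon ends
    exons$gene_id                      # gene IDs from the GTF metadata
)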
+1 for looking into dumping responsibility for file parsing onto rtracklayer.
For example, I just had a segfault when I accidentally passed sc_exon_mapping() the GENCODE GTF rather than the GFF3 file.
Yes. I didn't have a clear overall design at first. I wrote the C++ part as stand-alone command line tools and decided to build an R package later, so the design is not very R-centric. These parts are inconvenient to use from R and I should add those features in.
BTW, what are your thoughts on Rcpp conversion vs. direct SEXP manipulation? I feel like you use both.
Use Rcpp. I switched a few years ago and I haven't looked back. From my perspective:
- Rcpp shields the C++ code from new internal representations like ALTREP, such as the fact that consecutive indices or repeats of the same value are not fully instantiated in memory. While it's true that INTEGER() and other macros would probably expand these constructs into contiguous vectors, those will be operations that cause memory allocations (and thus trigger the garbage collector), which will break code that previously treated read-only macros like INTEGER as no-allocation operations.
- Rcpp takes PROTECTion out of our hands. This is worth the price of admission by itself. There are some edge cases involving RNGScope, but it otherwise works as expected.
- Functions exported via Rcpp attributes are wrapped in BEGIN_RCPP and END_RCPP, which means that you can safely throw in C++ to trigger an R-level error without segmentation faults.

There's probably a whole bunch of other things I've taken for granted. Long story short, for R package development, Rcpp has definitely been a plus. Or a plus-plus, so to speak.
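As a self-contained illustration of the exception point (not scPipe code): Rcpp::cppFunction() compiles through the same attribute machinery, so a C++ throw comes back as a normal R error.

library(Rcpp)

# The compiled wrapper is enclosed in BEGIN_RCPP/END_RCPP, so the exception
# thrown below is translated into an R error instead of crashing the session.
cppFunction(includes = "#include <stdexcept>", code = '
int checked_sum(Rcpp::IntegerVector x) {
    if (x.size() == 0) {
        throw std::runtime_error("empty input vector");
    }
    int total = 0;
    for (R_xlen_t i = 0; i < x.size(); ++i) {
        total += x[i];
    }
    return total;
}
')

checked_sum(1:5)               # 15
try(checked_sum(integer(0)))   # surfaces as an ordinary R error, no segfault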
Had a look using rtracklayer::import(); unfortunately the issues that made me write the source-specific code still persist, though it may alleviate some complexity with regards to GFF3/GTF/BED.
> ensembl_anno[17:22, ]
GRanges object with 6 ranges and 24 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 1 11869-14409 + | havana pseudogene <NA>
[2] 1 11869-14409 + | havana lnc_RNA <NA>
[3] 1 11869-12227 + | havana exon <NA>
[4] 1 12613-12721 + | havana exon <NA>
[5] 1 13221-14409 + | havana exon <NA>
[6] 1 12010-13670 + | havana pseudogenic_transcript <NA>
phase ID Alias external_name logic_name
<integer> <character> <CharacterList> <character> <character>
[1] <NA> gene:ENSG00000223972 <NA> <NA> havana
[2] <NA> transcript:ENST00000456328 <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA> <NA>
[5] <NA> <NA> <NA> <NA> <NA>
[6] <NA> transcript:ENST00000450305 <NA> <NA> <NA>
gene_id version Parent tag
<character> <character> <CharacterList> <character>
[1] ENSG00000223972 5 <NA> <NA>
[2] <NA> 2 gene:ENSG00000223972 basic
[3] <NA> 1 transcript:ENST00000456328 <NA>
[4] <NA> 1 transcript:ENST00000456328 <NA>
[5] <NA> 1 transcript:ENST00000456328 <NA>
[6] <NA> 2 gene:ENSG00000223972 basic
> head(gencode_anno)
GRanges object with 6 ranges and 22 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] chr1 11869-14409 + | HAVANA gene <NA> <NA>
[2] chr1 11869-14409 + | HAVANA transcript <NA> <NA>
[3] chr1 11869-12227 + | HAVANA exon <NA> <NA>
[4] chr1 12613-12721 + | HAVANA exon <NA> <NA>
[5] chr1 13221-14409 + | HAVANA exon <NA> <NA>
[6] chr1 12010-13670 + | HAVANA transcript <NA> <NA>
ID gene_id gene_type
<character> <character> <character>
[1] ENSG00000223972.5 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[2] ENST00000456328.2 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[3] exon:ENST00000456328.2:1 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[4] exon:ENST00000456328.2:2 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[5] exon:ENST00000456328.2:3 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[6] ENST00000450305.2 ENSG00000223972.5 transcribed_unprocessed_pseudogene
> head(refseq_anno)
GRanges object with 6 ranges and 96 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] NC_000001.11 1-248956422 + | RefSeq region <NA> <NA>
[2] NC_000001.11 11874-14409 + | BestRefSeq pseudogene <NA> <NA>
[3] NC_000001.11 11874-14409 + | BestRefSeq transcript <NA> <NA>
[4] NC_000001.11 11874-12227 + | BestRefSeq exon <NA> <NA>
[5] NC_000001.11 12613-12721 + | BestRefSeq exon <NA> <NA>
[6] NC_000001.11 13221-14409 + | BestRefSeq exon <NA> <NA>
ID Dbxref Name
<character> <CharacterList> <character>
[1] id0 taxon:9606 1
[2] gene0 GeneID:100287102,HGNC:HGNC:37102 DDX11L1
[3] rna0 GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102 NR_046018.2
[4] id1 GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102 <NA>
[5] id2 GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102 <NA>
Many columns are hidden for brevity, but take my word that they don't contain anything useful for our purposes. Primarily our goal is to have an exon-gene relationship where the gene is defined by the appropriate ID type. For GENCODE this is trivial, as each entry of type exon has a gene_id column. For RefSeq this information is inside the Dbxref column, which requires additional trimming to reduce it down to the appropriate ID. For Ensembl this is even worse, because you actually have to trace the parentage from the exon up to the transcript and then find the parent of the transcript. I don't have any experience with UCSC annotations, so those aren't even properly supported.

If you know of any easy way out of this mess I'd love to hear it. Otherwise the plan would be to write the annotation-source-specific parsing code in R to obtain a 4-column data frame of chr, start, end, gene_id and export the 4 vectors into C++ to generate the chr-gene-exon information.
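To make the Ensembl case concrete, a rough sketch of that parent tracing under the plan above. The file path is a placeholder, and the transcript-level types listed are only the ones visible in the excerpt; a real annotation has more.

library(rtracklayer)
library(GenomicRanges)

anno <- rtracklayer::import("Homo_sapiens.GRCh38.gff3.gz")

exons <- anno[anno$type == "exon"]
tx    <- anno[anno$type %in% c("mRNA", "transcript", "lnc_RNA", "pseudogenic_transcript")]

# transcript ID -> gene ID, stripping the "gene:" prefix from the Parent field
tx_to_gene <- sub("^gene:", "", unlist(tx$Parent))
names(tx_to_gene) <- tx$ID

# assumes one Parent per exon, as in the excerpt above
exon_tx <- unlist(exons$Parent)

anno_df <- data.frame(
    chr     = as.character(seqnames(exons)),
    start   = start(exons),
    end     = end(exons),
    gene_id = unname(tx_to_gene[exon_tx]),
    stringsAsFactors = FALSE
)
# anno_df holds the four vectors that would then be exported into C++.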
I never have this problem when I'm using GTF files for featureCounts. Every row is an exon and contains a gene_id, and that's it. I've been getting these GTF files straight from Ensembl, so it's not like I had to do a lot of pre-processing to get them to play nice with featureCounts.
If you must take some parent-based format as input, I'd suggest breaking up the logic as follows:

- sc_exon_mapping takes a GRanges of exon intervals with gene_id information in the metadata. 4 vectors are then passed into C++ for internal processing. The key is to make this function format-agnostic. (You could even supply an additional argument specifying which metadata column contains gene IDs, in case they're called ID or something non-standard.)
- Separate converter functions take each annotation format and produce that GRanges (see the sketch below). For GTF and BED files this is trivial: just import and you're done. Perhaps some filtering is required on the exon keyword, but this is still only a one-liner. For GFF3 files, some more work is probably required, but in any case, you want to separate this from the job of the exon mapping itself.

So yes, your plan seems appropriate, though it seems like you chose a difficult format to start with.
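A sketch of that split. The function names are illustrative, not part of scPipe's API: each format gets a thin converter returning a GRanges of exons, and the format-agnostic core pulls gene IDs from whichever metadata column the caller names before passing plain vectors to C++, as sketched earlier.

library(rtracklayer)
library(S4Vectors)

# GTF converter: essentially a one-liner, since exon rows already carry gene_id.
gtf_to_exon_granges <- function(path) {
    anno <- rtracklayer::import(path)   # GTF: import and you're done
    anno[anno$type == "exon"]           # filter on the exon keyword
}

# Format-agnostic gene-ID extraction, with the column name as an argument for
# annotations where it isn't literally "gene_id".
extract_gene_ids <- function(exon_gr, gene_id_col = "gene_id") {
    ids <- mcols(exon_gr)[[gene_id_col]]
    if (is.null(ids)) stop("no '", gene_id_col, "' column in the annotation")
    as.character(ids)
}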
Some support incoming: https://github.com/Shians/scPipe/commit/02517ab46a09efae6a48c365dee7c5ca842d5da6
So the goal is to generate a SAF-style data.table (http://bioinf.wehi.edu.au/featureCounts/). I've written up helper functions to import the different annotation formats; the major upside of using rtracklayer::import() is that I get handling of GFF/GTF, as well as gzipped files, for free.
So the strategy is to extract all entries with type == "exon", with their gene_id column as well as start, end and strand. For Ensembl GFF3 and RefSeq GFF3 the exon entries have no gene_id column, so helpers have been implemented to fill in that information.
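For reference, a minimal version of that SAF construction (column names as used by featureCounts), assuming exons is already the GRanges of type == "exon" entries with gene_id filled in by the helpers.

library(GenomicRanges)

# Build the SAF-style table (GeneID/Chr/Start/End/Strand) from an exon-level
# GRanges whose gene_id column has already been filled in.
granges_to_saf <- function(exons) {
    data.frame(
        GeneID = as.character(exons$gene_id),
        Chr    = as.character(seqnames(exons)),
        Start  = start(exons),
        End    = end(exons),
        Strand = as.character(strand(exons)),
        stringsAsFactors = FALSE
    )
}
# The resulting table is what gets handed to C++ as an Rcpp::DataFrame.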
On the C++ side, code has been written to accept the SAF table as an Rcpp::DataFrame and construct the necessary annotation object. What is left to do is:
In my branch it's now possible to use a single GRanges object as annotation. I don't consider the feature quite complete yet.
Additional work:
- rtracklayer::import
- I may also need to look into GRangesList and greater integration with TxDb (which I have no experience with); a sketch follows below
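On the TxDb point, one possible route would be to lean on GenomicFeatures, which already groups exons by gene; this is only a sketch (placeholder path, nothing scPipe-specific), but it sidesteps the per-format gene_id handling entirely.

library(GenomicFeatures)

# Build a TxDb from any GFF/GTF and let it do the gene grouping.
txdb    <- makeTxDbFromGFF("annotation.gtf")
by_gene <- exonsBy(txdb, by = "gene")   # GRangesList, one element per gene

exons <- unlist(by_gene)                # flat GRanges; names() are the gene IDs
exons$gene_id <- names(exons)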
Using sc_exon_mapping as an example: it would be convenient to be able to supply a GRanges object directly as annofn=, especially in situations where some editing of the annotation file is necessary, e.g., to filter out undesirable features or to add custom features. Currently I've been loading files in, editing the resulting GRanges, then saving them back to a new GFF3 file via rtracklayer; I'd like to skip the last step.

The same request applies to sc_demultiplex and sc_gene_counting, which seem to take the path of the annotation file but not a data.frame or DataFrame or something in-memory. It is often the case that I need to load such files in anyway because they are not exactly formatted as required for these functions.