LTLA opened this issue 6 years ago
Interesting idea; I could have a think about how to implement it. The core of scPipe was originally pure C++, which is why everything tended to be file-based. Fundamentally, what we get out of the annotation is an std::unordered_map from chromosome name to std::vector<Gene>, where Gene is a class based on std::vector<Exon>, and the exons are anonymous and flattened to be non-overlapping. So I imagine I'd just have to figure out how to transfer the data into C++ and parse it into the desired format.

The lazier solution could be to write a wrapper using tempfiles that does exactly what you described, but in the background and automatically cleaned up. I'd imagine it's much slower and has a bit more disk usage overhead, but that's potentially negligible in the context of the remainder of the pipeline.
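For concreteness, a minimal sketch of that tempfile wrapper. Everything here is illustrative: the wrapper name is made up, and only the annofn= argument of the existing file-based sc_exon_mapping() is assumed.

library(rtracklayer)

# Minimal sketch of the tempfile wrapper idea: write the in-memory GRanges to a
# temporary GFF3 file, hand the path to the existing file-based function, and
# clean up automatically afterwards.
map_with_granges <- function(anno_gr, ...) {
    annofn <- tempfile(fileext = ".gff3")
    on.exit(unlink(annofn), add = TRUE)                   # automatic clean-up
    rtracklayer::export(anno_gr, annofn, format = "gff3")
    sc_exon_mapping(annofn = annofn, ...)                 # existing file-based entry point
}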
I imagine that this would actually be fairly easy, if you're willing to take I/O out of C++. Specifically, you can rtracklayer::import a file into a GRanges, and then provide the necessary details to the C++ code (e.g., start, end, gene id) as fully fledged arguments. This has several advantages:

- scPipe no longer needs to guess the file format; that is settled when the user (or rtracklayer) creates the GRanges object. This guessing behaviour was quite frustrating when supplying a manually constructed annotation file.
- It is true that reading the annotation file into R will increase memory usage; but you already have to load the entire file into memory anyway to create your equivalent C++ structure, and any memory overhead from the annotation is likely to be of fixed size and negligible (compared to the sequencing data itself).
In my packages, I use R to do all of my I/O except for things related to heavy lifting from the BAM file. This is one considerable advantage in writing R packages versus writing standalone C++ binaries.
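A rough sketch of that division of labour, assuming a GTF whose exon rows carry gene_id; "annotation.gtf" is a placeholder path and "cpp_build_annotation" is a made-up name for the C++ entry point, not anything in scPipe.

library(rtracklayer)
library(GenomicRanges)

# rtracklayer does the parsing, R extracts plain vectors, and only those
# vectors cross into C++.
anno  <- rtracklayer::import("annotation.gtf")   # format detected from the extension
exons <- anno[anno$type == "exon"]

ans <- .Call(
    "cpp_build_annotation",            # hypothetical registered C++ routine
    as.character(seqnames(exons)),     # chromosome names
    start(exons),                      # exon starts
    end(exons),                        # exon ends
    exons$gene_id                      # gene IDs from the GTF metadata
)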
+1 for looking into dumping responsibility for file parsing onto rtracklayer.
For example, I just had a segfault when I accidentally passed sc_exon_mapping() the GENCODE GTF rather than the GFF3 file.
Yes. I didn't have a clear overall design at first. I wrote the C++ part as stand-alone command line tools and decided to build an R package later, so the design is not very R-centric. These parts are inconvenient to use from R and I should add those features in.
BTW, what are your thoughts on Rcpp conversion vs. direct SEXP manipulation? I feel like you use both.
Use Rcpp. I switched a few years ago and I haven't looked back. From my perspective:
- Rcpp shields the C++ code from new internal representations like ALTREP, such as the fact that consecutive indices or repeats of the same value are not fully instantiated in memory. While it's true that INTEGER() and other macros would probably expand these constructs into contiguous vectors, those will be operations that cause memory allocations (and thus trigger the garbage collector), which will break code that previously treated read-only macros like INTEGER as no-allocation operations.
- Rcpp takes PROTECTion out of our hands. This is worth the price of admission by itself. There are some edge cases involving RNGScope, but it otherwise works as expected.
- Functions exported via Rcpp attributes are wrapped in BEGIN_RCPP and END_RCPP, which means that you can safely throw in C++ to trigger an R-level error without segmentation faults.

There's probably a whole bunch of other things I've taken for granted. Long story short, for R package development, Rcpp has definitely been a plus. Or a plus-plus, so to speak.
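As a self-contained illustration of the exception point (not scPipe code): Rcpp::cppFunction() compiles through the same attribute machinery, so a C++ throw comes back as a normal R error.

library(Rcpp)

# The compiled wrapper is enclosed in BEGIN_RCPP/END_RCPP, so the exception
# thrown below is translated into an R error instead of crashing the session.
cppFunction(includes = "#include <stdexcept>", code = '
int checked_sum(Rcpp::IntegerVector x) {
    if (x.size() == 0) {
        throw std::runtime_error("empty input vector");
    }
    int total = 0;
    for (R_xlen_t i = 0; i < x.size(); ++i) {
        total += x[i];
    }
    return total;
}
')

checked_sum(1:5)               # 15
try(checked_sum(integer(0)))   # surfaces as an ordinary R error, no segfault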
Had a look using rtracklayer::import(); unfortunately the issues that made me write the source-specific code still persist, though it may alleviate some complexity with regards to GFF3/GTF/BED.
> ensembl_anno[17:22, ]
GRanges object with 6 ranges and 24 metadata columns:
seqnames ranges strand | source type score
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric>
[1] 1 11869-14409 + | havana pseudogene <NA>
[2] 1 11869-14409 + | havana lnc_RNA <NA>
[3] 1 11869-12227 + | havana exon <NA>
[4] 1 12613-12721 + | havana exon <NA>
[5] 1 13221-14409 + | havana exon <NA>
[6] 1 12010-13670 + | havana pseudogenic_transcript <NA>
phase ID Alias external_name logic_name
<integer> <character> <CharacterList> <character> <character>
[1] <NA> gene:ENSG00000223972 <NA> <NA> havana
[2] <NA> transcript:ENST00000456328 <NA> <NA> <NA>
[3] <NA> <NA> <NA> <NA> <NA>
[4] <NA> <NA> <NA> <NA> <NA>
[5] <NA> <NA> <NA> <NA> <NA>
[6] <NA> transcript:ENST00000450305 <NA> <NA> <NA>
gene_id version Parent tag
<character> <character> <CharacterList> <character>
[1] ENSG00000223972 5 <NA> <NA>
[2] <NA> 2 gene:ENSG00000223972 basic
[3] <NA> 1 transcript:ENST00000456328 <NA>
[4] <NA> 1 transcript:ENST00000456328 <NA>
[5] <NA> 1 transcript:ENST00000456328 <NA>
[6] <NA> 2 gene:ENSG00000223972 basic
> head(gencode_anno)
GRanges object with 6 ranges and 22 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] chr1 11869-14409 + | HAVANA gene <NA> <NA>
[2] chr1 11869-14409 + | HAVANA transcript <NA> <NA>
[3] chr1 11869-12227 + | HAVANA exon <NA> <NA>
[4] chr1 12613-12721 + | HAVANA exon <NA> <NA>
[5] chr1 13221-14409 + | HAVANA exon <NA> <NA>
[6] chr1 12010-13670 + | HAVANA transcript <NA> <NA>
ID gene_id gene_type
<character> <character> <character>
[1] ENSG00000223972.5 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[2] ENST00000456328.2 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[3] exon:ENST00000456328.2:1 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[4] exon:ENST00000456328.2:2 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[5] exon:ENST00000456328.2:3 ENSG00000223972.5 transcribed_unprocessed_pseudogene
[6] ENST00000450305.2 ENSG00000223972.5 transcribed_unprocessed_pseudogene
> head(refseq_anno)
GRanges object with 6 ranges and 96 metadata columns:
seqnames ranges strand | source type score phase
<Rle> <IRanges> <Rle> | <factor> <factor> <numeric> <integer>
[1] NC_000001.11 1-248956422 + | RefSeq region <NA> <NA>
[2] NC_000001.11 11874-14409 + | BestRefSeq pseudogene <NA> <NA>
[3] NC_000001.11 11874-14409 + | BestRefSeq transcript <NA> <NA>
[4] NC_000001.11 11874-12227 + | BestRefSeq exon <NA> <NA>
[5] NC_000001.11 12613-12721 + | BestRefSeq exon <NA> <NA>
[6] NC_000001.11 13221-14409 + | BestRefSeq exon <NA> <NA>
ID Dbxref Name
<character> <CharacterList> <character>
[1] id0 taxon:9606 1
[2] gene0 GeneID:100287102,HGNC:HGNC:37102 DDX11L1
[3] rna0 GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102 NR_046018.2
[4] id1 GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102 <NA>
[5] id2 GeneID:100287102,Genbank:NR_046018.2,HGNC:HGNC:37102 <NA>
Many columns are hidden for brevity, but take my word that they don't contain anything useful for our purposes. Primarily our goal is to have an exon-gene relationship where the gene is defined by the appropriate ID type. For GENCODE this is trivial, as each entry of type exon has a gene_id column. For RefSeq this information is inside the Dbxref column, which requires additional trimming to reduce it down to the appropriate ID. For Ensembl this is even worse, because you actually have to trace the parentage from the exon up to the transcript and then find the parent of the transcript. I don't have any experience with UCSC annotations, so those aren't even properly supported.

If you know of any easy way out of this mess I'd love to hear it. Otherwise the plan would be to write the annotation-source-specific parsing code in R to obtain a 4-column data frame of chr, start, end, gene_id and export the 4 vectors into C++ to generate the chr-gene-exon information.
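To make the Ensembl case concrete, a rough sketch of that parent tracing under the plan above. The file path is a placeholder, and the transcript-level types listed are only the ones visible in the excerpt; a real annotation has more.

library(rtracklayer)
library(GenomicRanges)

anno <- rtracklayer::import("Homo_sapiens.GRCh38.gff3.gz")

exons <- anno[anno$type == "exon"]
tx    <- anno[anno$type %in% c("mRNA", "transcript", "lnc_RNA", "pseudogenic_transcript")]

# transcript ID -> gene ID, stripping the "gene:" prefix from the Parent field
tx_to_gene <- sub("^gene:", "", unlist(tx$Parent))
names(tx_to_gene) <- tx$ID

# assumes one Parent per exon, as in the excerpt above
exon_tx <- unlist(exons$Parent)

anno_df <- data.frame(
    chr     = as.character(seqnames(exons)),
    start   = start(exons),
    end     = end(exons),
    gene_id = unname(tx_to_gene[exon_tx]),
    stringsAsFactors = FALSE
)
# anno_df holds the four vectors that would then be exported into C++.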
I never have this problem when I'm using GTF files for featureCounts. Every row is an exon and contains a gene_id, and that's it. I've been getting these GTF files straight from Ensembl, so it's not like I had to do a lot of pre-processing to get them to play nice with featureCounts.
If you must take some parent-based format as input, I'd suggest breaking up the logic as follows:

- sc_exon_mapping takes a GRanges of exon intervals with gene_id information in the metadata. 4 vectors are then passed into C++ for internal processing. The key is to make this function format-agnostic. (You could even supply an additional argument specifying which metadata column contains gene IDs, in case they're called ID or something non-standard.)
- Separate converter functions take each annotation format and produce that GRanges (see the sketch below). For GTF and BED files this is trivial: just import and you're done. Perhaps some filtering is required on the exon keyword, but this is still only a one-liner. For GFF3 files, some more work is probably required, but in any case, you want to separate this from the job of the exon mapping itself.

So yes, your plan seems appropriate, though it seems like you chose a difficult format to start with.
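A sketch of that split. The function names are illustrative, not part of scPipe's API: each format gets a thin converter returning a GRanges of exons, and the format-agnostic core pulls gene IDs from whichever metadata column the caller names before passing plain vectors to C++, as sketched earlier.

library(rtracklayer)
library(S4Vectors)

# GTF converter: essentially a one-liner, since exon rows already carry gene_id.
gtf_to_exon_granges <- function(path) {
    anno <- rtracklayer::import(path)   # GTF: import and you're done
    anno[anno$type == "exon"]           # filter on the exon keyword
}

# Format-agnostic gene-ID extraction, with the column name as an argument for
# annotations where it isn't literally "gene_id".
extract_gene_ids <- function(exon_gr, gene_id_col = "gene_id") {
    ids <- mcols(exon_gr)[[gene_id_col]]
    if (is.null(ids)) stop("no '", gene_id_col, "' column in the annotation")
    as.character(ids)
}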
Some support incoming: https://github.com/Shians/scPipe/commit/02517ab46a09efae6a48c365dee7c5ca842d5da6
So the goal is to generate a SAF-style data.table (http://bioinf.wehi.edu.au/featureCounts/). I've written up helper functions to import the different annotation formats; the major upside of using rtracklayer::import() is that I get handling of GFF/GTF, as well as gzipped files, for free.
So the strategy is to extract all entries with type == "exon", with their gene_id column as well as start, end and strand. For Ensembl GFF3 and RefSeq GFF3 the exon entries have no gene_id column, so helpers have been implemented to fill in that information.
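For reference, a minimal version of that SAF construction (column names as used by featureCounts), assuming exons is already the GRanges of type == "exon" entries with gene_id filled in by the helpers.

library(GenomicRanges)

# Build the SAF-style table (GeneID/Chr/Start/End/Strand) from an exon-level
# GRanges whose gene_id column has already been filled in.
granges_to_saf <- function(exons) {
    data.frame(
        GeneID = as.character(exons$gene_id),
        Chr    = as.character(seqnames(exons)),
        Start  = start(exons),
        End    = end(exons),
        Strand = as.character(strand(exons)),
        stringsAsFactors = FALSE
    )
}
# The resulting table is what gets handed to C++ as an Rcpp::DataFrame.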
On the C++ side, code has been written to accept the SAF table as an Rcpp::DataFrame and construct the necessary annotation object. What is left to do is:
In my branch it's now possible to use a single GRanges object as annotation. I don't consider the feature quite complete yet.
Additional work:
- rtracklayer::import
- I may also need to look into GRangesList and greater integration with TxDb (which I have no experience with); a sketch follows below
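On the TxDb point, one possible route would be to lean on GenomicFeatures, which already groups exons by gene; this is only a sketch (placeholder path, nothing scPipe-specific), but it sidesteps the per-format gene_id handling entirely.

library(GenomicFeatures)

# Build a TxDb from any GFF/GTF and let it do the gene grouping.
txdb    <- makeTxDbFromGFF("annotation.gtf")
by_gene <- exonsBy(txdb, by = "gene")   # GRangesList, one element per gene

exons <- unlist(by_gene)                # flat GRanges; names() are the gene IDs
exons$gene_id <- names(exons)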
Using sc_exon_mapping as an example: it would be convenient to be able to supply a GRanges object directly as annofn=, especially in situations where some editing of the annotation file is necessary, e.g., to filter out undesirable features or to add custom features. Currently I've been loading files in, editing the resulting GRanges, then saving them back to a new GFF3 file via rtracklayer; I'd like to skip the last step.

The same request applies to sc_demultiplex and sc_gene_counting, which seem to take the path of the annotation file but not a data.frame or DataFrame or something in-memory. It is often the case that I need to load such files in anyway because they are not exactly formatted as required for these functions.