Bioconductor / Contributions

Contribute Packages to Bioconductor

gread #25

Closed asrinivasan-oa closed 7 years ago

asrinivasan-oa commented 8 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 8 years ago

Hi @asrinivasan-oa

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: gread
Type: Package
Title: Fast Reading and Processing of Common Gene Annotation and Next 
        Generation Sequencing Format Files
Version: 0.99.0
Authors@R: c(
    person("Arunkumar", "Srinivasan", , "asrinivasan@openanalytics.eu", 
    c("aut", "cre")), person("Open Analytics", role = "cph"))
Maintainer: Arunkumar Srinivasan <asrinivasan@openanalytics.eu>
Depends:
    R (>= 3.3)
Imports:
    stats,
    utils,
    methods,
    tools,
    data.table (>= 1.9.6),
    stringi,
    S4Vectors,
    Rsamtools,
    rtracklayer,
    IRanges,
    GenomicRanges,
    GenomicFeatures,
    GenomicAlignments
Suggests:
    knitr
Description: It allows for fast reading and processing of genome read and
    annotation files such as GTF/GFF/BED/BAM. In addition, it also provides
    functions for advanced manipulations such as filling missing intron 
    coordinates, extracting non-overlapping intron coordinates, extracting 
    overlapping genes etc.
License: GPL-3
LazyData: true
VignetteBuilder: knitr
RoxygenNote: 5.0.1
biocViews: Annotation, DataImport, Sequencing, RNASeq, Software
bioc-issue-bot commented 8 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/gread_buildreport_20160620113428.html

asrinivasan-oa commented 8 years ago

The checks seem to run on BiocCheck v1.9.1. I'm on v1.9.2, where the check requiring me to be registered on bioc-devel doesn't come up. Should I switch to v1.9.1 as well?

Here's the relevant output (from v1.9.2 on my system):

* Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source,
  and vignette source...
    * CONSIDER: Shortening lines; 3 lines (0%) are > 80 characters
      long.
    * CONSIDER: Indenting lines with a multiple of 4 spaces; 92 lines
      (2%) are not.
  See http://bioconductor.org/developers/how-to/coding-style/
* Checking for canned comments in man pages...
* Checking if package already exists in CRAN...
* Checking for support site registration...
* Maintainer is registered at support site!

Summary:
REQUIRED count: 0
RECOMMENDED count: 0
CONSIDERATION count: 3

I've registered on bioc-devel and have bumped the version to run checks again.

bioc-issue-bot commented 8 years ago

Received a valid push; starting a build. Commits are:

815c058 Version bump to run checks again (after registerin...

dtenenba commented 8 years ago

See the BiocCheck vignette, where it says that this check is only enabled on our build machines because it requires special authorization to look up list membership.

bioc-issue-bot commented 8 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/gread_buildreport_20160620140939.html

hpages commented 8 years ago

Hi Arun,

Thanks for submitting the gread package. data.table is a very popular and powerful package that is already used behind the scenes by numerous Bioconductor packages to speed up many operations.

gread goes one step further though by defining and exposing data.table-derived data structures that are redundant with Bioconductor core classes. Interoperability is a key feature in Bioconductor. It is achieved by adoption of common data structures like GRanges or GAlignments across packages. This standardization of data structures has greatly contributed to the success of the project. Hundreds of Bioconductor packages use these data structures these days for their input and/or output and for all kinds of analyses (including visualization). Therefore we discourage the introduction of new data structures that are redundant with the "standard" data structures, because they hurt interoperability as well as usability of the Bioconductor ecosystem in the long run.

I'll refer to our package guidelines and ask you to "Document data structures used and, if different from data structures used by similar packages, explain why a different data structure was used."

Other minor issues:

1) Please consider moving some of the packages currently in Imports to Depends. Right now, when I call as_granges() on a gread object, I end up with a GRanges object that seems broken (nothing works on it, not even the basic accessors seqnames(), start(), end(), etc...)

2) Note that in Bioconductor, 'ignore.strand' is used consistently (almost) everywhere instead of 'ignore_strand'.

3) There is actually no need for the new as_granges() verb: this is exactly what the granges() generic in the GenomicRanges package is intended for. But because you use S3 classes and dispatch, you would need to define the following default S4 method:

setMethod("granges", "ANY",
    function(x, use.names=TRUE, use.mcols=FALSE, ...) UseMethod("granges")
)

before you can define your own granges.data.table S3 method.
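To make the dispatch pattern concrete, here is a hypothetical sketch of an S3 method that the default "ANY" S4 method above would hand off to; the class name `gtf` and the column names `seqname`/`start`/`end`/`strand` are assumptions about the gread object layout, not taken from the package:

```r
## Hypothetical sketch: an S3 method picked up via the "ANY" S4 default.
## Assumes the object is a data.table-derived S3 class "gtf" with
## seqname/start/end/strand columns.
granges.gtf <- function(x, use.names=TRUE, use.mcols=FALSE, ...) {
    GenomicRanges::GRanges(
        seqnames = x$seqname,
        ranges   = IRanges::IRanges(start=x$start, end=x$end),
        strand   = x$strand)
}
```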

4) Note that the broom package (CRAN) defines a tidy() generic and methods and the biobroom package (Bioconductor) extends the capabilities of the broom package by defining tidy() methods for Bioconductor objects. gread::tidy() generic conflicts with broom::tidy().

5) Use consistent case for the class names e.g. in the documentation for tidy():

A tidied object of class ‘GTF’, ‘GFF’, ‘BED’ or ‘BAM’, ...

but the real class names are lower case.

6) Use mcols() instead of elementMetadata() in your code.

7) Use <- instead of = for assignment, at least in the examples and vignette.

Don't hesitate to ask on the bioc-devel mailing list if you have concerns or questions about this.

Thanks, H.

asrinivasan-oa commented 8 years ago

Hi Hervé, thanks for the super fast feedback!

Regarding the major issue (on interoperability): makes sense.

As long as it's possible to inherit from S4 objects and dispatch S3 methods on them, that's totally fine with me. To be clear, I don't mind having a "GRanges" object, but I'd like to be able to differentiate whether it's a "bam" file or a "gtf" file that's been loaded. I'll try and see if it is possible. I'm really not a fan of S4.

I just don't want to remember many packages and many functions and arguments to load different formats (more formats might be added in the future).

This is one of three packages (gread, gcount, ganalyse) in the pipeline. The other two import/suggest this package; I've therefore not submitted the other two as separate issues (which is what the guidelines recommend, AFAICT). The purpose of these packages is to provide a quick way to perform quite standard RNASeq analyses (load gtf/bam files, obtain counts, perform DGE using edgeR/limma). In essence, these are mostly wrappers for making common analyses easier.

Minor issues:

  1. I don't quite understand why it should be moved to Depends, since the idea behind the function is for the user to get a GRanges object for any further manipulation. But that's fine.
  2. Thanks! That helps.

4-7 Okay, will change/fix.

Thanks again. I'll try to commit with the feedback incorporated ASAP.

hpages commented 7 years ago

Hi Arun,

Yes you can inherit from S4 objects and dispatch S3 methods on them.

When you say:

To be clear, I don't mind having a "GRanges" object. But I'd like to be able to differentiate as to whether it's a "bam" file that's been loaded or a "gtf" file.

You mean you want to be able to differentiate as to whether the object represents read alignments or a gene model? IMO this is more important than the format of the file where the data is coming from. Furthermore, the format of the file is not always enough to know what the data is about. For example a GFF file can contain many things, not just genes, transcripts, exons and CDS. It can contain alignments. A BAM file can contain single-end or paired-end alignments. FWIW the Bioconductor way is to have standard containers for these things: GAlignments and GAlignmentPairs for single- and paired-end reads, and TxDb objects for gene models. Once the data is loaded in these containers, it doesn't matter what was the format of the file where the data was loaded from.
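The standard containers mentioned above can be populated directly; a minimal sketch using the GenomicAlignments and GenomicFeatures APIs (the file names are placeholders):

```r
library(GenomicAlignments)   # readGAlignments(), readGAlignmentPairs()
library(GenomicFeatures)     # makeTxDbFromGFF(), exonsBy()

## Aligned reads: GAlignments (single-end) or GAlignmentPairs (paired-end)
gal  <- readGAlignments("reads.bam")
galp <- readGAlignmentPairs("reads.bam")

## Gene model: a TxDb, regardless of whether the source file was GTF or GFF3
txdb <- makeTxDbFromGFF("genes.gtf", format="gtf")
ex_by_gene <- exonsBy(txdb, by="gene")
```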

When you say:

I just don't want to remember many packages and many functions and arguments to load different formats (more formats might be added in the future).

Nobody wants to remember these things. That's why we have import() in the rtracklayer package.

About 1): it's all about user-friendliness. As I said, when I call as_granges() on a gread object, I end up with a GRanges object that seems broken (nothing works on it, not even the basic accessors seqnames(), start(), end(), etc.). For start() and end(), it's even worse than that: they work but return garbage. Having GenomicRanges in Depends would address that. I'm not saying you should do that; maybe there are other ways to address this. All I'm saying is that something can and should be done to improve user-friendliness. Imagine, for example, a package that provides a function that returns a data.table object on which $ doesn't work. Most users would be confused and would probably think the software is broken (and some of them won't even bother to ask on our support site, they'll just stop using the software).

Hope this makes sense.

Thanks for working on these changes, H.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

a240f18 Use <- in examples.
5ecd01f Use mcols() instead of elementMetadata().
55c3922 lower cases for gtf/gff/bed/bam everywhere.
eac8ba3 change tidy() to tidy_cols().
6d9ea5d Update vignette with tidy_cols().
94b397f All inputs and outputs of exported functions are R...
87256fb Update docs.
2a417a7 Document new features.
46d302c Please R CMD check.
6c062ec Update Doc.
0896096 Don't run examples for as_granges.
748f34f Add as_data_table to tests.
c5383ac Update vignette.
a8908dc Remove notes on tidy argument in vignette.
2f177f2 Update NEWS by removing notes on 'tidy()'.
74eca89 Bump version.

asrinivasan-oa commented 7 years ago

Interoperability is a key feature in Bioconductor. ...

All exported functions now output objects of class GRanges. Please let me know if there are better ways to handle the dispatch of S3 methods on S4 objects.

1) Please consider moving some of the packages currently in Imports to Depends. Right now, when I call as_granges() on a gread object, I end up with a GRanges object that seems broken (nothing works on it, not even the basic accessors seqnames(), start(), end(), etc...)

The GenomicRanges package is in Depends now. as_granges() is no longer exported.

2) Note that in Bioconductor, 'ignore.strand' is used consistently (almost) everywhere instead of 'ignore_strand'.

For consistency, I use _ throughout. But since the function isn't exported, this should be fine, I suppose...

4) Note that the broom package (CRAN) defines a tidy() generic and methods and the biobroom package (Bioconductor) extends the capabilities of the broom package by defining tidy() methods for Bioconductor objects. gread::tidy() generic conflicts with broom::tidy().

Removed the tidy() function altogether. It wasn't that important a feature.

5) Use consistent case for the class names e.g. in the documentation for tidy(): A tidied object of class ‘GTF’, ‘GFF’, ‘BED’ or ‘BAM’, ... but the real class names are lower case.

Thanks for spotting this. Fixed.

6) Use mcols() instead of elementMetadata() in your code.

Done.

7) Use <- instead of = for assignment, at least in the examples and vignette.

Done.


Thanks once again for the feedback. I think I've made almost all of the suggested changes. Please see replies inline for your recent reply.

You mean you want to be able to differentiate as to whether the object represents read alignments or a gene model? IMO this is more important than the format of the file where the data is coming from. Furthermore, the format of the file is not always enough to know what the data is about. For example a GFF file can contain many things, not just genes, transcripts, exons and CDS. It can contain alignments.

The reason I dispatch on the input file format is that each format needs its own handling: for example, BED files need their start coordinates incremented by 1, the attributes column of GFF/GTF files needs to be parsed appropriately, and BAM files have custom flags that can be loaded (which I do) for custom filtering.

In addition, some GTF/GFF files are so badly formatted that it's quite hard to load them efficiently. In such cases, I fall back to read.table and clean up the attributes column myself. I figured it's better to dispatch based on file format.

A BAM file can contain single-end or paired-end alignments. FWIW the Bioconductor way is to have standard containers for these things: GAlignments and GAlignmentPairs for single- and paired-end reads.

What do you do with reads for which only one mate of the pair is mapped? Do you discard them? I like to keep the data in GRanges objects, since that gives me the flexibility to do my own counting (or to modify them as necessary).

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/gread_buildreport_20160725102017.html

asrinivasan-oa commented 7 years ago

Hello, are there any updates as to whether these changes are alright?

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version: 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

hpages commented 7 years ago

Hi Arun,

I'm sorry for my slow response. The changes are alright. I'm still not sure why you need to introduce new classes (bam, gtf, gff) for representing aligned reads and gene models. They duplicate widely used BioC classes GAlignments and TxDb. By doing so you limit interoperability of your software with the BioC ecosystem which is unfortunate. You also add an extra burden on the BioC user already used to deal with the existing classes. My last request is that you at least export and document these classes. Then the package will be ready for addition to the BioC svn repository and to the nightly builds. Thanks.

Cheers, H.

asrinivasan-oa commented 7 years ago

Hi Hervé, thanks for the feedback. From now on, my colleagues will be taking over this (and the other packages that will follow the acceptance of this package), as Aug '16 was my last month at Open Analytics.

I guess, for this package alone, it would be best if my colleagues continue making the changes from your feedback directly on this account. If there are better ways, or if you prefer another way, please let me know. I did ask about this issue on the bioc-devel mailing list, and there was no reply.

Obviously the maintainer part under DESCRIPTION will also change.


Regarding your question on inheriting from BioC objects and extending them: as I previously mentioned, I implement my own methods for counting reads. I don't know how the counting is performed using internal BioC functions (e.g., what if only one mate of a pair is mapped). I don't think there's a single approach to read counting, and I believe every approach has its own issues.

I use additional flags from BAM files to take care of some additional filtering, for example. I just find it cleaner this way, since it makes implementing methods for counting cleaner, depending on the input object.

I'm not sure I follow how interoperability is affected. And there is no extra burden AFAICT. The derived class is just there to dispatch the relevant method (if one exists, as in the 'gcount' package that is due to be submitted for review). If not, it just falls back to the existing GRanges methods. Perhaps I'm not fully understanding these points.

Sure, we can document these classes. They don't affect anything if one uses 'gread' in isolation. But it helps to use these classes with 'gcount' (which is the next package in line for submission).

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

1b0acc7 add documentation for S4 classes

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/gread_buildreport_20160913033508.html

hpages commented 7 years ago

Hi Arun,

Regarding your question on inheriting from BioC objects and extending it

I never said or suggested that you needed to inherit or extend anything. All I've been trying to say so far is that your bam class is redundant with the GAlignments class. What you're representing with your bam objects (i.e. aligned reads loaded from a BAM file) is exactly what everybody else uses GAlignments objects for.

I implement my own methods for counting reads

You can implement your own read counting method on top of a GAlignments object. Why couldn't you? Did you take the time to familiarize yourself with readGAlignments() and GAlignments objects? You can coerce the GAlignments object to a GRanges or GRangesList object and proceed from there. The choice between GRanges or GRangesList depends on whether you want to ignore the skipped regions (i.e. N's in the cigar) or not. All these things are documented in the man page for GAlignments objects.
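The coercions described here are part of the GenomicAlignments API; a minimal sketch (the BAM path is a placeholder):

```r
library(GenomicAlignments)

gal <- readGAlignments("reads.bam")
gr  <- granges(gal)   # one range per alignment; N's in the cigar are ignored
grl <- grglist(gal)   # GRangesList; N's split an alignment into several ranges
```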

I don't know how the counting is performed using internal BioC functions (e.g., what if only one mate of a pair is mapped).

Why would you need to know internal BioC functions in order to implement your own counting methods on top of a GAlignments object? If you were able to do it for your bam objects, then you should be able to do it for GAlignments objects.

I don't think there's a single approach to read counting.. and I believe every approach has its own issues.

I agree. But you seem to assume that the choice of container for representing the aligned reads (i.e. bam vs GAlignments) determines the counting approach. Well, no, it doesn't.

I use additional flags from BAM files to take care of some additional filtering, for example.

readGAlignments() allows you to load the flag field or any other BAM field you want into the returned GAlignments object. It also allows you to load any tag from the BAM file. It allows you to filter out reads at load time based on the flag. It also allows you to process a BAM file by chunk. It's a very powerful and flexible function that has been around for years and that is very widely used. Did you check its man page?
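For example (a sketch; the BAM path and the particular fields/tags chosen are illustrative):

```r
library(GenomicAlignments)
library(Rsamtools)

## Load extra BAM fields and tags, and filter on the flag at load time.
param <- ScanBamParam(
    flag = scanBamFlag(isDuplicate=FALSE, isSecondaryAlignment=FALSE),
    what = c("flag", "mapq"),   # extra BAM fields to carry along
    tag  = "NH")                # extra BAM tag to carry along
gal <- readGAlignments("reads.bam", param=param)
mcols(gal)$flag   # available for custom filtering downstream
```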

I'm not sure I follow how interoperability is affected.

Because now we have 2 classes for representing exactly the same thing, except that yours has some weaknesses that I don't have the time to comment on here (I'll just say that there are good reasons why we don't use a GRanges object to represent aligned reads). And there is a great deal of tools in Bioconductor that work on one (GAlignments) but not the other (bam).

And there is no extra burden AFAICT

Yes there is. Many people are used to getting a GAlignments object when they load aligned reads from a BAM file. Over the years they've learned how to operate on these objects via the GAlignments API. When they get a bam object, they now get an object of a class they've never seen before. So they need to learn how to operate on these objects.

Again, Bioconductor is an ecosystem that promotes interoperability. Interoperability is achieved by maximum reuse of the core classes (e.g. GRanges, GAlignments, SummarizedExperiment, TxDb, etc...). Before contributors come up with new classes, they are expected to familiarize themselves with the core classes. If they decide that the core classes are not suited to the kind of data that they need to represent, then they are welcome to implement their own classes. This happens sometimes and is OK, as long as they justify their decision. In your case, the classes you introduce are completely redundant with some of the core classes. For example, there is nothing in your bam objects that couldn't be put in a GAlignments object. I'm sorry, but the bam class is gratuitous and unnecessary.

Anyway, thanks for exporting and documenting these classes. I'll look at this revised version of the gread package in the next couple of days and will let you know.

Regards, H.


hpages commented 7 years ago

Hi Arun,

How much testing has your read_gff() function gone through?

library(GenomicFeatures)
GFF3_files <- system.file("extdata", "GFF3_files", package="GenomicFeatures")
a_path <- file.path(GFF3_files, "a.gff3")
read_gff(a_path)
# Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
#   scan() expected 'a real', got '.'

But more importantly: why would you try to reinvent the wheel with this function? As I mentioned earlier, we already have the import() function from the rtracklayer package for this:

library(rtracklayer)
import(a_path)
# GRanges object with 4403 ranges and 16 metadata columns:
#            seqnames               ranges strand |      source           type
#               <Rle>            <IRanges>  <Rle> |    <factor>       <factor>
#      [1] SL2.40ch00       [16437, 18189]      + | ITAG_eugene           gene
#      [2] SL2.40ch00       [16437, 18189]      + | ITAG_eugene           mRNA
#      [3] SL2.40ch00       [16437, 17275]      + | ITAG_eugene           exon
#      [4] SL2.40ch00       [16437, 16479]      + | ITAG_eugene five_prime_UTR
#      [5] SL2.40ch00       [16480, 17275]      + | ITAG_eugene            CDS
#      ...        ...                  ...    ... .         ...            ...
#   [4399] SL2.40ch00 [13786143, 13786301]      - | ITAG_eugene           exon
#   [4400] SL2.40ch00 [13786143, 13786301]      - | ITAG_eugene            CDS
#   [4401] SL2.40ch00 [13786302, 13787123]      - | ITAG_eugene         intron
#   [4402] SL2.40ch00 [13787124, 13787465]      - | ITAG_eugene           exon
#   [4403] SL2.40ch00 [13787124, 13787465]      - | ITAG_eugene            CDS
#              score     phase           Alias
#          <numeric> <integer> <CharacterList>
#      [1]      <NA>      <NA>  Solyc00g005000
#      [2]      <NA>      <NA>                
#      [3]      <NA>      <NA>                
#      [4]      <NA>      <NA>                
#      [5]      <NA>         0                
#      ...       ...       ...             ...
#   [4399]      <NA>         0                
#   [4400]      <NA>         0                
#   [4401]      <NA>      <NA>                
#   [4402]      <NA>         0                
#   [4403]      <NA>         0                
#                                           ID               Name  from_BOGAS
#                                  <character>        <character> <character>
#      [1]               gene:Solyc00g005000.2   Solyc00g005000.2           1
#      [2]             mRNA:Solyc00g005000.2.1 Solyc00g005000.2.1           1
#      [3]           exon:Solyc00g005000.2.1.1               <NA>           1
#      [4] five_prime_UTR:Solyc00g005000.2.1.0               <NA>           1
#      [5]            CDS:Solyc00g005000.2.1.1               <NA>           1
#      ...                                 ...                ...         ...
#   [4399]           exon:Solyc00g052940.2.1.2               <NA>           1
#   [4400]            CDS:Solyc00g052940.2.1.2               <NA>           1
#   [4401]         intron:Solyc00g052940.2.1.2               <NA>           1
#   [4402]           exon:Solyc00g052940.2.1.1               <NA>           1
#   [4403]            CDS:Solyc00g052940.2.1.1               <NA>           1
#               length
#          <character>
#      [1]        1753
#      [2]        1753
#      [3]        <NA>
#      [4]        <NA>
#      [5]        <NA>
#      ...         ...
#   [4399]        <NA>
#   [4400]        <NA>
#   [4401]        <NA>
#   [4402]        <NA>
#   [4403]        <NA>
#                                                                                                                          Note
#                                                                                                               <CharacterList>
#      [1]                                                                                                                     
#      [2] Aspartic proteinase nepenthesin I (AHRD V1 **-- A9ZMF9_NEPAL); contains Interpro domain(s)  IPR001461  Peptidase A1 
#      [3]                                                                                                                     
#      [4]                                                                                                                     
#      [5]                                                                                                                     
#      ...                                                                                                                  ...
#   [4399]                                                                                                                     
#   [4400]                                                                                                                     
#   [4401]                                                                                                                     
#   [4402]                                                                                                                     
#   [4403]                                                                                                                     
#            Ontology_term                  Parent interpro2go_term     nb_exon
#          <CharacterList>         <CharacterList>  <CharacterList> <character>
#      [1]                                                                 <NA>
#      [2]      GO:0006508   gene:Solyc00g005000.2       GO:0006508           2
#      [3]                 mRNA:Solyc00g005000.2.1                         <NA>
#      [4]                 mRNA:Solyc00g005000.2.1                         <NA>
#      [5]                 mRNA:Solyc00g005000.2.1                         <NA>
#      ...             ...                     ...              ...         ...
#   [4399]                 mRNA:Solyc00g052940.2.1                         <NA>
#   [4400]                 mRNA:Solyc00g052940.2.1                         <NA>
#   [4401]                 mRNA:Solyc00g052940.2.1                         <NA>
#   [4402]                 mRNA:Solyc00g052940.2.1                         <NA>
#   [4403]                 mRNA:Solyc00g052940.2.1                         <NA>
#          eugene_evidence_code     sifter_term
#                   <character> <CharacterList>
#      [1]                 <NA>                
#      [2]                 <NA>                
#      [3]                 <NA>                
#      [4]                 <NA>                
#      [5]                 <NA>                
#      ...                  ...             ...
#   [4399]                 <NA>                
#   [4400]                 <NA>                
#   [4401]                 <NA>                
#   [4402]                 <NA>                
#   [4403]                 <NA>                
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Furthermore, when read_gff() works, it returns an object very similar to what import() returns, but with subtle differences:

path <- system.file("tests", package="gread")
gff_file <- file.path(path, "sample.gff")
import(gff_file)
# GRanges object with 4 ranges and 9 metadata columns:
#       seqnames               ranges strand |         source     type     score
#          <Rle>            <IRanges>  <Rle> |       <factor> <factor> <numeric>
#   [1]        7 [27221129, 27224842]      - | protein_coding     gene      <NA>
#   [2]        7 [27221134, 27224835]      - | protein_coding      RNA      <NA>
#   [3]        7 [27221134, 27222647]      - | protein_coding     exon      <NA>
#   [4]        7 [27224055, 27224835]      - | protein_coding     exon      <NA>
#           phase           Alias                     ID        Name      length
#       <integer> <CharacterList>            <character> <character> <character>
#   [1]      <NA> ENSG00000005073        ENSG00000005073      HOXA11        3714
#   [2]      <NA>                    RNA:ENST00000006015  HOXA11-001        <NA>
#   [3]      <NA>                 exon:ENST00000006015.2        <NA>        <NA>
#   [4]      <NA>                 exon:ENST00000006015.1        <NA>        <NA>
#                               Parent
#                      <CharacterList>
#   [1]                               
#   [2]           gene:ENSG00000005073
#   [3] protein_coding:ENST00000006015
#   [4] protein_coding:ENST00000006015
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

read_gff(gff_file)
# gff object with 4 ranges and 13 metadata columns:
#       seqnames               ranges strand |         source     feature
#          <Rle>            <IRanges>  <Rle> |    <character> <character>
#   [1]        7 [27221129, 27224842]      - | protein_coding        gene
#   [2]        7 [27221134, 27224835]      - | protein_coding         RNA
#   [3]        7 [27221134, 27222647]      - | protein_coding        exon
#   [4]        7 [27224055, 27224835]      - | protein_coding        exon
#           score     phase           Alias                ID        Name
#       <numeric> <integer>     <character>       <character> <character>
#   [1]      <NA>      <NA> ENSG00000005073   ENSG00000005073      HOXA11
#   [2]      <NA>      <NA>            <NA>   ENST00000006015  HOXA11-001
#   [3]      <NA>      <NA>            <NA> ENST00000006015.2        <NA>
#   [4]      <NA>      <NA>            <NA> ENST00000006015.1        <NA>
#                Parent      length   transcript_id         gene_id
#           <character> <character>     <character>     <character>
#   [1]            <NA>        3714            <NA> ENSG00000005073
#   [2] ENSG00000005073        <NA> ENST00000006015 ENSG00000005073
#   [3] ENST00000006015        <NA> ENST00000006015 ENSG00000005073
#   [4] ENST00000006015        <NA> ENST00000006015 ENSG00000005073
#       transcript_name   gene_name
#           <character> <character>
#   [1]            <NA>      HOXA11
#   [2]      HOXA11-001      HOXA11
#   [3]      HOXA11-001      HOXA11
#   [4]      HOXA11-001      HOXA11
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Why rename the type metadata column to feature? Why remove the RNA: and exon: prefixes from the ID metadata column? This breaks standard downstream tools like makeTxDbFromGRanges()!

The Bioconductor ecosystem doesn't need another GFF import function that breaks downstream tools. If you need a GFF import function that returns a gff object, please just make it a simple wrapper to rtracklayer::import.gff() so that any tool that works on the output of the latter also works on the output of the wrapper.
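To see why the renaming matters, here is a minimal check (using the same gff_file as above; the comment about read_gff() reflects the output shown earlier):

```r
library(rtracklayer)
library(GenomicFeatures)

gr <- import(gff_file)
# makeTxDbFromGRanges() looks up the standard 'type' metadata column and
# the 'ID'/'Parent' identifiers exactly as emitted by import():
txdb <- makeTxDbFromGRanges(gr)

# read_gff() renames 'type' to 'feature' and strips the "RNA:"/"exon:"
# prefixes, so the same call on its output cannot find what it expects.
```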

Your read_bed() function has the same issue: it needs to be compatible with rtracklayer::import.bed().

As I tried to explain before, and as Michael Lawrence says here https://github.com/Bioconductor/Contributions/issues/87 in the context of reviewing the contributed package annotatr for inclusion in Bioconductor:

The whole idea of Bioconductor is that it's a collaborative project working to construct a cohesive platform. It's not just a repository like CRAN.

Regards, H.

asrinivasan-oa commented 7 years ago

@openanalytics, reading the first response from @hpages, I thought we could fix it by adding a .setclass argument that would be FALSE by default, so that classes aren't set by default. In gcount, we could set this to TRUE.

But seeing the second reply, I am convinced that we have to move gread entirely inside gcount (just copy-pasting) and submit gcount directly. Don't export anything related to reading. We should just use ::: instead.

@hpages, read_gff first uses the import function from rtracklayer; if that fails, it falls back to fread, and if that fails as well, to read.table. I don't know why you get what you get, and I can't test it as I don't have access to a machine yet.
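For reference, the fallback logic described here could be sketched roughly like this (hypothetical function name; the actual gread internals may differ):

```r
read_gff_with_fallback <- function(file) {
    tryCatch(
        # preferred path: rtracklayer's standards-compliant parser
        rtracklayer::import(file),
        error = function(e) tryCatch(
            # second attempt: data.table's fast reader for malformed files
            data.table::fread(file, header = FALSE, sep = "\t"),
            error = function(e2)
                # last resort: base R
                utils::read.table(file, header = FALSE, sep = "\t",
                                  stringsAsFactors = FALSE)
        )
    )
}
```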

On BED files: because it was simply faster in my benchmarks, and I had to deal with very large BED files. I don't have issues with naming them as required by BioC. But I get it. I have recommended that this package be absorbed into gcount, a package that counts RNA-seq reads the way we prefer.

But now I am not sure if that would be accepted, given that the BioC ecosystem has tools for that as well.

@openanalytics, if that is indeed the case, BioC is probably not the place for the read/count/analyse wrappers we had in mind.

lawremi commented 7 years ago

Hiding gread inside of gcount would just make things worse. It would help to understand the salient features of gread. Is it the scalability? Are the existing import routines too slow? Or is it that you want data kept in data.table form, instead of Bioc data structures? If the former, we can work together to make things better. If the latter, it sounds like this is really meant for the "data.table" platform instead of Bioconductor.

arunsrinivasan commented 7 years ago

The gcount package was necessary to count reads with a few lines of code, irrespective of whether the gene annotation file is GTF/GFF/BED and the mapped reads are in BAM/BED, and to count them the way we would like.

I introduced classes to make sure one doesn't, for example, count reads between GFF and GFF in gcount. These packages are intended to provide a higher level of abstraction for people without necessarily deep expertise in GRanges or GAlignments objects.

Since I saw a use for loading files with just one function, rather than looking up two packages depending on whether it is BAM or another format, and thought it would be useful, I decided to export it as gread.

Purpose for gread: one function for all 4 formats. Fall back to read.table when rtracklayer doesn't work. There have been cases where annotation files are of poor quality. Loading BED files is faster. Classes were solely added with gcount in mind. But with all these complications, I think going back to the original idea of just having a counts package is better for everyone.

If you still think it will make things worse, could you please explain?

I don't have any requirement to keep them in data.table structures, which I have reiterated multiple times; I have also made changes following feedback immediately and without issues.

arunsrinivasan commented 7 years ago

Excuse the typos; typing from my iPad.

lawremi commented 7 years ago

Worse only because there would still be redundant functionality, but it would no longer be as modular. The rtracklayer import() function already handles different formats transparently, including BED, GFF, BAM, etc. Most of the time though, users get the annotations from TxDb. summarizeOverlaps() already counts directly from a BAM. I guess that could be extended for BED reads, but I've never encountered that use case. It does sound like trying to read arbitrary tabular text files is something people want from rtracklayer, but it wouldn't be too different from makeGRangesFromDataFrame(yourFavoriteFileReader(file)).
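That last combination might look like this (illustrative; any tabular reader producing the usual range columns works):

```r
library(GenomicRanges)

# stand-in for yourFavoriteFileReader(file): a data.frame with
# seqnames/start/end/strand columns plus any extra metadata
df <- data.frame(seqnames = "7",
                 start    = c(27221129, 27221134),
                 end      = c(27224842, 27224835),
                 strand   = "-",
                 score    = c(1.5, 2.0))

# extra columns become metadata columns on the resulting GRanges
gr <- makeGRangesFromDataFrame(df, keep.extra.columns = TRUE)
```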

For reference, counting a BAM over BED regions looks like this now:

summarizeOverlaps(import("regions.bed"), BamFile("reads.bam"))
hpages commented 7 years ago

Implementing alternative read counting methods is fine. There is no consensus on the best way to do this, and Bioconductor already provides many packages for read counting, so a package like gcount has its place in Bioconductor. What we need to understand is why your counting functions cannot take standard objects (i.e. GAlignments and TxDb) as input. I see no reason for that, and so far you haven't provided one.

Unlike for counting methods, there is a strong consensus among the Bioconductor community about which core classes should be used to represent what (e.g. GAlignments for aligned reads, TxDb for gene models, SummarizedExperiment for the output of counting methods, etc.). All these core objects have a well-defined API. Unlike for analysis approaches (e.g. counting methods, DE analysis, etc.), there is no place for diversity at this level; that would hurt the project. The community welcomes alternative analysis approaches, but keeping the input objects as standard as possible is a must, for all the reasons already mentioned earlier (interoperability, user friendliness, etc.), and also because it greatly facilitates comparison between all the existing analysis approaches.

H.
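Concretely, a counting function built on the standard classes could accept input like this (a sketch of the suggestion, not gcount's actual API):

```r
library(GenomicFeatures)
library(GenomicAlignments)

count_by_gene <- function(gal, txdb) {
    # standard inputs: GAlignments for aligned reads, TxDb for gene models
    stopifnot(is(gal, "GAlignments"), is(txdb, "TxDb"))
    # exonsBy() extracts the gene models; summarizeOverlaps() returns a
    # SummarizedExperiment, the standard container for count matrices
    summarizeOverlaps(exonsBy(txdb, by = "gene"), gal)
}
```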

hpages commented 7 years ago

We need to move forward.

  1. Please turn read_gff() and read_bed() into simple wrappers around rtracklayer::import.gff() and rtracklayer::import.bed(), respectively. By simple wrappers I mean that they return the same GRanges object but promoted to class gff or bed, respectively.
  2. These simple wrappers are now very thin and add very little value over rtracklayer::import.gff() and rtracklayer::import.bed(), so having them in their own package (gread) is not really justified anymore. It makes sense to move them to gcount. But they would still need to be exported and documented there. Maybe Michael was concerned that you would hide (i.e. not export) them in gcount and have the higher level functions defined in gcount take care of the import step by calling them internally. I agree that this would not be desirable.
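Such a wrapper could be as thin as the following (a sketch of the suggestion, not an official implementation):

```r
library(methods)

# "gff" simply extends GRanges, adding no slots of its own
setClass("gff", contains = "GRanges")

read_gff <- function(file, ...) {
    # import.gff() returns a GRanges; promote it to class "gff" so that
    # gcount methods can dispatch on it, while anything that works on a
    # plain GRanges keeps working unchanged
    as(rtracklayer::import.gff(file, ...), "gff")
}
```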

Also by getting rid of the gread package we discourage other developers from building their own pipelines on top of your read_* functions. They are really specific to the gcount workflow and should be used only in that context.

Thanks, H.

hpages commented 7 years ago

Hi Arun,

Just wanted to check with you what the status of the package is. Are you planning to follow up on your submission?

Thanks, H.

arunsrinivasan commented 7 years ago

@hpages I do not work for the company anymore. I've forwarded your message.

hpages commented 7 years ago

Hi Arun,

OK. If nobody is willing to follow up on this submission, I'll close the issue at the end of April.

Thanks, H.