Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

tRNAscan2GRanges #607

Closed FelixErnst closed 6 years ago

FelixErnst commented 6 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 6 years ago

Hi @FelixErnst

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: tRNAscan2GRanges
Title: Imports a tRNAscan-SE result file as GRanges object
Version: 0.99.0
Authors@R: person("Felix G.M.", 
        "Ernst", 
        email = "felix.gm.ernst@outlook.com", 
        role = c("aut", "cre"))
Description: The package imports the result of tRNAscan-SE as a GRanges object.
Depends: R (>= 3.5)
License: GPL-3 + file LICENSE
Encoding: UTF-8
LazyData: true
biocViews:
  Software,
  WorkflowStep,
  Preprocessing
Imports:
  BiocGenerics,
  Biostrings,
  GenomicRanges,
  GenomeInfoDb,
  assertive,
  S4Vectors
Collate:
  tRNAScan2GRanges.R
RoxygenNote: 6.0.1
Suggests: 
  BiocStyle, 
  knitr,
  rmarkdown,
  testthat
VignetteBuilder: knitr
bioc-issue-bot commented 6 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

lawremi commented 6 years ago

There is a lot of low-hanging fruit for optimization and cleanup. The vignette needs to be developed: what is the tRNA-SE format, where is it encountered, what are some typical things people do with the data once they've imported it? The vignette section on the "Literature" seems to contain the wrong content.

FelixErnst commented 6 years ago

@lawremi I added some more detailed information in the vignette. Could you specify, what you mean with cleanup?

Apart from the little example case scenario, which is now part of the vignette, a typical usage case will be part of a follow up package, but I wanted to split this functionality since it is straight forward and contained.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

be41728 version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

lawremi commented 6 years ago

By cleanup I mean the parser code. It should be vectorized, rather than looping over rows and blocks in R.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

abcc7b8 vectorized text block parsing

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

FelixErnst commented 6 years ago

Thanks for insisting. I previously didn't come up with a vectorized solution. Learned something :)

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

bdc433b version bump f543b13 Merge branch 'master' of https://github.com/FelixE... ed62486 version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

adac41a version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

c8d668d version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

3697940 version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

ed0d0e0 added tRNAscan2GFF function to produce gff3 compli... 6272eb6 removed clashes with gff3 defined columns

FelixErnst commented 6 years ago

I added some visualization to the vignette and removed some clashes with gff3 naming conventions. Currently I don't have anything more to add.

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

FelixErnst commented 6 years ago

@hpages @lawremi Hi. just wanted to ask if you have any additional points I should take care of. Thanks

hpages commented 6 years ago

Hi @FelixErnst , I started to look at tRNAscan2GRanges and will provide some feedback here today. H.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

04bf25f corrected spelling mistakes in the vignette

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

b3a1fb2 - vignette text changes - fixed spelling mistakes

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

FelixErnst commented 6 years ago

Hi @hpages, any news?

FelixErnst commented 6 years ago

Hi @hpages I am sorry to bother you again. Is there any feedback you can provide? Thanks! Felix

hpages commented 6 years ago

Hi @FelixErnst,

I got distracted, sorry.

I took a first look at tRNAscan2GRanges and have some feedback for you. See below.

Please let me know or ask on the bioc-devel mailing list if you have any question or concern about this.

Thanks, H.

1) Interface design and naming:

- `tRNAscan2GFF()` does something like `some_transformation(tRNAscan2GRanges())` i.e. it transforms the GRanges object returned by `tRNAscan2GRanges()` before returning it to the user. Have you considered making this transformation available as a separate function? This would allow the user who called `tRNAscan2GRanges()` to switch to the GFF3-compliant format without having to reload the data from the file, which could be costly if the file is big. Also, in contexts where the original file is not accessible anymore, one could still transform an object that was previously loaded with `tRNAscan2GRanges()`.

- The functionality provided by `tRNAscan2GFF()` could be merged into `tRNAscan2GRanges()` by adding a toggle to the former e.g. an `as.GFF` argument (`FALSE` by default).

- The name of the `tRNAscan2GRanges()` function makes it sound like it performs some kind of coercion when it's in fact an I/O function. `read_tRNAscan()` or `import_tRNAscan()` would be better names for it. Or `read_tRNAscanAsGRanges()` if you want the name to convey the type of object that is returned (but is that important?)

- Now that the _tRNAscan2GFF_ name is available, it could be used for the function that transforms the GRanges object returned by `read_tRNAscan()` to the GFF3-compliant format.

- To summarize, `read_tRNAscan()` would do:

        tRNAscan <- .read_tRNAscan_from_file(...)  # Does what current tRNAscan2GRanges()
                                                   # does but is now internal
        if (as.GFF)
            tRNAscan <- tRNAscan2GFF(tRNAscan)
        return(tRNAscan)

- So the user still sees 2 functions, `read_tRNAscan()` and `tRNAscan2GFF()`, but now they provide orthogonal functionalities and they also have better names.

- Since the _tRNAscan2GRanges_ package was named after its central function, it feels like maybe it should also be renamed. One obstacle though is that underscores are not allowed in package names, and _readtRNAscan_ is not very readable. So what about something like _tRNAscanImporter_ or _tRNAscanReader_?

> "There are only two hard things in Computer Science: cache invalidation and naming things."
> http://www.meerkat.com/2017/12/naming-things-hard/

2) The automatic names on the returned GRanges object are not helpful or meaningful so should be removed.

3) Use the right storage for the right type of data:

- Metadata columns `phase`, `tRNA_length`, `tRNA_anticodon.start`, `tRNA_anticodon.end`, `tRNAscan_intron.start`, `tRNAscan_intron.end`, `tRNAscan_intron.locstart`, and `tRNAscan_intron.locend`, should be integer vectors (so with NAs for missing values).

- The `source` and `type` metadata columns should be factors.

- The `score` metadata column should be a numeric column (so with NAs instead of dots for missing scores).

- Note that this will bring the GFF3-compliant GRanges object closer to what `rtracklayer::import.gff3()` returns. It's actually interesting to compare the object one gets after exporting and re-importing it (with `rtracklayer::export.gff3()` / `rtracklayer::import.gff3()`) with the original object.

4) Coding:

- Something already mentioned by Michael is that your code could avoid loops and be more efficient by taking advantage of the fact that many things in R are vectorized and fast. For example:

        ## Instead of:
        vapply(intron_length, function(l) if (l > 0) {l + 1} else 0, numeric(1))
        ## do:
        ifelse(intron_length > 0, intron_length + 1, 0)
        ## which is much faster.

- Don't place curly braces around the call to `standardGeneric()` in the definition of your S4 generics. This makes them look as non-standard generic functions. I don't think that's what you want.

- Please do not use `::` inside your own package for calling things defined inside it. This is not needed and is actually discouraged.

5) When opening a man page with ? in a terminal, text is not properly formatted which makes it hard to read e.g.:

        Description:

          ‘gettRNAscanSummary()’: creates an DataFrame with aggregates of
          information ‘plottRNAscanSummary()’: If ggplot2 is installed a
          plot of the data is generated

        ...

        Value:

          ‘gettRNAscanSummary()’: DataFrame ‘plottRNAscanSummary()’: list of
          ggplots per column of data

- The `plottRNAscanSummary()` example should really show a plot. Right now it doesn't.

6) Vignette:

- "Importing as GRanges" would be more accurate than "Converting to GRanges".

- Explicitly do `library(Biostrings)` before using its functionalities (doing `library(rtracklayer)` does not attach Biostrings to the `search()` path).

- Once you've taken care of explicitly calling `library()` on each package you're going to use, you don't need (and should not) use fully qualified names like `Biostrings::writeXStringSet`. They're distracting in a vignette. Same thing for the man pages.

- When opening the HTML vignette in a browser, the output of `head(gr, 2)` is too wide and doesn't fit in the grey box so is wrapped. Making the browser window wider doesn't help (it doesn't make the body of the vignette wider, which is unfortunate). Maybe report this issue or ask for help to the BiocStyle people: https://github.com/Bioconductor/BiocStyle
FelixErnst commented 6 years ago

@hpages Thanks for the feedback.

In any case: After renaming the package, the github URL would change. Would I need to resubmit it under the new name?

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

64b9620 restructuring based on BioC submission comments

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

f6bd49e BiocCheck fixes

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

FelixErnst commented 6 years ago

First things first: Thanks again for the feedback. First submissions are always tricky and the feedback is so important to learn from ones "mistakes".

  1. I renamed the package to tRNAscanImport and the main function import.tRNAscanAsGRanges analogous to import.gff3 from the rtracklayer package . As suggested a direct conversion to a gff3 compatible GRanges object can be triggered by setting as.gff3 = TRUE and tRNAscan2GFF expects a GRanges as input, so it can be converted from compatible input independently of the origin of the GRanges object.

  2. Do you mean the ID column? That is a bit of functionality, which one has to have, since tRNA have sometimes multiple copies per chromosome. I followed the naming convention SGD uses. In my opinion, that is the most robust naming scheme, to ensure unique IDs.

  3. Done on the types. Apart from the numeric/integer values, the export.gff3/import.gff3 looks much alike the output of tRNAscan2GFF.

  4. Done

  5. Done

  6. Regarding the width issue: I will open an issue one the BiocStyle repo. The reason for the wrapping probably is, that the character vector in tRNA_str does not have a space in in it. So I don't know, how that can be resolved in BiocStyle. It could be solved by extending the XStringSet for something like DotBStringSet (dot bracket annotation of nucleotide secondary structure), which would than behave like DNAStringSet masking longer sequences.

Since the build broke because of some path not found, I assume it was because of the renaming. The Webhook also does not work anymore, since the BioC build system does not know about the package with the new name. ("Sorry, you haven't told us about this repository, please go to https://github.com/Bioconductor/Contributions/issues/new .")

I opened another issue with renamed package: https://github.com/Bioconductor/Contributions/issues/645 Is that the right procedure or should/can I do something else?

hpages commented 6 years ago

Yes, opening a new issue is the right procedure. I'll close this one and follow-up on the new one. Thanks!