Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

IntEREst #217

Closed gacatag closed 7 years ago

gacatag commented 7 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 7 years ago

Hi @gacatag

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: IntEREst
Title: Intron-Exon Retention Estimator
Version: 0.99.0
Date: 2016-12-15
Author: Ali Oghabian <Ali.Oghabian@Helsinki.Fi>, 
    Dario Greco <dario.greco@helsinki.fi>, 
    Mikko Frilander <Mikko.Frilander@helsinki.fi>
Maintainer: Ali Oghabian <Ali.Oghabian@Helsinki.Fi>,
 Mikko Frilander <Mikko.Frilander@helsinki.fi>
Description: This package performs Intron-Exon Retention analysis on RNA-seq data (.bam files).
Depends: R (>= 3.3), GenomicRanges, Rsamtools, edgeR, S4Vectors
Imports: seqLogo, Biostrings, GenomicFeatures, IRanges, seqinr, limma,
        graphics, grDevices, stats, utils, grid, methods, DBI, RMySQL
Suggests: clinfun, MDS.Chr22.U12Genes, knitr, Rmpi
VignetteBuilder: knitr
LazyData: true
biocViews: Software, AlternativeSplicing, Coverage,
        DifferentialSplicing
License: GPL-2
NeedsCompilation: no
Packaged: 2016-12-15 17:29:13 UTC; oghabian
bioc-issue-bot commented 7 years ago

Your package has been approved for building. Your package is now submitted to our queue.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20161215133117.html

lawremi commented 7 years ago

Glanced at this. The interestResult class (should be capitalized) has slots (which should be of type integer) that are subscripts into the columns of another data.frame slot. Why not just store those columns directly as slots? Is it so important that they can be moved around? Btw, the representation arg to setClass() has been deprecated in favor of the slots arg.

bioc-issue-bot commented 7 years ago

We only start builds when the Version field in the DESCRIPTION file is incremented. For example, by changing

Version: 0.99.0

to

Version 0.99.1

If you did not intend to start a build, you don't need to do anything. If you did want to start a build, increment the Version: field and try again.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

ff5865d Update DESCRIPTION

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20161215141724.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

e9684e6 Update DESCRIPTION

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20161215145056.html

bioc-issue-bot commented 7 years ago

Hi @gacatag,

Starting build on additional package https://github.com/gacatag/MDS.Chr22.U12Genes.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your additional package repository will NOT trigger a new build.

The DESCRIPTION file of this additional package is:

Package: MDS.Chr22.U12Genes
Title: MDS chromosome 22 data
Version: 0.99.0
Date: 2016-12-15
Author: Ali Oghabian
Maintainer: Ali Oghabian <ali.oghabian@helsinki.fi>
Description: It's a RNA-seq mapped data of MDS patient and healthy samples.
Suggests: GenomicAlignments
LazyData: true
NeedsCompilation: no
biocViews: ExperimentData, Genome, Homo_sapiens_Data, SequencingData,
    GEO, NCI, RNASeqData, ArrayExpress
License: GPL-2
Packaged: 2016-12-15 18:34:00 UTC; oghabian
bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/MDS.Chr22.U12Genes_buildreport_20161215155939.html

vobencha commented 7 years ago

Hi, I see you've uploaded a data package. Have you incorporated the comments Michael made for the software package? If yes, I'll go ahead and do a full review.

Please subscribe to the bioc-devel mailing list:

Valerie

gacatag commented 7 years ago

Hi Valerie ( @vobencha )! No sorry! haven't incorporated Michael's comments yet. His suggestion regarding the class was actually very good. I'm working on it now and will incorporate it soon, and commit the changes.

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

2870526 Changing interestResult class + Correcting referen...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170105101251.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

a050f0f Capitalizing interestResult class name and changin...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170105121649.html

gacatag commented 7 years ago

Dear Valerie ( @vobencha ) Hi! I have now incorporated Michael's ( @lawremi) comments. Feel free to proceed with the review. Cheers! Ali

vobencha commented 7 years ago

Hi,

1) Please fix these issues raised on the single package build report:

Shorten the run time for this example:

** running examples for arch 'i386' ... [220s] OK
Examples with CPU or elapsed time > 5s
                user system elapsed
getRepeatTable 34.02   1.44  211.95

Why so many unevaluated code chunks?

* Checking vignette directory...
    This is a software package
    # of chunks: 13, # of eval=FALSE: 3 (23%)

Optionally add more biocViews so users can find your package more easily:

* Checking for recommended biocViews...
    * NOTE: Consider adding these automatically suggested biocViews:
      Transcription, Sequencing, Alignment, Normalization, Bayesian,
      Clustering, Regression, GeneExpression, RNASeq, ChIPSeq,
      DifferentialExpression, GeneSetEnrichment, BatchEffect,
      MultipleComparison, QualityControl, TimeCourse, DataImport

All exported functions / man pages need runnable examples:

* Checking exported objects have runnable examples...
    * NOTE: Consider adding runnable examples to the following man
      pages which document exported objects:
      annotateU12.Rd, buildSsTypePwms.Rd, interest.Rd,

Please format your code to max 80 characters wide:

* Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source,
  and vignette source...
    * NOTE: Consider shorter lines; 762 lines (14%) are > 80 characters
      long.

2) Remove the build/ and inst/doc/ directories. Both are generated automatically by R CMD build/check and are not needed in the source.

Valerie

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

c686b0e dontrun getRepeatTable.Rd example + New biocViews ...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170111105648.html

gacatag commented 7 years ago

Dear Valerie ( @vobencha ) Hi, Thank you for your review and comments.

I have modified the codes and here are the main changes to the package together with my response to your comments:

1.

Shorten the run time for this example: ... getRepeatTable

I changed the example into a \dontrun. The getRepeatTable() function connects to UCSC and builds a data.frame that includes information regarding repetitive DNA sequences in the human genome. Connecting to UCSC, downloading and accessing the repeats file (rmsk) from UCSC consumes most of the time in this function. Even requesting a small family of repetitive elements (e.g. "Merlin" with only 57 reported cases) did not decrease the running time significantly.

Why so many unevaluated code chunks?

  • Checking vignette directory... This is a software package .# of chunks: 13, # of eval=FALSE: 3 (23%)

The 2nd, 3rd, and 4th chunks in the vignette include codes that are time consuming to run (~ 15 minutes). Out of these chunks, the results of the first 2 interest() and interest.sequential() are exactly the same, except that the interest() runs in parallel (on multiple computing cores), hence it is provided as an example, demonstrating how to run the process on multiple cores, in parallel. Due to the large size of most of studied binary alignment mapped (.bam) files I think a lot of users would want to run in parallel mode, and this example would be helpful to them. Please note that the object which is the result of running these 3 chunks, i.e. mdsChr22Obj, is provided as a data inside the package. The remaining (and runnable) chunks inside the vignette document use the data for various downstream analysis and plots.

Optionally add more biocViews so users can find your package more easily:

  • Checking for recommended biocViews...
    • NOTE: Consider adding these automatically suggested biocViews: Transcription, Sequencing, Alignment, Normalization, Bayesian, Clustering, Regression, GeneExpression, RNASeq, ChIPSeq, DifferentialExpression, GeneSetEnrichment, BatchEffect, MultipleComparison, QualityControl, TimeCourse, DataImport

These biocViews were added: Sequencing, RNASeq, Alignment, Normalization, and DifferentialExpression

All exported functions / man pages need runnable examples:

  • Checking exported objects have runnable examples...
    • NOTE: Consider adding runnable examples to the following man pages which document exported objects: annotateU12.Rd, buildSsTypePwms.Rd, interest.Rd,

I added a runnable example to the annotateU12.Rd manual which runs in a reasonable time. However, interest() and interest.sequential() are wrapper functions (that run several other functions) and are relatively time-demanding to run. They read a .bam file N lines at a time (N can be set by the user) and divide it to several smaller files (that can be fully managed by the memory), then the functions continue with counting the number of the reads in the individual files that hit each region (intron or exon), and in the end they sum the results from the smaller files and normalize the read counts. The buildSsTypePwms() and (as mentioned previously in my comments) getRepeatTable() functions are relatively time demanding, they connect to other sources for building data (buildSsTypePwms() connects and downloads data from U12DB or SpliceRack and getRepeatTable() connects to UCSC ). The tables generated by these functions can be used in various functions in the IntEREst package e.g. interes() and interest.sequential().

Please format your code to max 80 characters wide:

  • Checking formatting of DESCRIPTION, NAMESPACE, man pages, R source, and vignette source...
    • NOTE: Consider shorter lines; 762 lines (14%) are > 80 characters long.

All lines in the source files were shortened to (<80 characters) except for 10 lines in the vignettes/IntEREst.Rmd file. These lines are longer due to defining chunk parameters. Apparently the parameters have to be defined in a single line.

2.

Remove the build/ and inst/doc/ directories. Both are generated automatically by R CMD build/check and are not needed in the source.

I removed the build/ and inst/doc/ directories as you suggested.

Thank you again!

Ali

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

b5a031e Version bump

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170124033153.html

gacatag commented 7 years ago

Dear Valerie ( @vobencha ), Hi! In addition to the changes I mentioned in my previous response I also made a small correction to the R/u12Boxplot.R, hence the latest version bump. Cheers!

vobencha commented 7 years ago

Thanks. I'll have a look at this today. Valerie

vobencha commented 7 years ago

(1) parallel

I would recommend using the BiocParallel or doParallel packages instead of Rmpi. These 2 are much more recent packages and have been developed to be cross-platform compatible. Rmpi has some historical system requirements/issues that are best avoided. Additionally, the code attempts to specify exactly what to sent to each worker and mange the details of communication. This approach is no longer necessary with newer implementations in BiocParallel (build on the paralle and snow packages) and doParallel.

(2) zzz.R

It is not good practice to set global variables. Please explain why you think you need this and I will suggest an alternative.

(3) Smaller data for examples.

The functions need to be evaluated in either the man pages or vignette. The idea is to test the function at least once in the nightly builds. You will need to use smaller subsets of data and specify smaller subsets in the ScanBamParam object. Here are some small bam files you could use in your examples if you need ideas: http://www.bioconductor.org/packages/release/data/experiment/html/pasillaBamSubset.html http://www.bioconductor.org/packages/release/data/experiment/html/RNAseqData.HNRNPC.bam.chr14.html http://www.bioconductor.org/packages/release/data/experiment/html/SCLCBam.html http://www.bioconductor.org/packages/release/data/experiment/html/leeBamViews.html

(4) Code inefficiencies

The code needs to be streamlined and in many cases it looks like you are reinventing the wheel by not reusing the existing infrastructure as it is intended. I will walk through interest() as an example but it is necessary to make these changes in all functions.

interest():

This function is trying to do too much. bamPreprocess() should read in the data but I don't understand why it needs to write out files. If it does, it should write to a location specified by the user. A reasonably clear workflow would consists of these steps:

bamPreprocess():

(i) reassignment This reassignment of functions in Rsamtools to variables of the exactly same name is poor practice (NAMESPACE clash) and not necessary. Please remove the reassignments and use the functions/objects directly from Rsamtools.

BamFile<- Rsamtools::BamFile
ScanBamParam<- Rsamtools::ScanBamParam   
scanBam<- Rsamtools::scanBam   
scanBamFlag<- Rsamtools::scanBamFlag
scanBamWhat<- Rsamtools::scanBamWhat

(ii) hard coded paths This hard-coded approach to writing files is not cross-platform compatible and must be removed. Use a path specified by the user (default) or tmpdir() and tmpfile(). dir.create(paste(outFolder,"read1/", sep="/"))
dir.create(paste(outFolder,"read2/", sep="/")) dir.create(paste(outFolder,"single/", sep="/"))

All references to 'read1', 'read2' and 'single' directories need to be removed, e.g,

paste(outFolder, "/read2/tmpSam_", count, sep=""),

(iii) scanBam vs readGAlignment* functions Why are you using scanBam() instead of readGAlighments(), readGAlignmentPairs() or readGAlignmentsList()? You are re-inventing the wheel in many aspects here.

It looks as though you expect both paired and unpaired data in one file ... ? The readGAlignmentsList() reads in all records with a metadata column on the object ('mate_status') indicating if a read is paired or not. Please look at ?readGAlignments man page. I think you should reuse one of these functions and not scanBam().

(iv) create param once Create the ScanBamParam once, outside the while() loop.

scParam=ScanBamParam(
       what=scanBamWhat()[c(1,3,5,8,13,9, 10, 6, 4, 14, 15)],
       flag=scanBamFlag(isPaired=TRUE,
       isDuplicate=includePairedDuplicate))

interestAnalyse()):

(i) Please use BiocParallel or doParallel instead of Rmpi.

(ii) Remove these reassignments:

subjectHits<-  S4Vectors::subjectHits
queryHits<- S4Vectors::queryHits
GRanges<- GenomicRanges::GRanges
findOverlaps<- GenomicRanges::findOverlaps
IRanges<- IRanges::IRanges

(ii) Many of your lapply's don't need to be loops. Operations on the CIGAR can be vectorized as described here: http://www.bioconductor.org/developers/how-to/efficient-code/

InterestResult object

(i) Please use the SummarizedExperiment class as the output instead of creating your own S4 InterestResult.

Valerie

gacatag commented 7 years ago

Dear Valerie ( @vobencha ), Hi! Thank you for your insightful comments and hints. I'll try to go through them and implement the changes you suggested. For now I would like to mention the following:

*Regarding BiocParallel or doParallel over Rmpi, I will try to use these new packages that are supported by Bioconductor.

*I'll remove the zzz.R file and not define global variables. I added that initially to get rid of a warning during the test but now I didn't get the warning when I removed zzz.R and ran the tests.

*I will replace scanBam() with readGAlignmentPairs() and readGAlignments but regarding the parallel coding in interest():
My goal is to read a single large input .bam file, N mapped-reads at a time and run analysis (Mapped reads summarization or counting reads mapped to each intron and exon) in parallel over n computing cores. In other words, I would like to parallelize over several read chunks from a single .bam file rather than over several .bam files. Because, I would need to speed up the analysis of a single large .bam file and analyze it without loading the whole file into the memory. I would also need to know which reads are paired-mapped (both pair of reads are mapped) and which are single-mapped. To my knowledge, to implement this I either should

I) First read from the .bam file n mapped-reads at a time (first reading paired mapped reads and then the single mapped), write them in temporary files and then after opening the parallel processes distribute these files so that the same analysis would be run on these temp files while passing a value to the nodes that tells whether the analyzed temp file includes single or pair mapped reads., or

II) I should provide index numbers, and implement such that one of the node reads the first n mapped reads from the .bam file, and the rest of the nodes import mapped reads from the same .bam file but skip 1 to k chunks of reads to read a chunk of read that would be analyzed only by that node.

The interest() function right now uses the policy I but now I am planning to change it to II. Feel free to recommend if you know a better solution.

Cheers!

Ali

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

39cec2c Change parallelization from Rmpi to doParallel and...

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

31be60e removing zzz.R and global variables

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170131121234.html

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170131121911.html

vobencha commented 7 years ago

Hi, Responding to comments in https://github.com/Bioconductor/Contributions/issues/217#issuecomment-275175414.

I do not think either option I or II you describe are good approaches. Once you modify the code to use BiocParallel or doParallel you won't need to manage jobs at the node/worker level. I strongly advise against doing that - it is error prone and likely not platform independent. This is what I think needs to happen:

Why do you want parallel workers to operate on chunks of the same bam file instead of having each working work through the file in chunks? You will incur much less time/memory overhead if you just let one worker move through one bam file in chunks; the code in GenomicAlignments/Rsamtools has been optimized to do this. You don't want to manage opening a bam file, give one chunk to worker A, save your place in the file, close the file, then worker B opens the file and tries to find where worker A left off.

Valerie

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

017610c Changing parallelization to use BiocParallel & bpi...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170202105125.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

e77a47d Removing FUN<-PKG::FUN

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170202114647.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

38c8ab2 Removing InterestResults S4 class and using Summar... e5317a2 Removing InterestResults S4 class and using Summar...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170206120248.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

68194c0 Correcting for no matches of reads to reference

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170207052744.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

97bdea3 Including new small bam file for test

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170207084045.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

8a0c5b3 Bringing out ScanBamParam() from the iterations an...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170207113532.html

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

c3d81bc suggest MDS.Chr22.U12Genes package

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170207115228.html

gacatag commented 7 years ago

Dear Valerie ( @vobencha ), Hi! Thank you for your insightful comments and hints. I have addressed all comments in the newest version.

(1) parallel

I would recommend using the BiocParallel or doParallel packages instead of Rmpi...

The package now uses bpiterate (togetehr with SerialParam and MulticoreParam) from the BiocParallel package.

(2) zzz.R

It is not good practice to set global variables. Please explain why you think you need this and I will suggest an alternative.

The global variables are removed.

(3) Smaller data for examples.

The functions need to be evaluated in either the man pages or vignette...

A new small bam file is added to /inst/extdata that is used in the examples in manuals (man/interest.Rd and man/interest.sequential.Rd). They run in a reasonable time.

(4) Code inefficiencies

The code needs to be streamlined and in many cases it looks like you are reinventing the wheel by not reusing the existing infrastructure as it is intended. I will walk through interest() as an example but it is necessary to make these changes in all functions. interest():

This function is trying to do too much. bamPreprocess() should read in the data but I don't understand why it needs to write out files. If it does, it should write to a location specified by the user. A reasonably clear workflow would consists of these steps... is it necessary that interestSummarise be a separate step or could it be combined with interestAnalyse?

The bamPreprocess() is removed. The system is now based on bpiterate() function from the BiocParallel package. No temp files are written anymore and manual controls are no longer performed to determine what parallel core takes care of which chunk of the reads in the input .bam file. The parallel process iterates through various read chunks from the input bam file automatically and analyzes them either on the multiple processes (simultaneously) or a single processing core. The output of the interest and interest.sequential functions are written in a text file (defined by the user) and they can be returned as objects of SummarizedExperiment too. Please note that the user only runs interest() or interest.sequential(); the other functions used in these 2 functions, e.g.interestSummarise() and interestAnalyse() are invisible to the user (they are not exported) and they are only used for modular and more organized coding and modifying of the codes.

bamPreprocess():

(i) reassignment This reassignment of functions in Rsamtools to variables of the exactly same name is poor practice (NAMESPACE clash) and not necessary. Please remove the reassignments and use the functions/objects directly from Rsamtools.

BamFile<- Rsamtools::BamFile ScanBamParam<- Rsamtools::ScanBamParam ...

The reassignments are removed.

(ii) hard coded paths This hard-coded approach to writing files is not cross-platform compatible and must be removed. Use a path specified by the user (default) or tmpdir() and tmpfile(). dir.create(paste(outFolder,"read1/", sep="/")) ...

Hard codes for writing files are removed and no temproray file is written any more.

(iii) scanBam vs readGAlignment* functions Why are you using scanBam() instead of readGAlighments(), readGAlignmentPairs() or readGAlignmentsList()? You are re-inventing the wheel in many aspects here.

It looks as though you expect both paired and unpaired data in one file ... ? The readGAlignmentsList() reads in all records with a metadata column on the object ('mate_status') indicating if a read is paired or not. Please look at ?readGAlignments man page. I think you should reuse one of these functions and not scanBam().

The scanBam() is not used any more; the readGAlignmentsList() is used for reading single mapped reads (i.e. reads that are mapped but their pair did not map) and readGAlignmentPairs() for reading paired reads together. The analysis on the single mapped reads is a bit different from the analysis on the paired reads (e.g. the paired reads mapped to an exon should be counted only once).

(iv) create param once Create the ScanBamParam once, outside the while() loop...

ScanBamParam() is called outside the iteration.

interestAnalyse()):

(i) Please use BiocParallel or doParallel instead of Rmpi.

BiocParallel is used instead of Rmpi.

(ii) Remove these reassignments:

subjectHits<- S4Vectors::subjectHits...

Reassignments are removed.

(ii) Many of your lapply's don't need to be loops. Operations on the CIGAR can be vectorized as described here: http://www.bioconductor.org/developers/how-to/efficient-code/

Loops are removed and the process are vectorized using the lapply() and cumsum() functions in base R.

InterestResult object

(i) Please use the SummarizedExperiment class as the output instead of creating your own S4 InterestResult.

The S4 class InterestResult is removed and SummarizedExperiment is used instead.

Finally, please also note that I have now suggested the data package used in the vignettes, i.e. MDS.Chr22.U12Genes . I added it as an AdditionalPackage on 15 Dec, 2016 https://github.com/gacatag/MDS.Chr22.U12Genes .

Thank you again!

Ali

bioc-issue-bot commented 7 years ago

Received a valid push; starting a build. Commits are:

f702f35 Support single (unpaired) read sequence mapped bam...

bioc-issue-bot commented 7 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the following build report for more details:

http://bioconductor.org/spb_reports/IntEREst_buildreport_20170208075309.html