Bioconductor / Contributions

Contribute Packages to Bioconductor
133 stars 33 forks source link

(inactive) SpiderSeqR #1464

Closed ecballium closed 4 years ago

ecballium commented 4 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 4 years ago

Hi @ecballium

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: SpiderSeqR
Version: 0.99.0
Title: A Tool for Integration of Big Bio Data
Authors@R: c(person("A.", "Sozanska", role = c("aut", "cre"), email = "as2382@cam.ac.uk"), person("C.", "Fletcher", role = "aut"),person("S.", "Samarajiwa", role = "aut", email = "ss861@cam.ac.uk"))
Description: Integration of sequencing metadata from SRA and GEO for automatic search and information retrieval.
License: Artistic-2.0
Encoding: UTF-8
LazyData: true
Suggests: 
    knitr,
    rmarkdown,
    testthat (>= 2.1.0)
VignetteBuilder: knitr
biocViews: Infrastructure, Sequencing, Microarray
Depends:
    R (>= 3.6)
Imports:
    plyr (>= 1.8),
    dplyr (>= 0.7),
    DBI (>= 0.7),
    RSQLite (>= 2.0),
    methods,
    magrittr,
    tidyr,
    crayon,
    cli,
    SRAdb (>= 1.40),
    GEOmetadb (>= 1.40),
    utils,
    stats
RoxygenNote: 6.1.1

Add SSH keys to your GitHub account. SSH keys will are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.

bioc-issue-bot commented 4 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

a96ace8 2020-04-03 version bump

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

db2658d 2020-04-03 changed method for saving data

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

daa1694 2020-04-03 reverted back to use_data()

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

ecballium commented 4 years ago

Dear @hpages

I currently have the following warning:

This is a result of relying on usethis::use_data(), which requires R version dependency to be specified. I tried going around this problem by using save() for my data, but it led to similar warnings. I hope that this is ok? Otherwise I would be very grateful for any suggestions on what to change.

Looking forward to your comments about SpiderSeqR!

hpages commented 4 years ago
> startSpiderSeqR("startSpiderSeqR_db")
────────────────────────────────────────────────────────────────────────────────
Welcome to SpiderSeqR!!!
Please bear with us while we download and prepare database files...
NOTE: this might take some time and will require about 100GB of disc space
────────────────────────────────────────────────────────────────────────────────
...Setting up the database connections...
Location of database files: /home/hpages/Desktop/pkgreviews/SpiderSeqR.review/startSpiderSeqR_db
Using the following expiry dates for databases 
(max. number of days since file creation date):
SRA: 90 days
GEO: 90 days
SRR_GSM: 90 days
────────────────────────────────────────────────────────────────────────────────
The file SRAmetadb.sqlite was not found in the current working directory
Would you like to download the file now?

1: yes
2: no

Selection: 

mmh... how a package that requires to download 100GB of data can possibly be a good idea? It doesn't really matter if one can do amazing things after that, nobody will do it.

ecballium commented 4 years ago

Dear Hervé (@hpages),

The total disk space usage of the SpiderSeqR package with all prerequisites installed is currently 41.7 GB. This includes the SRAdb and GEOmetadb database files as well as a smaller (100 MB) conversion table generated by SpiderSeqR. Both prerequisite databases are updated regularly and their size is growing over time, hence the conservative estimate/warning of 100 GB.

These two database files constitute 41.6 of the 41.7 GB and are essential for the operation of two well-established Bioconductor packages (SRAdb and GEOmetadb). At 473 and 235 monthly downloads respectively, SRAdb and GEOmetadb currently position themselves as the 202nd and the 303rd most downloaded packages (out of 2118 Bioconductor packages), which falls well into the top 20%. Moreover, users who might be interested in using SpiderSeqR are likely to already have these packages installed.

I hope that you will agree with me that downloading the database files is the best way to integrate the two databases and expand the functionality of the two existing packages.

hpages commented 4 years ago

OK so it seems that your "NOTE: this might take some time and will require about 100GB of disk space" is really counter productive here. It doesn't tell the most important thing that most people want to know before they start a download which is the size of the download. The files to download are actually compressed and much smaller than what they will occupy on disk after uncompression. How would a first time user know that? It maybe interesting to know that 100GB of disk are going to be required (even if it's not really true, according to your clarifications) but it's even more important to know that this doesn't mean that the size of the download is 100GB!

Also you display this NOTE unconditionally, that is, even if the user already has the files in the directory specified in the call to startSpiderSeqR(), which doesn't make sense.

Another problem with startSpiderSeqR() is that it's no longer able to find the files if it was previously called with the wrong location:

> library(SpiderSeqR)
> dir(pattern="*.sqlite")
[1] "GEOmetadb.sqlite" "SRAmetadb.sqlite"

> startSpiderSeqR(dir=tempdir())  # wrong location
...
1: yes
2: no

Selection: 2
Error in startSpiderSeqR(dir = tempdir()) : 
  SRAmetadb.sqlite file is necessary for the functioning of the package

> startSpiderSeqR(dir=".")  # correct location
────────────────────────────────────────────────────────────────────────────────
Welcome to SpiderSeqR!!!
Please bear with us while we download and prepare database files...
NOTE: this might take some time and will require about 100GB of disc space
────────────────────────────────────────────────────────────────────────────────
...Setting up the database connections...
Location of database files: /tmp/RtmplV6bLZ
Using the following expiry dates for databases 
(max. number of days since file creation date):
SRA: 90 days
GEO: 90 days
SRR_GSM: 90 days
────────────────────────────────────────────────────────────────────────────────
The file SRAmetadb.sqlite was not found in the current working directory
Would you like to download the file now?

1: yes
2: no

Selection: 

Finally note that the name of the function is misspelled in the vignette (startSpideR instead of startSpiderSeqR).

The startSpiderSeqR() step is such an important and sensitive step in your workflow that you really want to make sure that it does the right thing and displays the right thing. This will avoid potential users to start on the wrong foot with the package.

Thanks, H.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

23567eb 2020-04-23 version bump

ecballium commented 4 years ago

Dear Hervé (@hpages),

I have made the following changes:

I have also added a tidier progress bar for creation of the conversion database.

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

4213277 2020-04-23 R version dependency amended

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

hpages commented 4 years ago

enabled searching for database files within the specified directory and a few levels above

That is insane! Now startSpiderSeqR() is scanning my entire hard drive trying to find the files. Do you realize that on a server this means that it will also try to scan other user homes and/or other sensible places where it has no business? In some paranoid institutions this kind of activity might even be detected by the security people and get me into trouble.

Anyway, why on earth would it do that if I asked it to look in a specific directory (e.g. in /my/files). If I wanted it to start searching 2 levels above then I would have specified / instead.

But the worst part of this new feature is that startSpiderSeqR() starts searching the files 2 levels above the specified directory EVEN if the files are in the specified directory:

> library(SpiderSeqR)
> dir(pattern="*.sqlite")
[1] "GEOmetadb.sqlite" "SRAmetadb.sqlite"
> startSpiderSeqR(".")
────────────────────────────────────────────────────────────────────────────────
Welcome to SpiderSeqR!!!
Please wait while the database files and connections are being configured...
────────────────────────────────────────────────────────────────────────────────
Searching for database files in: 
/home/biocbuild/sandbox
Some files not found in the specified directory
────────────────────────────────────────────────────────────────────────────────
Expanding the search to within:
. and 2 directory levels above

So after 5 minutes it's still running and it might take hours before it finds the files, which is ridiculous! I actually fear the worst, which is that, after a few hours, it will find old stale SRAmetadb.sqlite or GEOmetadb.sqlite files laying around that I certainly don't trust and don't want it to use. Anyway I don't have time to wait and see what's going to happen so I just did CTRL+C.

Sorry but you need to remove this feature. It's a terrible idea.

Thanks, H.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

e30a926 2020-04-23 disabled extended search in startSpider...

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ABNORMAL". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

ecballium commented 4 years ago

Dear Hervé (@hpages),

The feature was meant to ensure that the SpiderSeqR package is easier to setup. However, I understand that the 'extended directory search' may attempt to access sensitive data whilst also taking excessively long to complete. I have disabled this functionality such that SpiderSeqR will now only attempt to locate files under the specified path and its subdirectories.

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

lshep commented 4 years ago

Please do a version bump to ensure we have a build report for the latest version of package updates.

bioc-issue-bot commented 4 years ago

Received a valid push; starting a build. Commits are:

e120a26 2020-05-08 version bump and minor documentation ch...

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

ecballium commented 4 years ago

Dear @lshep, @hpages

The version is now bumped up. The package is unchanged except for minor modifications to the README file and the vignette.

hpages commented 4 years ago

Thanks @ecballium for the improvements to the startSpiderSeqR() function. There are still many problems with it:

1 The bug I reported earlier about the function no longer being able to find the files if it was previously called with the wrong location is still present.

2 I still don't understand why the function displays this

NOTE: The total download size of all three files is about 10-15 GB (compressed)

given that:

Or am I missing something?

3 The function fails for me:

> startSpiderSeqR("SpiderSeqR_dbs")
────────────────────────────────────────────────────────────────────────────────
Welcome to SpiderSeqR!!!
Please wait while the database files and connections are being configured...
────────────────────────────────────────────────────────────────────────────────
Searching for database files within: 
/home/hpages/sandbox/SpiderSeqR_dbs
The required files could not be found (SRAmetadb.sqlite, GEOmetadb.sqlite, SRR_GSM.sqlite)
You will shortly be prompted to download/generate the missing files
────────────────────────────────────────────────────────────────────────────────
NOTE: The total download size of all three files is about 10-15 GB (compressed)
requiring about 40-50 GB disc space after extraction 
(these numbers may change as the databases keep growing)
────────────────────────────────────────────────────────────────────────────────
The file SRAmetadb.sqlite was not found in the specified directories
Would you like to download the file now?

1: yes
2: no

Selection: 1
Downloading the file
Setting options('download.file.method.GEOquery'='auto')
Setting options('GEOquery.inmemory.gpl'=FALSE)
trying URL 'https://s3.amazonaws.com/starbuck1/sradb/SRAmetadb.sqlite.gz'
Content type 'binary/octet-stream' length 1748812493 bytes (1667.8 MB)
==================================================
downloaded 1667.8 MB

Unzipping...

Error in dbConnect(SQLite(), unzippedlocalfile) : 
  could not find function "dbConnect"

> sessionInfo()
R version 4.0.0 RC (2020-04-19 r78255)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] SpiderSeqR_0.99.7

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4.6        xml2_1.3.2          magrittr_1.5       
 [4] BiocGenerics_0.35.2 hms_0.5.3           tidyselect_1.1.0   
 [7] R6_2.4.1            rlang_0.4.6         GEOquery_2.57.0    
[10] fansi_0.4.1         dplyr_0.8.5         tools_4.0.0        
[13] parallel_4.0.0      Biobase_2.49.0      cli_2.0.2          
[16] SRAdb_1.51.0        ellipsis_0.3.1      assertthat_0.2.1   
[19] tibble_3.0.1        lifecycle_0.2.0     crayon_1.3.4       
[22] purrr_0.3.4         readr_1.3.1         tidyr_1.1.0        
[25] vctrs_0.3.0         bitops_1.0-6        RCurl_1.98-1.2     
[28] glue_1.4.1          limma_3.45.0        compiler_4.0.0     
[31] pillar_1.4.4        pkgconfig_2.0.3    

This is the kind of error that makes me wonder if anybody has ever used this software for real.

4 Also during the creation of the "cutstom database" my screen gets filled with hundreds of lines:

Would you like to create a cutstom database for converting between GEO and SRA? 
This might take a little while, but it is necessary for the correct functioning 
of the package.

1: yes
2: no

Selection: 1
Database will be created shortly
Please wait, creating the custom database...
[|||                                                                           ]
[||||                                                                          ]
[||||                                                                          ]
[||||                                                                          ]
...
...
[hundreds of lines later]
[||||||||||||||||||||                                                          ]
[||||||||||||||||||||                                                          ]
[||||||||||||||||||||                                                          ]
[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]  
────────────────────────────────────────────────────────────────────────────────
Further information on SRAmetadb.sqlite database:
c("schema version", "creation timestamp")
 c("1.0", "2020-05-18 00:29:46")
────────────────────────────────────────────────────────────────────────────────
Further information on GEOmetadb.sqlite database:
c("schema version", "creation timestamp")
 c("1.0", "2020-05-14 22:48:52")
────────────────────────────────────────────────────────────────────────────────
SpiderSeqR setup complete

Looks like you are trying to reinvent the wheel with your .progressBar() function instead of reusing known well-tested solutions like txtProgressBar() from the utils package. And not surprisingly you end up with clumsy under-tested code like anybody who tried to reinvent the wheel before you.

I looked at startSpiderSeqR()'s code and it ain't pretty. As reported by R CMD BiocCheck this is a big 593-line function that performs many different tasks. This is not how you are supposed to write good code. The function needs to be broken down in small independent tasks and each task needs to be implemented in its own separate helper function. With the final goal that the body of startSpiderSeqR() will be reduced to less than 50 lines and will be basically performing each task by calling the corresponding helper. The enormous benefit you will get out of this is that each helper function will now be a small function focused on one simple task so will be much easier to test, debug, and maintain in the long run.

Hope this helps, H.

bioc-issue-bot commented 4 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

lshep commented 4 years ago

Do you plan on updating your package to @hpages comments above? We expect packages to be actively worked on and updated within 2-3 weeks. If you need more time to update your package we can mark the review as inactive and close until you have more time.

ecballium commented 4 years ago

Dear @lshep, Apologies for the delay in the response. I am just at the stage of finalising the changes and will commit soon. Hope that is ok!

lshep commented 4 years ago

We will close this issue for now for inactivity. When you are ready to submit changes and move forward with the review process please leave a message here and tag us in it and we can reopen the issue for review. We look forward to having your package in Bioconductor.

bioc-issue-bot commented 4 years ago

This issue is being closed because there has been no progress for an extended period of time. You may reopen the issue when you have the time to actively participate in the review / submission process. Please also keep in mind that a package accepted to Bioconductor requires a commitment on your part to ongoing maintenance.

Thank you for interest in Bioconductor.