databio / GenomicDistributions

Calculate and plot distributions of genomic ranges
http://code.databio.org/GenomicDistributions

Build reference data examples require too large of downloads. #175

Closed nsheff closed 2 years ago

nsheff commented 2 years ago

When running:

devtools::run_examples(...)

I get:

> CElegansUrl = "http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz"

> CElegansChromSizes = getChromSizesFromFasta(CElegansUrl)
File will be saved in: /tmp/RtmpeMhR9D/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
trying URL 'http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz'
Content type 'application/x-gzip' length 30316631 bytes (28.9 MB)
==
downloaded 1.7 MB

Error in download.file(url = source, destfile = destFile) : 
  download from 'http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz' failed
In addition: Warning messages:
1: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead. 
2: In download.file(url = source, destfile = destFile) :
  downloaded length 1808336 != reported length 30316631
3: In download.file(url = source, destfile = destFile) :
  URL 'http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz': Timeout of 60 seconds was reached

I don't think it's appropriate for examples to download external files like this, especially when the download may take more than 60 seconds. The problem is that it makes us dependent on 1) a network connection and 2) the stability of a third-party server.

Can you either refactor these examples to use local files, or mark them as don't run so they don't cause this kind of error?
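A minimal sketch of the "local file" option, assuming a tiny FASTA is enough to demonstrate the function. The sequences and the `fastaPath` name are made up for illustration, and the lengths are computed with base R as a stand-in. Whether `getChromSizesFromFasta()` accepts a local path is an assumption here, so that call is left commented out.

```r
# Write a tiny FASTA to a temporary file so the example needs no network access.
# The sequences are invented for illustration only.
fastaPath <- tempfile(fileext = ".fa")
writeLines(c(">chrI", "ACGTACGTAC", "GGCC",
             ">chrII", "TTTTAAAA"), fastaPath)

# Compute per-sequence lengths with base R (stand-in for the package call).
fastaLines <- readLines(fastaPath)
isHeader <- startsWith(fastaLines, ">")
seqId <- cumsum(isHeader)[!isHeader]      # which record each sequence line belongs to
seqLengths <- tapply(nchar(fastaLines[!isHeader]), seqId, sum)
chromSizes <- setNames(as.integer(seqLengths),
                       sub("^>", "", fastaLines[isHeader]))
print(chromSizes)

# Assumption: getChromSizesFromFasta() also accepts a local file path.
# chromSizes <- GenomicDistributions::getChromSizesFromFasta(fastaPath)
```

An example written this way runs in well under a second and cannot fail because of a timeout or a partial transfer.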

nsheff commented 2 years ago

Here's another error I get when running it sometimes:

> CElegansUrl = "http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz"

> CElegansChromSizes = getChromSizesFromFasta(CElegansUrl)
File will be saved in: /tmp/RtmpN2Yk7c/working_dir/RtmpDeubMI/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz
trying URL 'http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz'
Content type 'application/x-gzip' length 30316631 bytes (28.9 MB)

downloaded 4.1 MB

Warning in download.file(url = source, destfile = destFile) :
  downloaded length 4258647 != reported length 30316631
Warning in download.file(url = source, destfile = destFile) :
  URL 'http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz': status was 'Transferred a partial file'
Error in download.file(url = source, destfile = destFile) :
  download from 'http://ftp.ensembl.org/pub/release-103/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz' failed
Calls: getChromSizesFromFasta -> retrieveFile -> download.file
Execution halted

kkupkova commented 2 years ago

I marked those as don't-run, but then BiocCheck gives the following error:

* Checking man page documentation...
    * ERROR: At least 80% of man pages documenting exported objects must have runnable examples. The following
      pages do not:
      binBSGenome.Rd, calcDinuclFreq.Rd, calcDinuclFreqRef.Rd, calcGCContent.Rd, calcGCContentRef.Rd,
      getChromSizesFromFasta.Rd, getGeneModelsFromGTF.Rd, getTssFromGTF.Rd, loadBSgenome.Rd, loadEnsDb.Rd,
      retrieveFile.Rd
    * NOTE: Usage of dontrun{} / donttest{} found in man page examples.
      19% of man pages use one of these cases.
      Found in the following files:
        binBSGenome.Rd
        calcDinuclFreq.Rd
        calcDinuclFreqRef.Rd
        calcGCContent.Rd
        calcGCContentRef.Rd
        getChromSizesFromFasta.Rd
        getGeneModelsFromGTF.Rd
        getTssFromGTF.Rd
        loadBSgenome.Rd
        loadEnsDb.Rd
        plotDinuclFreq.Rd
        retrieveFile.Rd
    * NOTE: Use donttest{} instead of dontrun{}.
      Found in the following files:
        binBSGenome.Rd
        calcDinuclFreq.Rd
        calcDinuclFreqRef.Rd
        calcGCContent.Rd
        calcGCContentRef.Rd
        getChromSizesFromFasta.Rd
        getGeneModelsFromGTF.Rd
        getTssFromGTF.Rd
        loadBSgenome.Rd
        loadEnsDb.Rd
        plotDinuclFreq.Rd
        retrieveFile.Rd

Maybe we can add small GTF and FASTA files to extdata (but we are already getting warnings about the size of the package).
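One way to weigh the size concern is to cut an annotation down to a single chromosome, gzip it, and check what it costs on disk. The sketch below uses three invented GTF-style lines in place of a real annotation; `gtfLines`, `keep`, and `outPath` are hypothetical names, and a real file would come from `readLines()` on the full GTF.

```r
# Hypothetical GTF-style lines standing in for a full annotation file.
gtfLines <- c(
  "I\tWormBase\tgene\t100\t900\t.\t+\t.\tgene_id \"g1\";",
  "I\tWormBase\texon\t100\t400\t.\t+\t.\tgene_id \"g1\";",
  "II\tWormBase\tgene\t50\t700\t.\t-\t.\tgene_id \"g2\";"
)

# Keep only the records on one chromosome (first tab-separated field).
chrom <- vapply(strsplit(gtfLines, "\t", fixed = TRUE), `[`, character(1), 1)
keep <- gtfLines[chrom == "I"]

# Write the subset gzip-compressed and report its size in bytes.
outPath <- tempfile(fileext = ".gtf.gz")
con <- gzfile(outPath, "w")
writeLines(keep, con)
close(con)
print(file.size(outPath))

# readLines() detects gzip compression, so the file round-trips directly.
roundTrip <- readLines(outPath)
```

If a subset like this stays in the kilobyte range, it should add little to the package size while letting the affected man pages keep runnable examples.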