diegozea / MIToS.jl

A Julia package to analyze protein sequences, structures, and evolutionary information
https://diegozea.github.io/MIToS.jl/stable/

Add progress bar to SplitStockholm #41

Closed ellisvalentiner closed 7 years ago

ellisvalentiner commented 7 years ago

This pull request is to address #40

Finding the file size was a little tricky because seekend(file) is not supported by GZip.jl. However, the trick on lines 48-50 seems to work and does not seem too slow.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 89.078% when pulling dcb132810bd73442941c203879f81bacd5276bcd on ellisvalentiner:add-progress-bar into f463f9c51207785e81fd4be5b984a454fff52e9a on diegozea:master.

codecov-io commented 7 years ago

Codecov Report

Merging #41 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #41   +/-   ##
=======================================
  Coverage   89.07%   89.07%           
=======================================
  Files          42       42           
  Lines        2353     2353           
=======================================
  Hits         2096     2096           
  Misses        257      257

diegozea commented 7 years ago

Thank you!

In fact, the trick is too slow: it takes ~35 minutes for "Pfam-A.full.gz" (from the Pfam FTP).

julia> const infh = GZip.open("Pfam-A.full.gz")
GZipStream(Pfam-A.full.gz)

julia> @time  maximum.(eachline(infh))
2088.940729 seconds (681.68 M allocations: 177.735 GiB, 1.53% gc time)
"Z_TACVF/36-85             GRYNCKCCWFADKNLITCSDHYLCLRCHQIMLRNSELCNICWKPLPTSIR"

The following takes ~27 minutes instead:

julia> @time while !eof(infh)
                   read(infh, UInt8)
             end
1633.974568 seconds

Could it be useful to use filesize instead?

julia> fileSize = position(infh)
41563568750

shell> gzip --list  Pfam-A.full.gz
         compressed        uncompressed  ratio uncompressed_name
         5985073393          2908863086 -105.8% Pfam-A.full

julia> filesize("Pfam-A.full.gz")
5985073393
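
A note on the numbers above: the uncompressed size reported by gzip --list is smaller than the compressed size, which looks wrong. This is consistent with the gzip format storing the uncompressed size in a 32-bit field, that is, modulo 2^32 (an assumption based on the gzip file format, not something stated in this thread). A minimal check using only the values from the transcript; the variable names are illustrative:

```julia
# Sizes copied from the transcript above.
uncompressed_bytes = 41563568750   # position(infh) after reading the whole stream
compressed_bytes   = 5985073393    # filesize("Pfam-A.full.gz")

# Assumption: gzip keeps the uncompressed size in a 32-bit field, so the
# value shown by `gzip --list` is the true size modulo 2^32.
reported = mod(uncompressed_bytes, Int64(2)^32)
println(reported)   # prints 2908863086, the value shown by gzip --list

# Compression ratio computed from the wrapped value, in percent.
ratio = 100 * (1 - compressed_bytes / reported)
println(ratio)      # about -105.8, matching the ratio column
```

If that assumption holds, gzip --list cannot be trusted for the uncompressed size of files this large, while filesize on the compressed file is cheap and exact.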

I think that using ProgressThresh with filesize could be a good idea. We can keep a running sum of the sizes of the generated files and pass it to update! to advance the progress bar. It should reach 100% when the total output size is close to the input file size. What do you think?
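
The bookkeeping behind this idea can be sketched with Base functions only. This is a rough illustration, not the code in the PR: SizeTracker, record! and fraction are hypothetical names, and the call that would feed the running total to a progress meter is only indicated in a comment.

```julia
# Hypothetical helper: accumulate the sizes of the files written so far and
# compare the running total with the size of the input file.
mutable struct SizeTracker
    total::Int     # filesize of the input, in bytes
    written::Int   # running sum of output file sizes, in bytes
end

SizeTracker(total::Integer) = SizeTracker(total, 0)

# Add the size of one generated file to the running total.
function record!(t::SizeTracker, path::AbstractString)
    t.written += filesize(path)
    return t.written
end

# Fraction completed, clamped because the output total may exceed the input size.
fraction(t::SizeTracker) = min(t.written / t.total, 1.0)

# Small demonstration with temporary files standing in for the real data.
function demo()
    mktempdir() do dir
        input = joinpath(dir, "input.txt")
        write(input, "x"^100)
        tracker = SizeTracker(filesize(input))
        for (i, n) in enumerate((40, 60))
            out = joinpath(dir, "part$i.txt")
            write(out, "x"^n)
            record!(tracker, out)   # a progress meter would be updated here
        end
        fraction(tracker)
    end
end

println(demo())
```

The clamp in fraction reflects the caveat in the comment above: the output total is only expected to be close to the input size, so the estimate is approximate.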

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 89.078% when pulling 879b50283cd13e8fad310cc578c21a97a9a80be8 on ellisvalentiner:add-progress-bar into f463f9c51207785e81fd4be5b984a454fff52e9a on diegozea:master.

ellisvalentiner commented 7 years ago

@diegozea I took your suggestion and updated the PR. I don't have "Pfam-A.full.gz", so I would appreciate it if you could help me test this.

diegozea commented 7 years ago

Hi @ellisvalentiner! The script gives me multiple errors. You can download the file from: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 89.078% when pulling dd2419b0508c69e525a85d5d366042e982f4e63e on ellisvalentiner:add-progress-bar into f463f9c51207785e81fd4be5b984a454fff52e9a on diegozea:master.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 89.078% when pulling f53d05d769231380c205a0c14acfbd32fdf46e22 on ellisvalentiner:add-progress-bar into f463f9c51207785e81fd4be5b984a454fff52e9a on diegozea:master.

ellisvalentiner commented 7 years ago

@diegozea Sorry for all of the syntax errors (not sure what I was thinking).

I have been testing with:

julia ~/.julia/v0.6/MIToS/scripts/SplitStockholm.jl Pfam-A.full.gz --path=tmp --progress

The progress meter does not appear if I use the for-loop, but it does with the while-loop. I'm not sure why, because the example does use a for-loop. I also do not see the files being written with the for-loop, but I do with the while-loop.

ProgressThresh does not display when the value is below the threshold, so I switched back to Progress. Unfortunately, the ETA is not very accurate because the file sizes vary.

diegozea commented 7 years ago

Thanks @ellisvalentiner! I like it! I propose only two minor changes. One shows the progress bar by default. The other fixes a problem with the filenames that is unrelated to the progress bar.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 89.078% when pulling 6fd846da112a5f52b55fdbfdd74e2655ff63629b on ellisvalentiner:add-progress-bar into f463f9c51207785e81fd4be5b984a454fff52e9a on diegozea:master.

coveralls commented 7 years ago

Coverage Status

Coverage remained the same at 89.078% when pulling 9621d67f378e816b22bd25b755a2c291bc2cc364 on ellisvalentiner:add-progress-bar into f463f9c51207785e81fd4be5b984a454fff52e9a on diegozea:master.

ellisvalentiner commented 7 years ago

@diegozea I addressed those minor changes. Any other suggestions?

diegozea commented 7 years ago

Thank you very much! It works perfectly!