Closed ellisvalentiner closed 7 years ago
Merging #41 into master will not change coverage. The diff coverage is n/a.
@@ Coverage Diff @@
## master #41 +/- ##
=======================================
Coverage 89.07% 89.07%
=======================================
Files 42 42
Lines 2353 2353
=======================================
Hits 2096 2096
Misses 257 257
Thank you!
The trick is in fact too slow. It takes ~35 minutes for "Pfam-A.full.gz" from the Pfam FTP.
julia> const infh = GZip.open("Pfam-A.full.gz")
GZipStream(Pfam-A.full.gz)
julia> @time maximum.(eachline(infh))
2088.940729 seconds (681.68 M allocations: 177.735 GiB, 1.53% gc time)
"Z_TACVF/36-85 GRYNCKCCWFADKNLITCSDHYLCLRCHQIMLRNSELCNICWKPLPTSIR"
The following takes ~27 minutes instead:
julia> @time while !eof(infh)
           read(infh, UInt8)
       end
1633.974568 seconds
Could it be useful to use filesize instead?
julia> fileSize = position(infh)
41563568750
shell> gzip --list Pfam-A.full.gz
compressed uncompressed ratio uncompressed_name
5985073393 2908863086 -105.8% Pfam-A.full
julia> filesize("Pfam-A.full.gz")
5985073393
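A note on the gzip --list figure above: the gzip format stores the uncompressed size modulo 2^32 in its trailer (the ISIZE field in RFC 1952), so the reported size wraps around for files over 4 GiB. A quick check, using only the numbers from this thread, suggests that is what happens here:

```julia
# Numbers copied from the session above.
uncompressed = 41563568750   # position(infh) after reading to the end
compressed   = 5985073393    # filesize("Pfam-A.full.gz")

# gzip keeps only the low 32 bits of the uncompressed size.
isize = uncompressed % 2^32

# gzip --list reports the ratio as (uncompressed - compressed) / uncompressed,
# computed from the wrapped size, which is why it comes out negative.
ratio = round(100 * (isize - compressed) / isize, digits = 1)

println(isize)   # 2908863086, the "uncompressed" column of gzip --list
println(ratio)   # -105.8
```

So gzip --list cannot be relied on for the real uncompressed size of this file, while filesize on the compressed file is exact and cheap.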
I think that using ProgressThresh with filesize could be a good idea. We can keep the sum of the output file sizes (from filesize). The progress bar can be updated (with update!) using the sum of the sizes of the generated files. It should reach 100% when the total output size is close to the input file size. What do you think?
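A minimal sketch of that idea, assuming ProgressMeter.jl is installed and that the output files are written compressed, so their sizes are comparable with the input's. It uses a plain Progress bar driven by a byte count; track_output_sizes and its arguments are hypothetical names, not part of the script:

```julia
using ProgressMeter  # assumed to be installed

# Hypothetical helper: advance a progress bar by the cumulative size of the
# output files, using the size of the input file as the total.
function track_output_sizes(input_path::AbstractString, output_paths)
    total = filesize(input_path)   # bytes on disk, no decompression needed
    p = Progress(total)
    written = 0
    for path in output_paths
        written += filesize(path)
        # Clamp so the counter never exceeds the total if the outputs
        # end up slightly larger than the input.
        update!(p, min(written, total))
    end
    return written
end
```

In the real script the update would happen right after each output file is closed, rather than in a separate pass over the finished files.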
@diegozea I took your suggestion and updated the PR. I don't have "Pfam-A.full.gz", so I would appreciate it if you could help me test this.
Hi @ellisvalentiner! The script gives me multiple errors. You can download the file from: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.full.gz
@diegozea sorry for all of the syntax errors (not sure what I was thinking).
I have been testing using julia ~/.julia/v0.6/MIToS/scripts/SplitStockholm.jl Pfam-A.full.gz --path=tmp --progress
The progress meter does not appear if I use the for-loop but does appear with the while-loop. I'm not sure why, because the example does use a for-loop. I also do not see the files being written with the for-loop, but I do with the while-loop.
ProgressThresh does not display when the value is below the threshold. Therefore I switched back to Progress. Unfortunately the ETA is not very accurate because the file sizes vary.
Thanks @ellisvalentiner! I like it! I propose only two minor changes. One makes the progress bar show by default. The other solves a problem with the filenames that is unrelated to the progress bar.
@diegozea I addressed those minor changes. Any other suggestions?
Thank you very much! It works perfectly!
This pull request is to address #40.
It was actually a little tricky to find a way to determine the file size because seekend(file) is not supported by GZip.jl. However, the trick on lines 48-50 seems to work and does not seem too slow.