JuliaAstro / FITSIO.jl

Flexible Image Transport System (FITS) file support for Julia
http://juliaastro.org/FITSIO.jl/
MIT License

URL input #123

Closed mileslucas closed 4 years ago

mileslucas commented 4 years ago

Can we stream data in via a URL?

e.g.

using FITSIO
url = "https://dr14.sdss.org/optical/spectrum/view/data/format=fits/spec=lite?plateid=1323&mjd=52797&fiberid=12"
f = FITS(url)

If we can't stream it, can we at least make the call above behave like this?

using FITSIO
url = "https://dr14.sdss.org/optical/spectrum/view/data/format=fits/spec=lite?plateid=1323&mjd=52797&fiberid=12"
# This happens behind the scenes when user calls FITS(url)
f = FITS(download(url))
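
A minimal sketch of the "download behind the scenes" idea, assuming Julia 1.6 or later (for the `Downloads` standard library). The helper name `local_fits_path` and its `downloader` keyword are hypothetical and not part of FITSIO.jl; the sketch only resolves a source string to a local path and leaves opening the file to `FITS`.

```julia
using Downloads  # standard library in Julia 1.6 and later

# Hypothetical helper (not part of FITSIO.jl): return a local path for `src`.
# Remote sources are fetched to a temporary file first; anything else is
# assumed to already be a local path and is returned unchanged.
function local_fits_path(src::AbstractString; downloader = Downloads.download)
    is_remote = occursin(r"^(https?|ftp)://"i, src)
    return is_remote ? downloader(src) : src
end

# Intended use, assuming FITSIO is installed:
#   using FITSIO
#   f = FITS(local_fits_path(url))
```

Passing the downloader as a keyword keeps the sketch easy to exercise without network access; the temporary file is not cleaned up here.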

kbarbary commented 4 years ago

From the CFITSIO docs:

Files can also be opened over the network using FTP or HTTP protocols by supplying the appropriate URL as the filename. The HTTPS and FTPS protocols are also supported if the CFITSIO build includes the libcurl library. (If the CFITSIO 'configure' script finds a usable libcurl library on your system, it will automatically be included in the build.)

Have you tried an FTP or HTTP address? It might already just work. For HTTPS I'm not sure whether our CFITSIO binary is built with libcurl or not.

giordano commented 4 years ago

> For HTTPS I'm not sure whether our CFITSIO binary is built with libcurl or not.

I don't think it is, as there was not a libcurl builder available, but I prepared one a few days ago: https://github.com/JuliaPackaging/Yggdrasil/pull/94

mileslucas commented 4 years ago

I can confirm this works for an HTTP address, but it is slow:

julia> using FITSIO

julia> url = "http://phoenix.astro.physik.uni-goettingen.de/data/HiResFITS/WAVE_PHOENIX-ACES-AGSS-COND-2011.fits"

julia> @time FITS(url)
44.950477 seconds (11 allocations: 880 bytes)

julia> @time FITS(download(url))
5.762645 seconds (1.61 M allocations: 76.110 MiB, 0.75% gc time)

kbarbary commented 4 years ago

Huh. Hard to think of why it would be so much slower in the first case where CFITSIO is handling the download. Have you tried switching the order you test in, in case it happens to be an artifact of caching on the server side?
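
One way to check for an ordering or caching artifact is to alternate which call runs first. The following is a rough sketch only; `time_alternating` is a hypothetical helper, not something provided by FITSIO.jl, and it simply records wall-clock times with `@elapsed`.

```julia
# Hypothetical helper: time two zero-argument functions over several rounds,
# swapping which one runs first each round, so that neither always benefits
# from going second (e.g. from a warm server-side cache).
function time_alternating(run_a, run_b; rounds::Integer = 3)
    times_a = Float64[]
    times_b = Float64[]
    for i in 1:rounds
        if isodd(i)
            push!(times_a, @elapsed run_a())
            push!(times_b, @elapsed run_b())
        else
            push!(times_b, @elapsed run_b())
            push!(times_a, @elapsed run_a())
        end
    end
    return (a = times_a, b = times_b)
end

# Intended use, assuming FITSIO is installed and `url` is defined:
#   t = time_alternating(() -> FITS(url), () -> FITS(download(url)))
```

If the gap between the two vectors survives the swapped order, server-side caching is unlikely to be the explanation.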

giordano commented 4 years ago

@mileslucas I cannot reproduce your problem. In a fresh Julia session:

julia> using FITSIO

julia> url = "http://phoenix.astro.physik.uni-goettingen.de/data/HiResFITS/WAVE_PHOENIX-ACES-AGSS-COND-2011.fits"
"http://phoenix.astro.physik.uni-goettingen.de/data/HiResFITS/WAVE_PHOENIX-ACES-AGSS-COND-2011.fits"

julia> @time FITS(url)
  7.082159 seconds (77.29 k allocations: 4.147 MiB)
File: http://phoenix.astro.physik.uni-goettingen.de/data/HiResFITS/WAVE_PHOENIX-ACES-AGSS-COND-2011.fits
Mode: "r" (read-only)
HDUs: Num  Name     Type   
      1    PRIMARY  Image  

julia> @time FITS(download(url))
  8.950702 seconds (1.52 M allocations: 71.918 MiB, 0.14% gc time)
File: /tmp/jl_mhdydV
Mode: "r" (read-only)
HDUs: Num  Name     Type   
      1    PRIMARY  Image

Maybe you had some temporary Internet flakiness?

giordano commented 4 years ago

For the record, I have a pull request to build CFITSIO with CURL: https://github.com/JuliaPackaging/Yggdrasil/pull/116. With this, I can open URLs via HTTPS, too:

julia> using FITSIO

julia> @time FITS("https://dr14.sdss.org/optical/spectrum/view/data/format=fits/spec=lite?plateid=1323&mjd=52797&fiberid=12")
  2.901508 seconds (77.29 k allocations: 4.147 MiB)
File: https://dr14.sdss.org/optical/spectrum/view/data/format=fits/spec=lite?plateid=1323&mjd=52797&fiberid=12
Mode: "r" (read-only)
HDUs: Num  Name     Type   
      1             Image  
      2    COADD    Table  
      3    SPECOBJ  Table  
      4    SPZLINE  Table

mileslucas commented 4 years ago

Hm, I still get the problem even with multiple iterations. Accessing the URL directly consistently takes about 2x longer for me.

julia> url = "http://phoenix.astro.physik.uni-goettingen.de/data/HiResFITS/WAVE_PHOENIX-ACES-AGSS-COND-2011.fits";

julia> @time FITS(url);
 67.713316 seconds (11 allocations: 880 bytes)

julia> @time FITS(url);
69.841447 seconds (11 allocations: 880 bytes)

julia> @time FITS(url);
 71.165964 seconds (11 allocations: 880 bytes)

julia> @time FITS(download(url));
 26.493157 seconds (124 allocations: 70.188 KiB)

julia> @time FITS(download(url));
 41.499932 seconds (124 allocations: 70.188 KiB)

julia> @time FITS(download(url));
 25.116675 seconds (124 allocations: 70.188 KiB)

So on average 69.6 s vs 31.0 s.
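
For reference, the averages can be reproduced from the six timings in the session above with a few lines of Julia. The variable names are just illustrative.

```julia
using Statistics  # standard library; provides `mean`

# Wall-clock times in seconds, copied from the session above.
direct     = [67.713316, 69.841447, 71.165964]  # FITS(url)
downloaded = [26.493157, 41.499932, 25.116675]  # FITS(download(url))

mean_direct     = mean(direct)
mean_downloaded = mean(downloaded)
ratio           = mean_direct / mean_downloaded

println("direct:     ", round(mean_direct; digits = 1), " s")
println("downloaded: ", round(mean_downloaded; digits = 1), " s")
println("ratio:      ", round(ratio; digits = 2))
```

With only three samples per case, and the `download` runs ranging from about 25 s to 41 s, these means are a rough indication rather than a careful benchmark.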

giordano commented 4 years ago

It would be interesting if you could produce the equivalent C code. FITSIO.jl doesn't really do anything different from the plain-file case; it just passes the string to cfitsio, whatever it is.

Edit:

julia> using FITSIO

julia> url = "http://phoenix.astro.physik.uni-goettingen.de/data/HiResFITS/WAVE_PHOENIX-ACES-AGSS-COND-2011.fits";

julia> @time FITS(url);
  5.632906 seconds (77.29 k allocations: 4.147 MiB)

julia> @time FITS(download(url));
  3.250751 seconds (1.52 M allocations: 71.921 MiB, 1.99% gc time)

Note that I also get much shorter times than you when using download. I think you have some connection problems fetching that file.

mileslucas commented 4 years ago

Hawai'i internet is probably not helping connecting to the Goettingen servers :)

I've never used cfitsio, only astropy.io and this, so unfortunately I don't know enough to produce the equivalent C code.

giordano commented 4 years ago

> Hawai'i internet is probably not helping connecting to the Goettingen servers :)

The other day you got the file in about 5 seconds with download, today in 25-40 seconds, which is less than 0.4 MiB/s on average. Everything points to connection issues. Maybe try to fetch a FITS file from a US-based server.

I don't think there is anything we can do about this here.

mileslucas commented 4 years ago

Ah, I'm on a different island using wifi right now, but you're right. I'll play around some more and see if I can reproduce my issue. Will close for now.