cxcsds / ciao-contrib

Extra scripts and code to enhance the capabilities of CIAO.
GNU General Public License v3.0

download_chandra_repro and over-writing files #369

Open DougBurke opened 4 years ago

DougBurke commented 4 years ago

I just came across the following, which happened because I had manually unpacked a hand-downloaded set of files from the archive into a directory where I'd already run download_chandra_obsid. This meant that the oif.fits file was different, so there was an attempt to re-download it, which failed because the file had been created with read-only permissions.

Things to think about

% download_chandra_obsid 22916
Downloading files for ObsId 22916, total size is 25 Mb.

  Type     Format      Size  0........H.........1  Download Time Average Rate
  ---------------------------------------------------------------------------
  vv       pdf        10 Mb    already downloaded
  evt2     fits        2 Mb    already downloaded
  asol     fits      922 Kb    already downloaded
  bias     fits      500 Kb    already downloaded
  bias     fits      467 Kb    already downloaded
  bias     fits      446 Kb    already downloaded
  bias     fits      445 Kb    already downloaded
  osol     fits      373 Kb    already downloaded
  osol     fits      367 Kb    already downloaded
  eph1     fits      311 Kb    already downloaded
  eph1     fits      306 Kb    already downloaded
  eph1     fits      288 Kb    already downloaded
  vv       pdf       187 Kb    already downloaded
  mtl      fits      169 Kb    already downloaded
  cntr_img jpg       119 Kb    already downloaded
  stat     fits      104 Kb    already downloaded
  aqual    fits       69 Kb    already downloaded
  full_img jpg        50 Kb    already downloaded
  full_img fits       26 Kb    already downloaded
  cntr_img fits       23 Kb    already downloaded
# download_chandra_obsid (6 February 2020): ERROR Unable to create '22916/oif.fits'
kglotfelty commented 4 years ago

The oif.fits file is special. When you download via chaser, the contents of that file are generated on-the-fly to match the package contents you've selected. The version of the file on the ~ftp site~ public archive is static and matches the standard primary and secondary directories.

So there is an argument to provide a special case for it. Also note: because FITS files are padded to 2880-byte blocks, the contents of oif.fits may actually have changed while the file size remains the same. The only way to really know is to download it and compare CHECKSUM/DATASUM. If it were compressed, it's unlikely the file size would be the same (though interestingly, it could be larger).

[I'm not necessarily :+1: it -- just providing justification should anyone want to take up the task.]
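Since a size comparison can't catch a changed-but-same-size oif.fits, any special case would have to compare contents. A minimal sketch (a hypothetical helper, not anything in download_chandra_obsid) using a digest of the file bytes:

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file's contents.

    Comparing digests (rather than sizes) catches the case where
    oif.fits changed but, thanks to the 2880-byte FITS blocking,
    the file size stayed the same. The local digest would be
    compared against one computed from a fresh download.
    """
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

This still costs a full download to compute the remote digest, which is the point being made above: short of CHECKSUM/DATASUM there's no cheap way to tell.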

Of course then this brings up a difference -- from chaser `oif.fits` always matches the contents of the directories. But via `download_chandra_obsid` a user can select a subset of files so the `oif.fits` it retrieves won't match what's on disk. If we ever were to use `oif.fits` then we'd need to maybe think about bringing back `mkoif`.
DougBurke commented 3 years ago

On a slight tangent, with access now via HTTP(s), we have access to more metadata in the query:

% curl -I https://cxc.cfa.harvard.edu/cdaftp/byobsid/4/17484/axaff17484N001_VV001_vv2.pdf
HTTP/1.1 200 OK
Date: Wed, 18 Nov 2020 03:29:15 GMT
Server: Apache
Last-Modified: Mon, 23 Feb 2015 00:41:48 GMT
ETag: "874d-50fb6aad36b00"
Accept-Ranges: bytes
Content-Length: 34637
Content-Type: application/pdf
Set-Cookie: SERVERID=cxcweb27; path=/
Cache-control: private

If we can believe the Last-Modified date (and/or ETag, but we have no sensible place to stash that) then we could use the date to check (though it still doesn't help us determine whether we have a partial or corrupted download).
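A sketch of what using that metadata might look like (these helpers are hypothetical, not part of the scripts): issue a HEAD request, then compare the Last-Modified header against the local file's mtime.

```python
import email.utils
import urllib.request


def remote_last_modified(url):
    """HEAD the URL and return its Last-Modified header (or None)."""
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Last-Modified")


def needs_download(local_mtime, last_modified):
    """True if the server copy looks newer than the local file.

    local_mtime is a POSIX timestamp (e.g. from os.path.getmtime);
    last_modified is an HTTP date string such as
    'Mon, 23 Feb 2015 00:41:48 GMT'. As noted above, this cannot
    detect a partial or corrupted local copy.
    """
    if last_modified is None:
        return True  # no metadata, so play it safe and re-fetch
    remote = email.utils.parsedate_to_datetime(last_modified).timestamp()
    return remote > local_mtime
```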

hamogu commented 3 years ago

I just saw something very similar in a slightly different setup: I downloaded data with download_chandra_obsid and then ran chandra_repro manually. download_chandra_obsid re-downloaded everything when I re-ran the notebook (presumably because chandra_repro unzipped all the files). I had expected the re-run to be very fast, since all the files (minus the .gz ending) already existed and nothing new should need to be downloaded; instead, I'm sitting here watching it download the same data again.

For me, that's relevant because I now like to do analysis in a notebook, and I sometimes re-run the entire notebook just to make sure it still works top to bottom after adding new imports, or after installing another package that I need further down. Sure, I can skip the download cell, but it's nice if it "just works".

DougBurke commented 3 years ago

For @hamogu - this is a slightly different issue from my original one, which is all about the oif.fits file - the way we do it is

ASIDE: this is surprisingly fragile, and in fact it did get broken at one point when the archive tweaked something; it should be less fragile now, but it's always going to be somewhat fragile given the use of HTML as the interchange format.

So we have a list of names and file sizes to iterate through:

The problem is that if you've uncompressed the files then we have a disconnect. Fortunately we know that the archive files are gzip-encoded (i.e. end in .gz), so we could extend that first check to look for the uncompressed name as well.

What we will miss is the ability to check the file size of the uncompressed file (as there's no way to get that), but I think if the file exists then we have to assume it's okay (and if it isn't, checks like CHECKSUM/DATASUM should catch this).
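That check could be sketched like this (a hypothetical helper, not the actual download_chandra_obsid logic):

```python
import os


def already_have(path):
    """Decide whether an archive file already exists locally.

    Archive files end in .gz; chandra_repro may have uncompressed
    them, so accept either the .gz name or the stripped name. For
    the uncompressed copy there is no size to compare against the
    archive listing, so existence alone has to count as
    "downloaded" (CHECKSUM/DATASUM checks would catch corruption
    later).
    """
    if os.path.exists(path):
        return True
    if path.endswith(".gz") and os.path.exists(path[:-3]):
        return True
    return False
```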

I'd rely on @kglotfelty pointing out I'm doing something stupid before I spend time on this...

kglotfelty commented 3 years ago

Not really adding much here ...

There is some level of confidence that if the file is uncompressed, it probably is complete, since the compressed format has its own self-consistency checks:

%  ls -l foo.fits.gz 
-rw-rw-r-- 1 kjg kjg 17992775 Apr 20 14:53 foo.fits.gz
% dd bs=1 count=600000 if=foo.fits.gz of=moo.fits.gz
600000+0 records in
600000+0 records out
600000 bytes (600 kB) copied, 1.56472 s, 383 kB/s
% ls -l moo.fits.gz
-rw-rw-r-- 1 kjg kjg 600000 Apr 20 14:57 moo.fits.gz
% gunzip moo.fits.gz

gzip: moo.fits.gz: unexpected end of file

I'm sure we can come up with examples where this fails or becomes murky (e.g. the disk filling up during decompression).

This presumes that chandra_repro's gunzipping works in the same way ... which appears to be the case (it uses the gzip module):

% python -c 'import gzip;r=gzip.GzipFile("moo.fits.gz","rb");open("bar","wb").write(r.read());'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/export/ciao-4.13/ots/lib/python3.7/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/export/ciao-4.13/ots/lib/python3.7/gzip.py", line 482, in read
    raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
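That EOFError could be turned into an explicit integrity check before trusting an existing .gz file; a sketch (a hypothetical helper, not anything in chandra_repro):

```python
import gzip


def is_complete_gzip(path, chunk_size=65536):
    """Return True if the gzip file decompresses through to its
    end-of-stream marker.

    A truncated download raises EOFError ("Compressed file ended
    before the end-of-stream marker was reached"); corrupted data
    raises an OSError (gzip.BadGzipFile is a subclass). Either way
    the file should be re-downloaded.
    """
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass
        return True
    except (EOFError, OSError):
        return False
```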

Unfortunately, when the partial compressed file is used directly, nothing bad is reported:

% dmlist moo.fits.gz blocks

--------------------------------------------------------------------------------
Dataset: moo.fits.gz
--------------------------------------------------------------------------------

     Block Name                          Type         Dimensions
--------------------------------------------------------------------------------
Block    1: PRIMARY                        Null        
Block    2: ASPSOL                         Table        20 cols x 312866   rows

Listing the data results in only 10861 rows, and the last one has bogus values -- but none of the error messages we'd hope for :frowning: