DougBurke opened this issue 4 years ago
The oif.fits file is special. When you download via chaser, the contents of that file are generated on-the-fly to match the package contents you've selected. The version of the file on the ~~ftp site~~ public archive is static and matches the standard primary and secondary directories.
So there is an argument to provide a special case for it. Also note that, because FITS files are blocked in 2880-byte chunks, the contents of oif.fits may actually have changed while the file size stays the same. The only way to really know is to download the file and compare CHECKSUM/DATASUM. If it were compressed, it's unlikely the file size would be the same (though, interestingly, it could be larger).
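For reference, a small sketch of that CHECKSUM/DATASUM check using astropy (the file name is just an example, and this is one possible way to do it, not how download_chandra_obsid works):

```python
# Sketch: verify the FITS CHECKSUM/DATASUM keywords of a downloaded file.
# Opening with checksum=True makes astropy warn if the stored values do
# not match the data that was actually read.
from astropy.io import fits

with fits.open("oif.fits", checksum=True) as hdus:
    for hdu in hdus:
        # 1 = valid, 0 = invalid, 2 = keyword not present
        print(hdu.name, hdu.verify_checksum(), hdu.verify_datasum())
```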
[I'm not necessarily :+1: it -- just providing justification should anyone want to take up the task.]
On a slight tangent: now that access is via HTTP(S), we get more metadata in the query:
% curl -I https://cxc.cfa.harvard.edu/cdaftp/byobsid/4/17484/axaff17484N001_VV001_vv2.pdf
HTTP/1.1 200 OK
Date: Wed, 18 Nov 2020 03:29:15 GMT
Server: Apache
Last-Modified: Mon, 23 Feb 2015 00:41:48 GMT
ETag: "874d-50fb6aad36b00"
Accept-Ranges: bytes
Content-Length: 34637
Content-Type: application/pdf
Set-Cookie: SERVERID=cxcweb27; path=/
Cache-control: private
If we can believe the Last-Modified date (and/or the ETag, but we have no sensible place to stash that) then we could use the date to check (though it still doesn't help us determine whether we have a partial or corrupted download).
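As an illustration (a sketch, not the actual download code; the local path is hypothetical), a HEAD request gives us the headers above, and Last-Modified plus Content-Length could gate the re-download:

```python
# Sketch: skip a file when the local copy is at least as new as the
# server's Last-Modified date and has the expected size. This still
# cannot spot a corrupted-but-complete download.
import os
import urllib.request
from email.utils import parsedate_to_datetime

url = "https://cxc.cfa.harvard.edu/cdaftp/byobsid/4/17484/axaff17484N001_VV001_vv2.pdf"
local = "axaff17484N001_VV001_vv2.pdf"  # hypothetical local path

req = urllib.request.Request(url, method="HEAD")
with urllib.request.urlopen(req) as rsp:
    remote_mtime = parsedate_to_datetime(rsp.headers["Last-Modified"]).timestamp()
    remote_size = int(rsp.headers["Content-Length"])

if (os.path.exists(local)
        and os.path.getmtime(local) >= remote_mtime
        and os.path.getsize(local) == remote_size):
    print("skipping", local)
else:
    print("downloading", local)
```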
I just saw something very similar in a slightly different setup: I downloaded data with download_chandra_obsid and then ran chandra_repro manually. download_chandra_obsid re-downloaded everything when I re-ran the notebook (presumably because chandra_repro had unzipped all the files). I had expected the re-run to be very fast, since no new files would need to be downloaded -- all the files (minus the .gz ending) already existed; instead, I'm sitting here watching it download the same data again.
For me, that's relevant because I now like to do analysis in a notebook, but I sometimes re-run the entire notebook just to make sure it still works top to bottom after adding new imports, or after installing another package that I need further down. Sure, I can skip the download cell, but it's nice if it "just works".
For @hamogu -- this is a slightly different issue than my original one, which is all about the oif.fits file -- the way we do it is to query the archive's HTML directory listing and parse out the file names and sizes.
ASIDE: this is surprisingly fragile, and in fact it did get broken at one point when the archive tweaked something; it should be less fragile now, but it's always going to be somewhat fragile given the use of HTML as the interchange format.
So we have a list of names and file sizes to iterate through:
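Roughly along these lines (a rough sketch, not the actual ciao_contrib code; the URL is just an example and the regex assumes Apache's "fancy index" layout, which is exactly the fragility mentioned in the ASIDE):

```python
# Sketch: fetch the archive's Apache-style HTML index and pull out
# (name, size) pairs with a regex.
import re
import urllib.request

url = "https://cxc.cfa.harvard.edu/cdaftp/byobsid/4/17484/primary/"
with urllib.request.urlopen(url) as rsp:
    html = rsp.read().decode("utf-8", errors="replace")

# Each row looks roughly like:
#   <a href="oif.fits.gz">oif.fits.gz</a>   23-Feb-2015 00:41   17K
# Sizes may be abbreviated ("17K"), so real code would convert them or
# fall back to a HEAD request for the exact Content-Length.
rows = re.findall(r'<a href="([^"?/][^"]*)">.*?([\d.]+[KMG]?)\s*$',
                  html, re.MULTILINE)
for name, size in rows:
    print(name, size)
```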
The problem is that if you've uncompressed the file then we have a disconnect. Fortunately we know that the files are gzip-encoded (i.e. they end in .gz), so we could tweak the first step to also check for the file name with the .gz suffix removed.
What we will miss is the ability to check the file size of the uncompressed file (as there's no way to get that from the listing), but I think if the file exists then we have to assume it's okay (and if it isn't, things like CHECKSUM/DATASUM should catch it).
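Something like this sketch (the function name and interface are illustrative, not the actual ciao_contrib code):

```python
# Sketch: given a file name and compressed size from the archive
# listing, decide whether we need to (re-)download it.
import os

def need_download(fname, expected_size):
    """fname is the archive name (ends in .gz); expected_size is its
    compressed size in bytes, taken from the directory listing."""
    if os.path.exists(fname):
        # We still have the compressed file, so the size check works.
        return os.path.getsize(fname) != expected_size

    if fname.endswith(".gz") and os.path.exists(fname[:-3]):
        # The user has uncompressed it: we can't check the size, so
        # assume it's okay (CHECKSUM/DATASUM should catch corruption).
        return False

    return True
```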
I'd rely on @kglotfelty pointing out I'm doing something stupid before I spend time on this...
Not really adding much here ...
There is some level of confidence that, if the file is uncompressed, it probably is complete, since the compressed format has its own self-consistency checks:
% ls -l foo.fits.gz
-rw-rw-r-- 1 kjg kjg 17992775 Apr 20 14:53 foo.fits.gz
% dd bs=1 count=600000 if=foo.fits.gz of=moo.fits.gz
600000+0 records in
600000+0 records out
600000 bytes (600 kB) copied, 1.56472 s, 383 kB/s
% ls -l moo.fits.gz
-rw-rw-r-- 1 kjg kjg 600000 Apr 20 14:57 moo.fits.gz
% gunzip moo.fits.gz
gzip: moo.fits.gz: unexpected end of file
I'm sure we can come up with examples where this fails or becomes murky (e.g. the disk filling up during compression).
This presumes that chandra_repro's gunzipping works the same way ... which appears to be the case (it uses the gzip module):
% python -c 'import gzip;r=gzip.GzipFile("moo.fits.gz","rb");open("bar","wb").write(r.read());'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/export/ciao-4.13/ots/lib/python3.7/gzip.py", line 276, in read
return self._buffer.read(size)
File "/export/ciao-4.13/ots/lib/python3.7/gzip.py", line 482, in read
raise EOFError("Compressed file ended before the "
EOFError: Compressed file ended before the end-of-stream marker was reached
Unfortunately, when the partial compressed file is used, nothing bad is reported:
% dmlist moo.fits.gz blocks
--------------------------------------------------------------------------------
Dataset: moo.fits.gz
--------------------------------------------------------------------------------
Block Name Type Dimensions
--------------------------------------------------------------------------------
Block 1: PRIMARY Null
Block 2: ASPSOL Table 20 cols x 312866 rows
Listing the data results in only 10861 rows, and the last one has bogus values -- but none of the error messages we'd hope for :frowning:
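So an explicit check seems needed before trusting an on-disk .gz file. A minimal sketch (the helper name is mine, not part of any tool): reading the whole stream forces gzip to verify the CRC32 and uncompressed length stored in its trailer, so truncation or corruption raises an error:

```python
# Sketch: decompress the whole .gz stream to nowhere; a truncated file
# raises EOFError and a corrupt one raises OSError (BadGzipFile).
import gzip

def gzip_ok(fname):
    try:
        with gzip.open(fname, "rb") as fh:
            while fh.read(1024 * 1024):
                pass
        return True
    except (EOFError, OSError):
        return False

print(gzip_ok("moo.fits.gz"))  # False for the truncated file above
```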
I just came across the following, which happened because I had manually unpacked a hand-downloaded set of files from the archive into a directory in which I'd already run download_chandra_obsid. This meant that the oif.fits file was different, so there was an attempt to re-download it, but this failed because the file had been created with read-only permissions.

Things to think about:
- should the file be made u+w before re-downloading over it?
- is oif.fits a special case here, or does download_chandra_obsid need to handle read-only files in general?