Bad measurement sets being written out

harrisonbarlow commented 2 years ago

Natasha has reported problems with some measurement sets created by Birli. I believe she was trying to apply a calibration solution using Casacore and ran into this error:

what(): FilebufIO::readBlock - incorrect number of bytes read for file /astro/mwasci/nhurleywalker/Epoch0032/1338206432/1338206432.ms/table.f0i

This was seen with the following obsids: 1338205248, 1338205544, 1338206432, 1338212768, 1338217088, 1338217976

I have recreated the measurement set and left it in /astro/mwaops/asvo/577447 for somebody to investigate the issue further.

harrisonbarlow commented 2 years ago

Also, the command used to run Birli was:

/opt/cargo/bin/birli --max-memory 300 --avg-freq-res 40 --avg-time-res 4 --flag-edge-width 80 -M /nvmetmp/1338205544.ms -m /astro/mwaasvo/mwaservice/asvo/prod/jobs/577447/raw/1338205544.metafits /astro/mwaasvo/mwaservice/asvo/prod/jobs/577447/raw/1338205544_*.fits

d3v-null commented 2 years ago

I wasn't able to reproduce any issue reading this MS with CASA. Can you provide more specific instructions on how to replicate the issue? Here's my setup, which demonstrates that CASA can read all the rows of all the columns of the main table on Garrawarla:

salloc --partition workq --time 1:00:00 --nodes=1 --mem=100000
module use /pawsey/mwa/software/python3/modulefiles
module load casa
casa

...
PIPELINE CASA 5.6.1-8   -- Common Astronomy Software Applications
...

tb.open('1338205544.ms/')
tb.getcol('DATA')

array([[[  1.28700398e+05 +2.98831765e-06j,
           1.81150366e+03 +1.36531421e+03j,
          -2.44577194e+02 -4.16838913e+01j, ...

for col in tb.colnames():
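    # Compare strings with ==, not `is`; FLAG_CATEGORY is not written by Birli.
    if col == 'FLAG_CATEGORY': continue
    # CASA 5.6 runs Python 2, so this prints a (name, shape) tuple.
    print(col, tb.getcol(col).shape)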

('UVW', (3, 689384))
('FLAG', (4, 768, 689384))
('WEIGHT', (4, 689384))
('SIGMA', (4, 689384))
('ANTENNA1', (689384,))
('ANTENNA2', (689384,))
('ARRAY_ID', (689384,))
('DATA_DESC_ID', (689384,))
('EXPOSURE', (689384,))
('FEED1', (689384,))
('FEED2', (689384,))
('FIELD_ID', (689384,))
('FLAG_ROW', (689384,))
('INTERVAL', (689384,))
('OBSERVATION_ID', (689384,))
('PROCESSOR_ID', (689384,))
('SCAN_NUMBER', (689384,))
('STATE_ID', (689384,))
('TIME', (689384,))
('TIME_CENTROID', (689384,))
('DATA', (4, 768, 689384))
('WEIGHT_SPECTRUM', (4, 768, 689384))

Birli doesn't write the FLAG_CATEGORY column, but neither does Cotter.

nhurleywalker commented 2 years ago

I was using applycal, one of Andre's MWA Tools.

I traced back the logs and the first time I encountered the error was after downloading the file from ASVO:

Ready for Download: Job id: 577235 Obs id: 1338206432 type: conversion size: 25847367680 bytes
Downloading: Job id: 577235 file: https://ingest.pawsey.org.au/mwa-asvo/1338206432_577235_ms.tar?AWSAccessKeyId=0f61c75cd1184e5abc76500d71758927&Signature=XFUXNVIxxAUxfY8buZh3WG2MwMU%3D&Expires=1657705113 size: 25847367680 bytes

Then I try to move it and untar it:

+ mv .././1338206432_577235_ms.tar ./
tar xf ./1338206432_577235_ms.tar
tar: Unexpected EOF in archive
tar: rmtlseek not stopped at a record boundary
tar: Error is not recoverable: exiting now

So it's not a measurement set error per se, it's something wrong with the transfer from ASVO. I don't have any further debugging in the logs and I'm afraid I didn't keep the measurement sets because they were obviously broken, but if I see this error again, I'll keep them and send them to you for analysis.
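A truncated download like this can be caught before untarring, by comparing the size on disk with the size reported in the download log (25847367680 bytes here) and by reading through the archive. The following is a minimal Python sketch, assuming the archive is an uncompressed tar. The helper names `size_matches` and `tar_is_complete` are made up for illustration and are not part of any ASVO client.

```python
import os
import tarfile


def size_matches(path, expected_bytes):
    """Return True if the file on disk has exactly the advertised size."""
    return os.path.getsize(path) == expected_bytes


def tar_is_complete(path):
    """Return True if every regular file in the archive can be read in full."""
    try:
        # "r:" opens the archive as a plain, uncompressed tar.
        with tarfile.open(path, "r:") as archive:
            for member in archive:
                if member.isfile():
                    handle = archive.extractfile(member)
                    # Read in 1 MiB chunks so large members are not
                    # loaded into memory all at once.
                    while handle.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, EOFError, OSError):
        return False
```

The size comparison is the cheaper and more dependable of the two checks. Reading through the archive touches every byte, and Python's `tarfile` may not flag an archive that happens to be cut exactly on a member boundary.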

d3v-null commented 2 years ago

I've actually noticed the unexpected EOF when downloading from Acacia myself. From what I can gather, downloads can just randomly fail, and it's up to the client to detect this and resume. For my pipeline this is particularly annoying because we pipe curl directly into tar, so when we encounter this, there's no easy way to recover. However, since you're downloading the tar and decompressing separately, you can make use of wget's ability to automatically resume a connection, although I haven't tested this myself. From the manpage:

Wget has been designed for robustness over slow or unstable network connections;
if a download fails due to a network problem, it will keep retrying until the
whole file has been retrieved.  If the server supports regetting, it will
instruct the server to continue the download from where it left off.
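The same idea can be sketched in Python for pipelines that don't shell out to wget: ask the server only for the bytes that are still missing, using an HTTP `Range` header. This is a rough, untested-against-Acacia sketch that assumes the server honours range requests, which has not been confirmed here. The function names `build_resume_request` and `fetch_remaining` are hypothetical.

```python
import os
import urllib.request


def build_resume_request(url, dest):
    """Build a GET request that asks only for the bytes not yet in `dest`."""
    have = os.path.getsize(dest) if os.path.exists(dest) else 0
    request = urllib.request.Request(url)
    if have:
        # Open-ended range: everything from byte offset `have` onwards.
        request.add_header("Range", "bytes=%d-" % have)
    return request, have


def fetch_remaining(url, dest, chunk_size=1 << 20):
    """Append the missing bytes to `dest`; return the resulting file size."""
    request, have = build_resume_request(url, dest)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the Range header. Anything else
        # (typically 200) is a full body, so the file is started again.
        mode = "ab" if have and response.status == 206 else "wb"
        with open(dest, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(dest)
```

A caller would compare the returned size with the size advertised for the job and call `fetch_remaining` again, catching any network exception, until the two match.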

Since we've established that this is an issue with Acacia and not Birli, I'll close this issue. Glad we could get to the bottom of it!

MWATelescope / Birli

Bad measurement sets being written out #91