aodn / content

Tracks AODN Portal content and configuration issues
ACORN NRT radial files corrupted #409

Closed: ggalibert closed this issue 2 years ago

ggalibert commented 5 years ago

The following files have been sent to us in a corrupted state:

Size  Month  Day  Time/Year  File name
47K   Oct    31   2018       IMOS_ACORN_RV_20181031T043000Z_RRK_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20181031T023000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190605T160000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190605T162000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190605T172000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190606T000000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190723T093000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190723T113000Z_CWI_FV00_radial.nc
0     Aug    12   18:22      IMOS_ACORN_RV_20190723T134000Z_CWI_FV00_radial.nc
0     Sep    4    17:54      IMOS_ACORN_RV_20190903T061000Z_GUI_FV00_radial.nc
228K  Oct    20   09:27      IMOS_ACORN_RV_20191019T220000Z_CWI_FV00_radial.nc

@scosoli, these files need to be manually re-uploaded if a sane version exists somewhere; otherwise let us know and we'll move on.

scosoli commented 5 years ago

Thanks Guillaume, I will check whether the files can be reprocessed and, if so, will upload them. Out of curiosity, what is the reported problem?

ggalibert commented 5 years ago

The problem is that these files are not valid NetCDF files. As you can see, IMOS_ACORN_RV_20181213T100500Z_CSP_FV00_radial.nc is empty, some are truncated, like IMOS_ACORN_RV_20190205T124500Z_NNB_FV00_radial.nc, and some are not NetCDF at all.

The question is: did the upload operation fail and leave invalid files on our server, or was the original file already corrupted (that is, did a problem occur while the file was being generated)?

scosoli commented 5 years ago

Interestingly, it seems to happen only with the WERA radials, for which the conversion is managed through the Python scripts. We'll check what the problem is. I can add a sanity check before launching the transfer process, but I need to find where the problem is.

ggalibert commented 5 years ago

@scosoli, 5 more corrupted files were uploaded on 18 April from CSP and NNB. They have been added to the list above.

scosoli commented 5 years ago

Thanks for reporting that. I'm implementing a sanity check right now that may get rid of this issue. I will test it this week.

scosoli commented 5 years ago

The sanity check is implemented: right before the rsync upload, a script checks for I/O errors, missing variables and so on, and if any are found the corrupt file is moved to an 'error' directory on our server for later checks. As it stands the script should be fairly robust, but please let me know if corrupt files are still uploaded and I'll fix it properly. For now it only acts on the Python-generated radials.
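For illustration only, a pre-upload check of this kind could look roughly like the sketch below. It is not the script referred to above: the function names and the queue and error directory arguments are hypothetical, and the check is limited to file size and leading signature bytes, so it catches empty files and non-NetCDF content but not files truncated after the header. A fuller check would open each file with a NetCDF library and verify the expected variables.

```python
import shutil
from pathlib import Path

# Leading bytes a NetCDF file is expected to start with.
NETCDF_MAGIC = (
    b"CDF\x01",            # NetCDF classic
    b"CDF\x02",            # NetCDF 64-bit offset
    b"CDF\x05",            # NetCDF 64-bit data (CDF-5)
    b"\x89HDF\r\n\x1a\n",  # HDF5 container used by NetCDF-4
)


def looks_like_netcdf(path):
    """Return True if the file is non-empty and starts with a known signature."""
    path = Path(path)
    if not path.is_file() or path.stat().st_size == 0:
        return False
    with path.open("rb") as handle:
        header = handle.read(8)
    return header.startswith(NETCDF_MAGIC)


def quarantine_invalid(queue_dir, error_dir):
    """Move files that fail the check out of the queue; return their names."""
    error_dir = Path(error_dir)
    error_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for candidate in sorted(Path(queue_dir).glob("*.nc")):
        if not looks_like_netcdf(candidate):
            shutil.move(str(candidate), str(error_dir / candidate.name))
            moved.append(candidate.name)
    return moved
```

Run just before the transfer step, `quarantine_invalid` leaves only files that pass the check in the queue, so zero-length files would never reach rsync.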

scosoli commented 5 years ago

I will reprocess and upload the missing files later, when I'm back from leave.

ggalibert commented 5 years ago

Thank you, please let us know when you do re-upload the files.

ggalibert commented 5 years ago

1 more corrupted file was uploaded on 27 May from GUI. It has been added to the list above.

lbesnard commented 5 years ago

@scosoli more files landed in the error directory today. They are either corrupted, or lacking pretty much everything.

scosoli commented 5 years ago

I will see what I can do, but it will have to wait as I have other priorities right now.

scosoli commented 5 years ago

We have started investigating what is going on. I have asked @badema to have a look, and there are a couple of ongoing issues that need to be fixed.

lbesnard commented 5 years ago

@scosoli, following the communication outage that began last night, a few more empty files landed in our incoming directory.

scosoli commented 5 years ago

I am reprocessing and uploading the corrupt files. So far I have reprocessed the files from CWI with the MATLAB version of the RT scripts, and as far as I can see they have landed successfully on the portal. Please correct me if I am wrong and I'll investigate in more detail.

scosoli commented 5 years ago

As far as I can see, all files reported above as corrupt have been reprocessed and uploaded to the portal. Please report any file I may have missed.

ggalibert commented 5 years ago

@scosoli, thank you for that. I can confirm that all but one of the files have been re-uploaded successfully. Unfortunately, some new corrupted files landed 2 days ago. Please see above for the most recent list of corrupted files to be re-uploaded.

scosoli commented 5 years ago

I can't see why the file would fail. Can you provide more details?

ggalibert commented 5 years ago

This is what I use to transfer data from the RT queue to the incoming directory on your end:

rsync --password-file $path_to_password -ruv --remove-source-files ~/queued/*.nc acorn@incoming.aodn.org.au::acorn_staging >> ~/transfer_log

This is embedded in a very basic shell script that sets environment variables, the path to the password file, and similar. It is run via a cron job every 15 minutes to keep up with the data flow. The only thing I can think of is that a separate process calls rsync on a file that is then being removed by a previous call. -- @scosoli
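If overlapping cron runs are indeed the cause, one common mitigation is to take an exclusive lock before starting the transfer so that a new run is skipped while the previous one is still active. The sketch below is an illustration under that assumption, not the script in use: `run_exclusively` and the lock file path are hypothetical names, and `fcntl.flock` is available on Unix-like systems only.

```python
import fcntl
import subprocess


def run_exclusively(lock_path, command):
    """Run command only if no other process holds the lock.

    Returns the command's exit code, or None if another run is active.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # a previous run is still active, skip this one
        # The lock is released when lock_file is closed on leaving the block.
        return subprocess.run(command).returncode


# Hypothetical usage from the cron-driven wrapper:
# run_exclusively("/tmp/acorn_transfer.lock", ["/path/to/transfer_script.sh"])
```

A run that returns `None` simply leaves the queued files for the next 15-minute cycle, so no two rsync calls touch the same file at the same time.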

Taking one of the files as an example, it was first uploaded with a length of 291620 bytes, and then exactly 15 minutes later a zero-length file with the same name was uploaded.

$ grep IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc rsync_acorn_staging.log
2019/08/12 18:07:09 [17180] recv UNKNOWN [130.95.29.7] acorn_staging (acorn) IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc 291620
2019/08/12 18:22:09 [7244] recv UNKNOWN [130.95.29.7] acorn_staging (acorn) IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc 0

The first file was successfully published, with the expected file length:

$ aws s3 ls --no-sign s3://imos-data/IMOS/ACORN/radial/CWI/2019/06/06/IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc
2019-08-12 18:07:32     291620 IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc

The smoking gun is really that second upload of a zero-length file, which suggests a script bug on your side.
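The same pattern can be searched for across the whole log rather than one file at a time. The following is a rough sketch that assumes every `recv` line has the layout of the two log lines shown above (date, time, process id, then file name and byte count as the last two fields); the function name is hypothetical and the regular expression would need adjusting for any other rsync log format.

```python
import re
from collections import defaultdict

# Layout assumed from the two log lines quoted above.
RECV_LINE = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) \[\d+\] recv .* (?P<name>\S+) (?P<size>\d+)$"
)


def zero_length_reuploads(lines):
    """Map file name -> timestamps of zero-length uploads that follow a non-empty one."""
    uploads = defaultdict(list)
    for line in lines:
        match = RECV_LINE.match(line.strip())
        if match:
            stamp = f"{match.group('date')} {match.group('time')}"
            uploads[match.group("name")].append((stamp, int(match.group("size"))))

    suspicious = {}
    for name, events in uploads.items():
        seen_nonempty = False
        for stamp, size in events:  # events keep the order of the log
            if size > 0:
                seen_nonempty = True
            elif seen_nonempty:
                suspicious.setdefault(name, []).append(stamp)
    return suspicious
```

Calling it with the lines of `rsync_acorn_staging.log` would list every file that was received intact and later overwritten by an empty upload.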

ggalibert commented 5 years ago

A quick note to let you know that we have started investigating one of the possible sources of error or corruption in the NetCDF creation stage. Badema has developed all the Python scripts for this purpose and came across an HDF problem, which we'll try to solve. She will provide further details on this problem, which seems to occur at random and appears to be a known issue in the NetCDF creation stage. -- @scosoli

ggalibert commented 5 years ago

Another empty file came up yesterday. See updated list above.

ggalibert commented 4 years ago

Another corrupted file came up on Sun 20/10/2019. See updated list above.

scosoli commented 4 years ago

Yes, I was notified by email too. We're having some annoying issues with CWI and we don't seem to be able to find the source. It looks like an incompatibility with the cpci or some faulty cables.

ocehugo commented 4 years ago

@scosoli, FYI: I'm trying to clean up the backlog of errored files on the ACORN stream.

We still have the files mentioned above in the error directory. All of them are empty or invalid NetCDF files.

The CWI files below are already published (from 2018) and failed because the files are empty (0 bytes) or invalid NetCDF files.

Date Time File name
2020-01-29 11:51:39.351823494 IMOS_ACORN_RV_20181031T023000Z_CWI_FV00_radial.nc
2020-01-29 11:51:46.711575334 IMOS_ACORN_RV_20190605T160000Z_CWI_FV00_radial.nc
2020-01-29 11:51:50.123460291 IMOS_ACORN_RV_20190605T162000Z_CWI_FV00_radial.nc
2020-01-29 11:51:53.555344574 IMOS_ACORN_RV_20190605T172000Z_CWI_FV00_radial.nc
2020-01-29 11:51:57.123224273 IMOS_ACORN_RV_20190606T000000Z_CWI_FV00_radial.nc
2020-01-29 11:52:00.791100600 IMOS_ACORN_RV_20190606T042000Z_CWI_FV00_radial.nc
2020-01-29 11:52:04.190985964 IMOS_ACORN_RV_20190723T093000Z_CWI_FV00_radial.nc
2020-01-29 11:52:07.614870520 IMOS_ACORN_RV_20190723T113000Z_CWI_FV00_radial.nc
2020-01-29 11:52:10.998756425 IMOS_ACORN_RV_20190723T134000Z_CWI_FV00_radial.nc
2020-01-29 11:52:18.278510974 IMOS_ACORN_RV_20191019T220000Z_CWI_FV00_radial.nc
2020-04-01 21:21:50.392496494 IMOS_ACORN_RV_20200327T044000Z_CWI_FV00_radial.nc
2020-04-01 20:21:54.177863730 IMOS_ACORN_RV_20200327T121000Z_CWI_FV00_radial.nc

The GUI/RRK files below aren't published and, although they failed a bit further down the line (compliance-checker), they are not valid NetCDF files (I think the compliance checker cannot open them, so they fail outright).

Date Time File name
2020-01-29 11:52:14.614634509 IMOS_ACORN_RV_20190903T061000Z_GUI_FV00_radial.nc
2020-01-29 11:51:43.115696581 IMOS_ACORN_RV_20181031T043000Z_RRK_FV00_radial.nc