Closed ocehugo closed 2 years ago
@ocehugo @ggalibert @yashahoo I need a few more details on what is going on with these files. I run a fairly extensive sanity check on the files before calling the upload procedure that checks for 1, read/write errors; 2, variable dimensions; 3, variable content, sanity of the content, and write/read errors; 4, global attributes. it would definitely help me troubleshooting the issues if I have a full picture of the errors in the ingestion process as the log that is usually attached to the mail I receive upon failure is pretty scarce.
@lbesnard I need a few more details on what is going on with these files. I run a fairly extensive sanity check on the files before calling the upload procedure that checks for 1, read/write errors; 2, variable dimensions; 3, variable content, sanity of the content, and write/read errors; 4, global attributes. it would definitely help me troubleshooting the issues if I have a full picture of the errors in the ingestion process as the log that is usually attached to the mail I receive upon failure is pretty scarce.
I'm not sure of what you are receiving in the emails, but I assume it is something like the below.
I can surely attach all the logs, of all the individual files, so you could debug yourself. However, I did that for you already in the above post (and others). For example, below is an invalid netcdf file.
Are you sure you wanna it all?
2021-04-23 21:04:45,646 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] PipelineTask.success_exit_policies -> ['DELETE_SAME_NAME_FROM_ERROR_STORE']
2021-04-23 21:04:45,646 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] PipelineTask.error_exit_policies -> []
2021-04-23 21:04:45,647 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] IncomingFileStateManager.state -> 'FILE_IN_INCOMING'
2021-04-23 21:04:45,647 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] IncomingFileStateManager.error_broker -> LocalFileStorageBroker(prefix='/mnt/ebs/error/ACORN')
2021-04-23 21:04:45,821 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] IncomingFileStateManager.move_to_processing -> '/mnt/ebs/processing/ACORN/50711800-543d-4b3f-bbf0-698faa69dc93/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc'
2021-04-23 21:04:45,822 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] IncomingFileStateManager.state -> 'FILE_IN_PROCESSING'
2021-04-23 21:04:45,883 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] running handler -> AcornHandler({'temp_dir': '/mnt/ebs/tmp/AcornHandler7jn3omvn/temp', 'instance_working_directory': '/mnt/ebs/tmp/AcornHandler7jn3omvn', 'archive_input_file': False, 'file_checksum': None, 'error': None, 'exclude_regexes': [], 'collection_dir': '/mnt/ebs/tmp/AcornHandler7jn3omvn/collection', 'check_params': None, 'notify_params': WriteOnceOrderedDict([('owner_notify_list', ['email:laurent.besnard@utas.edu.au', 'email:guillaume.galibert@utas.edu.au']), ('error_notify_list', ['email:simone.cosoli@uwa.edu.au']), ('notify_owner_error', True), ('notify_owner_success', False)]), 'resolve_params': None, 'celery_task_id': '50711800-543d-4b3f-bbf0-698faa69dc93', 'file_collection': None, 'file_basename': None, 'opendap_root': 'http://thredds.aodn.org.au/thredds/dodsC/', 'custom_params': None, 'archive_path_function': None, 'result': <HandlerResult.UNKNOWN: 0>, 'error_cleanup_regexes': None, 'notification_results': None, 'file_extension': None, 'harvest_params': None, 'should_notify': None, 'input_file': '/mnt/ebs/processing/ACORN/50711800-543d-4b3f-bbf0-698faa69dc93/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc', 'celery_task_name': 'tasks.ACORN', 'harvest_type': 'talend', 'allowed_regexes': ['^IMOS_ACORN_[A-Z].*_[0-9]{8}T[0-9]{6}Z_[A-Z]{3,4}_FV00_(radial|sea-state|1-hour-avg)\\.nc$'], 'file_type': None, 'allowed_extensions': ['.nc'], 'input_file_archive_path': 'ACORN/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc', 'allowed_dest_path_regexes': ['^IMOS/ACORN/.*FV00.*\\.nc$'], 'error_details': None, 'dest_path_function': None, 'allowed_archive_path_regexes': None, 'upload_path': 'ACORN/any/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc', 'versions': {'python': '3.5.2', 'aodncore': '1.2.8', 'cc-plugin-imos': '1.4.6', 'compliance-checker': '4.1.1', 'aodndata': '1.3.41', 'aodntools': '1.4.7'}, 'start_time': datetime.datetime(2021, 4, 23, 21, 4, 45, 822507), 'products_dir': '/mnt/ebs/tmp/AcornHandler7jn3omvn/products', 'include_regexes': []})
2021-04-23 21:04:45,884 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_file_checksum -> 'c2353d0698a5dff3ccac74f67f3a0449b167ac25701a22cacd24a74c76100416'
2021-04-23 21:04:45,884 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] file_basename -> 'IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc'
2021-04-23 21:04:45,884 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] file_extension -> '.nc'
2021-04-23 21:04:45,884 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] file_type -> FileType.NETCDF
2021-04-23 21:04:45,885 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_path_function (upload) -> AcornHandler.dest_path
2021-04-23 21:04:45,885 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_path_function (archive) -> AcornHandler.dest_path
2021-04-23 21:04:45,885 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] AcornHandler transitioned to state: HANDLER_INITIALISED
2021-04-23 21:04:45,885 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_resolve_runner -> SingleFileResolveRunner()
2021-04-23 21:04:45,886 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] updated file 'IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc': {'publish_type': 'HARVEST_UPLOAD'}
2021-04-23 21:04:45,888 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] AcornHandler transitioned to state: HANDLER_RESOLVED
2021-04-23 21:04:45,888 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] updated file 'IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc': {'publish_type': 'HARVEST_UPLOAD'}
2021-04-23 21:04:45,939 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_store_runner (upload) -> StoreRunner(broker=S3StorageBroker(bucket='imos-data', prefix=''))
2021-04-23 21:04:46,012 ERROR tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] OSError: [Errno -101] NetCDF: HDF error: b'/mnt/ebs/tmp/AcornHandler7jn3omvn/collection/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc'
2021-04-23 21:04:46,013 WARNING tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] exception is not a user-correctable problem, excluding 'error_notify_list' from notification
2021-04-23 21:04:46,014 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_notify_runner -> NotifyRunnerAdapter()
2021-04-23 21:04:46,015 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] get_child_notify_runner -> EmailNotifyRunner()
2021-04-23 21:04:46,016 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] email recipients: ['laurent.besnard@utas.edu.au', 'guillaume.galibert@utas.edu.au']
2021-04-23 21:04:49,056 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] all notification attempts were successful
2021-04-23 21:04:49,056 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] AcornHandler transitioned to state: HANDLER_NOTIFIED_ERROR
2021-04-23 21:04:49,056 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] handler result for input_file '/mnt/ebs/processing/ACORN/50711800-543d-4b3f-bbf0-698faa69dc93/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc': ERROR
2021-04-23 21:04:49,056 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] AcornHandler transitioned to state: HANDLER_COMPLETED_ERROR
2021-04-23 21:04:49,057 INFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] IncomingFileStateManager.move_to_error -> '/mnt/ebs/error/ACORN/IMOS_ACORN_RV_20210315T232500Z_PTB_FV00_radial.nc.50711800-543d-4b3f-bbf0-698faa69dc93'
2021-04-23 21:04:49,058 SYSINFO tasks.ACORN[50711800-543d-4b3f-bbf0-698faa69dc93] IncomingFileStateManager.state -> 'FILE_IN_ERROR'
@ocehugo , these logs outputs aren't really helpful for @scosoli
I just need one as example of what may e wrong there.
in comparison to what you just share, the log I receive is something like (re.IMOS_ACORN_RV_20210225T020000Z_GUI_FV00_radial.nc)
"invalid format: did not validate as type: NETCDF"
so it is not giving me any details on where the failure occurs.
btw I think I found one small error in the code I am using for the error checks - an 'exit' instead of a 'return' in a try/catch loop that was breaking the check for the entire list of files if one of the inputs was corrupt; which resulted in following files potentially skipping the check and being uploaded. Please bear with me while I confirm this is what happened, and likely expect a few more corrupt files to try landing on the portal.
if that happens, please let me know and I will investigate further.
@scosoli:
invalid format: did not validate as type: NETCDF
This means the file is an invalid netcdf or corrupted. Maybe the process you are running started creating the file but something happened at the write/commit to disk time.
This kind of "interpretation" is exactly what I did above. For example, if you try to inspect an "invalid netcdf" you will get:
ncdump: IMOS_ACORN_RV_20210225T020000Z_GUI_FV00_radial.nc: NetCDF: Can't open HDF5 attribute
What I can provide is the full log OR an interpretation of the log messages based on both the log and files received :)
"interpretation" definitely helps, and is what I need, thanks. I have modified the script after the errors I found, added some verbosity and added a timing offset between the checks and the upload to possibly prevent issues. please keep an eye and let me know
@scosoli , I'' be watching the error folders over the coming days, so we can follow up on any left-over errors. I'll also create an issue so we follow up on improving the email error msgs.
@scosoli, I'm taking over from @ocehugo . Is this issue fixed?
@scosoli - We got plenty of files not ingested over the last 2mo in the NRT stream. It seems that some logic is needed to avoid sending corrupted NetCDF,s invalid netcfs, NetCDF without TIME records, and empty netcdfs.
The najority of files is from PTB and GUI, with the bulk from March/April.
Files with invalid content:
Invalid netcdfs:
files with zero records:
empty netcdfs:
miscellaneous errors (e.g.. global attributes):