E3SM-Project / zstash

Long term HPSS archiving tool for E3SM
BSD 3-Clause "New" or "Revised" License

Check errors disappear on the second try #174

Open tangq opened 2 years ago

tangq commented 2 years ago

I encountered the errors below for a few files when specifying the tar file names.

When I retried, the files were checked successfully. What do these errors mean, and why do they disappear on a second try? Thanks.

INFO: Transferring file from HPSS: zstash/000103.tar
ERROR: Error=Transferring file from HPSS: 000103.tar, Command was `hsi -q "cd /home/t/tang30/E3SMv2/v2.NARRM.piControl; get 000103.tar"`. This command includes `hsi`. Be sure that you have logged into `hsi`.
Traceback (most recent call last):
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/bin/zstash", line 10, in <module>
    sys.exit(main())
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/main.py", line 68, in main
    check()
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/check.py", line 12, in check
    extract.extract(keep_files=False)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/extract.py", line 44, in extract
    failures: List[FilesRow] = extract_database(args, cache, keep_files)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/extract.py", line 210, in extract_database
    failures = extractFiles(matches, keep_files, config.keep, cache)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/extract.py", line 374, in extractFiles
    hpss_get(hpss, tfname, cache)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/hpss.py", line 99, in hpss_get
    hpss_transfer(hpss, file_path, "get", cache, False)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/hpss.py", line 76, in hpss_transfer
    run_command(command, error_str)
  File "/global/common/software/e3sm/anaconda_envs/base/envs/e3sm_unified_1.5.1_nompi/lib/python3.8/site-packages/zstash/utils.py", line 55, in run_command
    raise Exception(error_str)
Exception: Error=Transferring file from HPSS: 000103.tar, Command was `hsi -q "cd /home/t/tang30/E3SMv2/v2.NARRM.piControl; get 000103.tar"`. This command includes `hsi`. Be sure that you have logged into `hsi`.

forsyth2 commented 2 years ago

@tangq I'm not sure why they disappeared on a second try. The Exception notes that you might not have been logged into hsi the first time, but that's certainly not the only possible reason the tar failed to transfer.

tangq commented 2 years ago

Another check job, run without specifying the tar file names, failed with similar errors. So, the more flexible input file name options you implemented are really helpful when checking large simulations.

My question is: given the successful second check, is it safe to say that these tar files are correctly archived?

forsyth2 commented 2 years ago

So, the more flexible input file name options you implemented are really helpful when checking large simulations.

Great, glad to hear #170 is working well.

My question is: given the successful second check, is it safe to say that these tar files are correctly archived?

If you have a log file from the check, you can run `grep -i Exception <log_file>` to double-check for any errors.
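
For scripted checks, the same case-insensitive search could be sketched in Python. This is only an illustration, not part of zstash: the helper name `find_exception_lines` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def find_exception_lines(log_path: str, keyword: str = "exception") -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose text contains `keyword`, ignoring case.

    Hypothetical helper that mirrors `grep -i -n Exception <log_file>`.
    """
    needle = keyword.lower()
    hits: List[Tuple[int, str]] = []
    # errors="replace" keeps the scan going if the log has undecodable bytes.
    with Path(log_path).open(errors="replace") as log:
        for number, line in enumerate(log, start=1):
            if needle in line.lower():
                hits.append((number, line.rstrip("\n")))
    return hits
```

An empty list corresponds to `grep` printing nothing, that is, no line in the log mentions an exception.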

tangq commented 2 years ago

The log file for the second try (only checking the 3 failed files) is at cori:/global/cscratch1/sd/tang30/E3SMv2/v2.NARRM.piControl/check2/out. `grep -i Exception` returns nothing.

The log file for the first try is: /global/cscratch1/sd/tang30/E3SMv2/v2.NARRM.piControl/zstash_check_20211209.log. That check is still running, and grepping for "exception" returns the 3 files.

Good to know about the "exception" keyword: it returns fewer messages than "error".
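
To collect the names of the tars that failed, so that only those are rechecked, a log could be parsed along these lines. This is a rough sketch that assumes the error lines look exactly like the ones quoted at the top of this issue; the function name `failed_tars` and the regular expression are illustrative and not part of zstash.

```python
import re
from typing import Iterable, List

# Matches only the ERROR/Exception lines, not the INFO "Transferring file" lines
# that are logged for every tar. The line format is assumed from the log above.
FAILED_TAR = re.compile(
    r"^(?:ERROR|Exception): Error=Transferring file from HPSS: (\S+\.tar)"
)


def failed_tars(log_lines: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated tar names found in transfer error lines."""
    names = set()
    for line in log_lines:
        match = FAILED_TAR.match(line)
        if match:
            names.add(match.group(1))
    return sorted(names)
```

Each failure appears twice in the log (once as `ERROR:` and once as `Exception:`), which is why the names are de-duplicated.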

forsyth2 commented 2 years ago

Good to know about the "exception" keyword

Great, I'm planning to include a note about that with #168.

golaz commented 2 years ago

I have also seen errors like this before. If the error is not reproducible, it is very likely that it was caused by some intermittent hsi issue or unavailability.

Also, I have found that hsi errors are more likely when retrieving to CSCRATCH. I now use cfs to run zstash check and it seems more reliable.
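
If the failures really are intermittent, one workaround is to rerun the failing command after a pause. The sketch below is a generic retry wrapper, not something zstash provides: the name `run_with_retries` and its defaults are hypothetical, and it assumes the wrapped command reports failure through a nonzero exit status, which may not hold for every failure mode.

```python
import subprocess
import time
from typing import Sequence


def run_with_retries(
    command: Sequence[str], attempts: int = 3, delay_seconds: float = 60.0
) -> int:
    """Run `command` until it exits with status 0; return the successful attempt number.

    Raises RuntimeError if all attempts fail. Hypothetical helper, not zstash API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Give a transient hsi problem some time to clear before retrying.
            time.sleep(delay_seconds)
    raise RuntimeError(f"command failed after {attempts} attempts: {list(command)}")
```

The command is passed as a list of arguments, so whatever check invocation is being used can be wrapped unchanged. A retry only hides transient failures; an error that persists across attempts still needs to be investigated.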

wagmanbe commented 10 months ago

I have also intermittently run into the message "This command includes `hsi`. Be sure that you have logged into `hsi`." I can connect to hsi. I'm wondering whether anyone has figured out a solution?

forsyth2 commented 10 months ago

I can connect to hsi. I'm wondering whether anyone has figured out a solution?

@wagmanbe #314 -- that error message offers one possible solution (probably the most common), but there may be other things wrong.

wagmanbe commented 10 months ago

Thank you. It's helpful to know that there is not one obvious cause. I'll keep troubleshooting. And yes, I can log into hsi manually, e.g. `> hsi`