askap-craco / CELEBI

The CRAFT Effortless Localisation and Enhanced Burst Inspection Pipeline

Astropy cache not cleared, creating lock file #444

Closed · marcinglowacki closed this issue 1 year ago

marcinglowacki commented 1 year ago

During a run of the pipeline, it errored due to an astropy cache issue (see the cache-lock RuntimeError in the log below):

    WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.04 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/mglowack/.astropy/cache/download/py3/lock. Lock claims to be held by process 8255.') [astropy.time.core]
    /fred/oz002/askap/craft/craco/requests/requests/__init__.py:104: RequestsDependencyWarning: urllib3 (2.0.0.dev0) or chardet (None)/charset_normalizer (2.0.12) doesn't match a supported version!
      RequestsDependencyWarning)
    field_00.jmfit
    Traceback (most recent call last):
      File "/fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//RACS_lookup.py", line 103, in <module>
        _main()
      File "/fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//RACS_lookup.py", line 33, in _main
        table = RACS_lookup(coord.ra_hms, coord.dec_dms, casdatap)
      File "/fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//RACS_lookup.py", line 93, in RACS_lookup
        f"SELECT * FROM casda.continuum_component where 1=CONTAINS(POINT('ICRS', ra_deg_cont, dec_deg_cont),CIRCLE('ICRS',{c.ra.value},{c.dec.value},0.005)) and project_id = 23"
      File "/fred/oz002/askap/craft/craco/astroquery/astroquery/utils/tap/core.py", line 414, in launch_job_async
        autorun)
      File "/fred/oz002/askap/craft/craco/astroquery/astroquery/utils/tap/core.py", line 627, in launchJob
        verbose=verbose)
      File "/fred/oz002/askap/craft/craco/astroquery/astroquery/utils/tap/conn/tapconn.py", line 273, in execute_tappost
        return self.execute_post(context, data, content_type, verbose)
      File "/fred/oz002/askap/craft/craco/astroquery/astroquery/utils/tap/conn/tapconn.py", line 426, in execute_post
        conn.request("POST", context, data, self.postHeaders)
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 1244, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 1290, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 1239, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 1026, in _send_output
        self.send(msg)
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 966, in send
        self.connect()
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 1406, in connect
        super().connect()
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/http/client.py", line 938, in connect
        (self.host,self.port), self.timeout, self.source_address)
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/socket.py", line 707, in create_connection
        for res in getaddrinfo(host, port, 0, SOCK_STREAM):
      File "/apps/skylake/software/Python/3.7.4-gni-2020.0/lib/python3.7/socket.py", line 748, in getaddrinfo
        for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
    socket.gaierror: [Errno -2] Name or service not known
    WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.01 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/mglowack/.astropy/cache/download/py3/lock. Lock claims to be held by process 8255.') [astropy.time.core]

    Arguments specified: Namespace(askapnames='210912_names.dat', askappos='210912_ASKAP.dat', first=None, frbtitletext='210912', nvss=None, racs='210912_RACS.dat', sumss=None, vlass=None)

    [] [] [] [] [] [] [] []
    Traceback (most recent call last):
      File "/fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//src_offsets.py", line 1108, in <module>
        askap2racs_offsets_ra, raweight_racs
      File "/fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//src_offsets.py", line 967, in weighted_avg_and_std
        wt_average = np.average(offsetval, weights=weights ** 2)
      File "<__array_function__ internals>", line 6, in average
      File "/apps/skylake/software/numpy/1.19.2-gni-2020.0-Python-3.7.4/lib/python3.7/site-packages/numpy-1.19.2-py3.7-linux-x86_64.egg/numpy/lib/function_base.py", line 410, in average
        "Weights sum to zero, can't be normalized")
    ZeroDivisionError: Weights sum to zero, can't be normalized
    /fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//weighted_multi_image_fit_updated.py:29: UserWarning: loadtxt: Empty input file: "askap2racs_offsets_unc.dat"
      data = np.loadtxt(infile) # atual data read statement

      #########################################################
      ## Clancy's slightly less dodgy error analysis program ##
      #########################################################

    Reading in data from 1 input files
    Loading data infile askap2racs_offsets_unc.dat ...
    ERROR: Please ensure data is in correct format
    Four rows: ra offset, dec offset, ra err, dec err
    and one column per source.
    WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.03 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /home/mglowack/.astropy/cache/download/py3/lock. Lock claims to be held by process 8255.') [astropy.time.core]
    Traceback (most recent call last):
      File "/fred/oz002/askap/craft/craco/craco-postproc/craco_postproc/pipelines/../localise//apply_offset.py", line 45, in <module>
        offset_ra, offset_ra_err, offset_dec, offset_dec_err = np.loadtxt(args.offset)
      File "/apps/skylake/software/numpy/1.19.2-gni-2020.0-Python-3.7.4/lib/python3.7/site-packages/numpy-1.19.2-py3.7-linux-x86_64.egg/numpy/lib/npyio.py", line 961, in loadtxt
        fh = np.lib._datasource.open(fname, 'rt', encoding=encoding)
      File "/apps/skylake/software/numpy/1.19.2-gni-2020.0-Python-3.7.4/lib/python3.7/site-packages/numpy-1.19.2-py3.7-linux-x86_64.egg/numpy/lib/_datasource.py", line 195, in open
        return ds.open(path, mode, encoding=encoding, newline=newline)
      File "/apps/skylake/software/numpy/1.19.2-gni-2020.0-Python-3.7.4/lib/python3.7/site-packages/numpy-1.19.2-py3.7-linux-x86_64.egg/numpy/lib/_datasource.py", line 535, in open
        raise IOError("%s not found." % path)

The initially attempted solution was to clear the user's astropy cache at the beginning of every run of apply_offset.
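
For reference, a minimal sketch of that kind of workaround (not the exact pipeline change; the lock path is taken from the warning above):

```bash
# Hedged sketch of the attempted workaround, not the exact pipeline change:
# clear the astropy download cache before apply_offset runs, and fall back to
# removing the stale lock directory named in the warning if the cache is stuck.
python -c "from astropy.utils.data import clear_download_cache; clear_download_cache()" \
    || rm -rf "$HOME/.astropy/cache/download/py3/lock"
```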

marcinglowacki commented 1 year ago

As per the response below from the OzSTAR helpdesk, switching the localise.nf jobs to the datamover nodes, on a per-process basis, solved the issue.

> indeed, it is by design that skylake/sstar jobs cannot contact the outside world. this is mainly 'cos we don't want cpus idle whilst jobs wait for slow web connections.
>
> the nodes in our datamover queue have external network access. you can submit jobs to these using eg.
>
>     #SBATCH -p datamover
>
> nodes in the datamover queue have low memory and cpu limits. they aren't really suitable to run a compute job on.
>
> a fairly typical workflow would be to submit a dm job to download what is needed, and then have the last line in the dm batch script be "sbatch skylake job". another way to do it would be to submit both jobs at once, but with a `--dependency=afterok:job_id` on the compute job.
>
> compiled codes may need recompiling to run on dm's as they're older cpus than skylakes. you can get to them directly from farnarkle1/2 with eg. `ssh dm2`
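
For reference, a minimal sketch of the two-job pattern described above; the script names download_job.sh and compute_job.sh are placeholders, not files from the pipeline:

```bash
# Hypothetical sketch of the helpdesk's suggested workflow (script names are
# placeholders): run the network-dependent step on the datamover partition,
# then start the skylake compute job only after it completes successfully.
dl_jobid=$(sbatch --parsable --partition=datamover download_job.sh)
sbatch --partition=skylake --dependency=afterok:"${dl_jobid}" compute_job.sh
```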