Test second shell script `MCMC_800_1s-1.sh`

gcapes commented 5 months ago

Check I can get this to run on CSF3

gcapes commented 5 months ago

I get this error message from the imports section of the python script:

  import pandas as pd
2024-04-04 10:06:08.867980: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-04 10:06:11.265771: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-04 10:06:11.267078: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-04 10:06:45.939341: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

@RoryAtBar Have you encountered this previously?

I've made sure to install the correct versions of tensorflow and tensorflow-probability, which I've now added to a requirements file and documented in the csf setup file.

The versions of packages I have from pip freeze are here:

absl-py==2.1.0
arviz==0.17.0
astunparse==1.6.3
cachetools==5.3.2
certifi==2023.11.17
charset-normalizer==3.3.2
check-shapes==1.1.1
cloudpickle==3.0.0
cons==0.4.6
contourpy==1.2.0
cycler==0.12.1
decorator==5.1.1
Deprecated==1.2.14
dm-tree==0.1.8
dropstackframe==0.1.0
etuples==0.3.9
fastprogress==1.0.3
filelock==3.13.1
flatbuffers==23.5.26
fonttools==4.47.2
gast==0.4.0
google-auth==2.27.0
google-auth-oauthlib==1.0.0
google-pasta==0.2.0
gpflow==2.9.0
grpcio==1.60.0
h5netcdf==1.3.0
h5py==3.10.0
idna==3.6
jax==0.4.24
keras==2.12.0
kiwisolver==1.4.5
lark==1.1.9
libclang==16.0.6
logical-unification==0.4.6
Markdown==3.5.2
MarkupSafe==2.1.4
matplotlib==3.8.2
miniKanren==1.0.3
ml-dtypes==0.2.0
multipledispatch==1.0.0
numpy==1.24.3
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==23.2
pandas==2.2.0
pillow==10.2.0
protobuf==4.23.4
pyasn1==0.5.1
pyasn1-modules==0.3.0
pymc==5.10.3
pyparsing==3.1.1
pytensor==2.18.6
python-dateutil==2.8.2
pytz==2023.4
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.12.0
six==1.16.0
tabulate==0.9.0
tensorboard==2.12.3
tensorboard-data-server==0.7.2
tensorflow==2.12.1
tensorflow-estimator==2.12.0
tensorflow-io-gcs-filesystem==0.35.0
tensorflow-probability==0.20.1
termcolor==2.4.0
toolz==0.12.1
typing_extensions==4.5.0
tzdata==2023.4
urllib3==2.2.0
Werkzeug==3.0.1
wrapt==1.14.1
xarray==2024.1.1
xarray-einstats==0.7.0

gcapes commented 4 months ago

Rory suggested trying gpflow <= 2.5.2

gcapes commented 4 months ago

Have resubmitted with gpflow=2.5.2 and it looks to be running so far...

gcapes commented 4 months ago

Ok so I get what looks to be sensible output, but also this error. Should I be concerned/do you know how I can fix this? @RoryAtBar

  import pandas as pd
2024-04-25 09:17:55.314937: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 09:17:58.346260: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-25 09:17:58.347395: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-25 09:18:38.423735: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-04-25 10:17:51.302837: W tensorflow/core/kernels/linalg/cholesky_op.cc:56] Cholesky decomposition was not successful. Eigen::LLT failed with error code 1. Filling lower-triangular output with NaNs.
Traceback (most recent call last):
  File "/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/MCMC_800C_1s-1.py", line 432, in <module>
    idata = pm.sample(tune=10000, draws=20000, step=step,cores=1, chains=5)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/.venv/lib/python3.11/site-packages/pymc/sampling/mcmc.py", line 744, in sample
    model.check_start_vals(ip)
  File "/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/.venv/lib/python3.11/site-packages/pymc/model/core.py", line 1660, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'Friction_interval__': array(0.), 'Conductance_interval__': array(0.)}

Logp initial evaluation results:
{'Friction': -1.39, 'Conductance': -1.39, 'likelihood': nan}
You can call `model.debug()` for more details.

RoryAtBar commented 4 months ago

The issue with running the script on GPUs I'm not sure about, but it doesn't sound like a major problem.
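For reference, the TensorFlow lines prefixed with `I` and `W` in the log above are informational and warning messages rather than errors. A minimal sketch of quietening them, assuming the variable is set before TensorFlow (or GPflow, which imports it) is first imported in the script:

```python
import os

# Must run before `import tensorflow` / `import gpflow`.
# "1" hides INFO lines (the "I ..." messages); "2" would also hide WARNING lines.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"

tf_log_level = os.environ["TF_CPP_MIN_LOG_LEVEL"]
```

Level "1" is probably the safer choice here, since level "2" would also hide warnings such as a failed Cholesky decomposition.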

This issue with initial evaluation results, yes I have encountered it before. The problem is essentially that the likelihood function is somehow mis-specified, and is giving spurious results, so the chains are being initialised outside of what should be allowed by the prior probability distribution (which is specified in the pm.Model() context manager).

The likelihood function uses the Gaussian process model. There could be something wrong with the GP, does the script plot the fit of the GP? If the GP looks ok, then I'll need to plot out some of the values of the likelihood function.

Might be worth me having a play with the script, I can have a look early next week
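The `nan` likelihood appears alongside the "Cholesky decomposition was not successful" warning in the log. That warning generally means a covariance matrix was not numerically positive definite, and a common remedy is to add a small "jitter" to its diagonal. The pure-Python sketch below only illustrates that idea on a toy matrix; it is not what the script or GPflow does internally, and the jitter levels are arbitrary assumptions.

```python
import math


def cholesky(matrix):
    """Return the lower-triangular Cholesky factor, or raise ValueError."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def cholesky_with_jitter(matrix, jitters=(0.0, 1e-8, 1e-6, 1e-4)):
    """Retry the factorisation with increasing diagonal jitter."""
    for jitter in jitters:
        shifted = [
            [value + (jitter if r == c else 0.0) for c, value in enumerate(row)]
            for r, row in enumerate(matrix)
        ]
        try:
            return cholesky(shifted), jitter
        except ValueError:
            continue
    raise ValueError("no jitter level made the matrix positive definite")


# Two identical inputs give a singular kernel matrix: [[1, 1], [1, 1]].
singular_kernel = [[1.0, 1.0], [1.0, 1.0]]
factor, used_jitter = cholesky_with_jitter(singular_kernel)
```

With the singular toy matrix, the plain factorisation fails and the first non-zero jitter level is the one that succeeds.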

gcapes commented 4 months ago

Hi Rory, you asked on Slack:

> I think the script plots the gaussian process against the FEM data. Did the script produce a JPEG file that shows blue lines running through black dots?

Not that I can see - I guess this means it's an important error :)

gcapes commented 3 months ago

Hi @RoryAtBar Did you manage to have a look at this?

RoryAtBar commented 3 months ago

Hi Gerard,

I have been very ill this week so haven't. I will look at it asap.

Rory

RoryAtBar commented 3 months ago

Hi Gerard,

I have added a solution to an extra branch (gp_kernel_tester) which trains GP models of increasing flexibility until one works. It's crude and not scientifically rigorous but it is adequate for this specific problem, though might need to be changed at a later date if a more general solution is needed.

Seems to be working for now.
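The "increasing flexibility until one works" approach can be sketched roughly as below. This is an illustration only, not the code in the `gp_kernel_tester` branch: the kernel names, the `fit` callable and the use of a finite log-likelihood as the success test are all assumptions.

```python
import math


def first_working_model(candidates, fit):
    """Return (name, model) for the first candidate whose fit is finite.

    `candidates` is ordered from least to most flexible; `fit` is a
    hypothetical callable returning (model, log_likelihood).
    """
    for name in candidates:
        try:
            model, log_likelihood = fit(name)
        except (ValueError, ArithmeticError):
            continue  # e.g. a failed Cholesky decomposition
        if math.isfinite(log_likelihood):
            return name, model
    raise RuntimeError("no candidate kernel produced a finite fit")


# Stand-in for real GP training: pretend the two simplest kernels fail.
def fake_fit(name):
    outcomes = {"kernel_a": float("nan"), "kernel_b": None, "kernel_c": -12.5}
    value = outcomes[name]
    if value is None:
        raise ValueError("Cholesky decomposition failed")
    return f"model<{name}>", value


chosen_name, chosen_model = first_working_model(
    ["kernel_a", "kernel_b", "kernel_c"], fake_fit
)
```

Stopping at the first success is what makes this crude: a later candidate might fit better, but it is never tried.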

gcapes commented 3 months ago

Just submitted a job using this new script.

gcapes commented 3 months ago

AttributeError: module 'gpflow.models' has no attribute 'Matern52' @RoryAtBar any ideas on this one?

RoryAtBar commented 3 months ago

Hang on, I think I know what this one is. I will sort it out

RoryAtBar commented 3 months ago

This is from MCMC_800C_1s-1.py right?

The error makes it sound like somewhere in the code there is a line that says: gpflow.models.Matern52()

If that was the case, then the fix is to change this line to gpflow.models.GPR() and make sure that the kernel is specified correctly i.e.

kernel=gpflow.kernels.Matern52()

where

model = gpflow.models.GPR(
    (X_normed, Y[cond_filter, None]),
    kernel=gpflow.kernels.Matern52(
        np.shape(X_normed)[-1],
        lengthscales=np.ones(np.shape(X_normed)[-1]),
    ),
)

I had this previously because when creating the branch gp_kernel_tester, I had put this in by mistake and fixed it. When you sent this error, I presumed I had simply forgotten to push it to github. I can't however find this error in the code, would you be able to direct me to it?

gcapes commented 3 months ago

Looks like you found it :) With the changes you made in 943fcc4 and e407eb91 this script now looks to be running ok.

gcapes commented 3 months ago

@RoryAtBar Could you take a quick look at these logs and confirm whether they're as expected?

$ cat MCMC_800C_1s-1.sh.e4990184 
mkdir: cannot create directory ‘/mnt/iusers01/support/mbexegc2/scratch/MCMC_GPsurrgt_800C_1s-1_cond0-1500_20000_chain’: File exists
/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/MCMC_800C_1s-1.py:10: DeprecationWarning: 
Pyarrow will become a required dependency of pandas in the next major release of pandas (pandas 3.0),
(to allow more performant data types, such as the Arrow string type, and better interoperability with other libraries)
but was not found to be installed on your system.
If this would cause problems for you,
please provide us feedback at https://github.com/pandas-dev/pandas/issues/54466

  import pandas as pd
2024-06-04 09:15:26.741914: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-04 09:15:29.923159: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-06-04 09:15:29.924662: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-06-04 09:16:18.024203: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-06-04 09:44:59.306185: W tensorflow/core/kernels/linalg/cholesky_op.cc:56] Cholesky decomposition was not successful. Eigen::LLT failed with error code 1. Filling lower-triangular output with NaNs.
2024-06-04 09:45:06.553742: W tensorflow/core/kernels/linalg/cholesky_op.cc:56] Cholesky decomposition was not successful. Eigen::LLT failed with error code 1. Filling lower-triangular output with NaNs.
/net/scratch2/mbexegc2/Abaqus_bayesian_matflow/MCMC_800C_1s-1.py:465: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  Force_at_800C_1s[n] = dat[:,1][abs((dat[:,0]+x_correction)-step)==min(abs((dat[:,0]+x_correction)-step))]
Sequential sampling (5 chains in 1 job)
CompoundStep
>Metropolis: [Friction]
>Metropolis: [Conductance]
Sampling 5 chains for 10_000 tune and 20_000 draw iterations (50_000 + 100_000 draws total) took 23522 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

The output log file looks ok, except that the progress bar lines show Mac-style (carriage return) line endings, which don't match the rest of the file. Do you know which part of the code generates these?

MCMC_800C_1s-1.sh.o4990184: |█████████████████████████████| 100.00% [30000/30000 1:16:54<00:00 Sampling chain 4, 0 divergences]
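If the stray line endings come from the progress bar redrawing itself with carriage returns (an assumption, but typical for terminal progress bars), the log can be tidied after the run by keeping only the final state of each overwritten line. A minimal sketch, with a made-up log string:

```python
def collapse_progress_updates(text):
    """Keep only the final state of each carriage-return-overwritten line."""
    cleaned = []
    for line in text.split("\n"):
        # Segments separated by "\r" overwrite one another on a terminal;
        # the last non-empty one is what would remain visible.
        segments = [segment for segment in line.split("\r") if segment]
        cleaned.append(segments[-1] if segments else "")
    return "\n".join(cleaned)


raw_log = "start\n 50.00% [1/2]\r100.00% [2/2]\ndone\n"
clean_log = collapse_progress_updates(raw_log)
```

Alternatively, `pm.sample` accepts `progressbar=False`, which should avoid writing the bar to the batch log in the first place.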

RoryAtBar commented 3 months ago

1) I don't know much about pyarrow

2) the failed Cholesky decomposition is hopefully dealt with by the additions in the gp_kernel_tester branch, but I will check. I created an additional output.txt file since printing to the standard output file doesn't always work

3) the progress bar is generated to track the progress of the sampling when called using:

idata = pm.sample()

This is called within the pymc context manager which in the code is created with this line

with pm.Model() as model:

The low effective sample size may need addressing. I will look when I get a chance later today.
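To make the rhat warning concrete, here is a rough sketch of the classic Gelman-Rubin statistic, which compares the variance between chains with the variance within them. PyMC and ArviZ report a rank-normalised split version (see the arXiv link in the log), so the numbers will not match exactly; the toy chains are invented purely for illustration.

```python
import math
import statistics


def gelman_rubin_rhat(chains):
    """Classic (non-split, non-rank-normalised) R-hat for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Toy chains: the first pair overlap, the second pair sit in different regions.
agreeing = [[0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5]]
stuck_apart = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]

rhat_good = gelman_rubin_rhat(agreeing)
rhat_bad = gelman_rubin_rhat(stuck_apart)
```

Chains that explore different regions push the value well above the 1.01 threshold quoted in the warning, which is the pattern to look for in the trace plots.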

RoryAtBar commented 2 months ago

Very sorry for the slow response,

The results look ok, the actual values are a bit odd, possibly because of testing this with a limited set of data (conductance limited to 1500).

I'm not getting the issue with the limited effective sample size; maybe I'm using a different set of input data to you? All I have done is use the scripts currently in the main branch.

gcapes commented 2 months ago

Hi Rory,

That's encouraging. It's been a while since I last looked at this but I think I was using the gp_kernel_tester branch.

RoryAtBar commented 2 months ago

Sure, but I don't get it on that branch either. There is a limited amount of randomness in the starting point where GPs are trained... unless you are using a different set of data I can't think what else it could be...

gcapes commented 2 months ago

Might be different versions of the libraries perhaps? I'll try to have another look at this next week when I've got back up to speed with things :)

gcapes commented 1 month ago

I'll re-run this next week to see if I still get the error. Rory said there's a bit of randomness involved and I might have got a bad seed. It can be set up to re-start if it fails, but currently isn't.
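A restart-on-failure wrapper could look roughly like the sketch below. The `run_once` callable and the seed list are hypothetical stand-ins for the real GP training and sampling step; it also assumes failures surface as `RuntimeError` (which I believe PyMC's `SamplingError` derives from, but that is worth checking).

```python
def run_with_retries(run_once, seeds):
    """Call `run_once(seed)` for each seed until one attempt succeeds."""
    failures = []
    for seed in seeds:
        try:
            return seed, run_once(seed)
        except RuntimeError as error:
            failures.append((seed, str(error)))
    raise RuntimeError(f"all attempts failed: {failures}")


# Stand-in for the real GP training + sampling step (hypothetical).
def fake_run(seed):
    if seed % 2 == 0:
        raise RuntimeError("initial evaluation of model failed")
    return f"result for seed {seed}"


used_seed, result = run_with_retries(fake_run, seeds=[0, 2, 3, 5])
```

Recording which seed succeeded makes the run reproducible afterwards, for example by passing it on to `pm.sample(random_seed=...)`.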

gcapes commented 1 month ago

I forgot that this script uses the output from the first one... I was tidying up and deleted the output so I'm running it again before I can run the second script. :roll_eyes:

gcapes commented 1 month ago

Second script now running using the test-second-step branch, having run the first job using the main branch.

gcapes commented 1 month ago

Same error - re-reading some detail, I see this was the wrong branch! Resubmitting on gp_kernel_tester

gcapes commented 2 weeks ago

I think this has run successfully now. Is this image any / a good measure that the job has run well?

If so I'll move on to trying to re-jig the code into MatFlow

RoryAtBar commented 2 weeks ago

Thanks Gerard,

Unfortunately, the image shows an extreme case of overfitting. I have re-jigged the way the Gaussian processes are trained for the part of the project I am currently working through. At the risk of you killing me, can we have a call where I show you how I want it to work?

1) I want to change the GP from fitting individual data points to fitting basis functions using scikit-fda

2) Randomly separate out training data and validation data and test the fit of the validation data (about 20% of the samples to be used not for conditioning the GP, but for checking that the predicted values fit correctly)

3) Automatically check which of four kernels fits best rather than picking the first one that fits at all

Then there is the MCMC step in that script that needs a small modification to adapt to the above change
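Points 2 and 3 above can be sketched in plain Python as follows. This is only an outline under stated assumptions: the kernel names and validation errors are made up, the real errors would come from fitted GPs, and the scikit-fda basis-function step is not covered.

```python
import random


def split_train_validation(samples, validation_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of the samples for validation."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def best_kernel(kernel_names, validation_error):
    """Return the kernel with the lowest validation error, not the first that fits."""
    return min(kernel_names, key=validation_error)


train, validation = split_train_validation(range(50))

# Hypothetical validation errors; real ones would come from each fitted GP.
fake_errors = {
    "kernel_a": 0.9,
    "kernel_b": 0.2,
    "kernel_c": 0.4,
    "kernel_d": float("inf"),  # e.g. a kernel whose fit failed outright
}
selected = best_kernel(fake_errors, fake_errors.get)
```

Scoring on held-out samples is what guards against the overfitting seen in the image: a kernel that merely interpolates the training points would score poorly on the validation set.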

gcapes commented 2 weeks ago

Sure - I could do tomorrow or Friday?

RoryAtBar / Abaqus_bayesian_matflow

Test second shell script `MCMC_800_1s-1.sh` #5