JingHuangLab / openmm_deepmd_plugin


Errors when running `make test` #19

Closed varun-go closed 9 months ago

varun-go commented 11 months ago

Thank you for creating this plugin!

I am having an issue during installation. The `make` and `make install` commands run successfully, but when I run `make test`, all five tests fail, as shown below.

Running tests...
/common/software/install/spack/linux-centos7-ivybridge/gcc-13.1.0/cmake-3.26.3-em4tlmoxf4uhny57w2i5f6vfdfk6v4wy/bin/ctest --force-new-ctest-process 
Test project gopal145/software/openmm_deepmd_plugin/build
    Start 1: TestSerializeDeepmdForce
1/5 Test #1: TestSerializeDeepmdForce .........***Failed    3.20 sec
    Start 2: TestDeepmdPlugin4Reference
2/5 Test #2: TestDeepmdPlugin4Reference .......***Failed    0.58 sec
    Start 3: TestDeepmdPlugin4CUDASingle
3/5 Test #3: TestDeepmdPlugin4CUDASingle ......Subprocess aborted***Exception:   0.77 sec
    Start 4: TestDeepmdPlugin4CUDAMixed
4/5 Test #4: TestDeepmdPlugin4CUDAMixed .......Subprocess aborted***Exception:   0.60 sec
    Start 5: TestDeepmdPlugin4CUDADouble
5/5 Test #5: TestDeepmdPlugin4CUDADouble ......Subprocess aborted***Exception:   0.60 sec

0% tests passed, 5 tests failed out of 5

Total Test time (real) =   5.78 sec

The following tests FAILED:
      1 - TestSerializeDeepmdForce (Failed)
      2 - TestDeepmdPlugin4Reference (Failed)
      3 - TestDeepmdPlugin4CUDASingle (Subprocess aborted)
      4 - TestDeepmdPlugin4CUDAMixed (Subprocess aborted)
      5 - TestDeepmdPlugin4CUDADouble (Subprocess aborted)
Errors while running CTest

I looked at the test log file at build/Testing/Temporary/LastTest.log and found the following output:

Start testing: Dec 06 20:00 CST
----------------------------------------------------------
1/5 Testing: TestSerializeDeepmdForce
1/5 Test: TestSerializeDeepmdForce
Command: "gopal145/software/openmm_deepmd_plugin/build/TestSerializeDeepmdForce"
Directory: gopal145/software/openmm_deepmd_plugin/build
"TestSerializeDeepmdForce" start time: Dec 06 20:00 CST
Output:
----------------------------------------------------------
DeePMD-kit: Successfully load libcudart.so
DeePMD-kit WARNING: Environmental variable TF_INTRA_OP_PARALLELISM_THREADS is not set. Tune TF_INTRA_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable TF_INTER_OP_PARALLELISM_THREADS is not set. Tune TF_INTER_OP_PARALLELISM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DeePMD-kit WARNING: Environmental variable OMP_NUM_THREADS is not set. Tune OMP_NUM_THREADS for the best performance. See https://deepmd.rtfd.io/parallelism/ for more information.
DATA_LOSS: Can't parse ../python/OpenMMDeepmdPlugin/data/water.pb as binary proto
exception: DeePMD-kit C API Error: DeePMD-kit Error: TensorFlow Error: DATA_LOSS: Can't parse ../python/OpenMMDeepmdPlugin/data/water.pb as binary proto
<end of output>
Test time =   3.20 sec
----------------------------------------------------------
Test Failed.
"TestSerializeDeepmdForce" end time: Dec 06 20:00 CST
"TestSerializeDeepmdForce" time elapsed: 00:00:03
----------------------------------------------------------

The remaining four tests fail for a similar reason. I have verified that the water.pb file exists. The warnings say that certain environment variables are not set, but I am not sure that would cause the DATA_LOSS error that follows them. I also looked at issue #9 and exported the LD_LIBRARY_PATH variable, but the error was unchanged. Can someone please provide some assistance? Thank you.

varun-go commented 9 months ago

I am following up on this issue to see if someone can assist. Thank you.

varun-go commented 9 months ago

I was able to resolve this issue. The problem was that the water.pb file was not in the correct format. I confirmed this by running `file water.pb` in a shell: the output was `water.pb: ASCII text`, but the reported type should be `data`.

I cloned the repository separately on two machines, and on both `file` reported water.pb as ASCII text. A collaborator's copy of the file, however, was reported as data. With that copy, the DATA_LOSS error from the original message no longer occurs.
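
For anyone hitting the same error, the `file water.pb` check can be approximated in Python. This is a minimal sketch, not part of the plugin: `classify_model_file` and `TEXT_BYTES` are hypothetical names, and the printable-byte test is only a rough stand-in for the heuristics `file` uses. The Git LFS pointer check is an assumption: nothing in this thread confirms that the text version of water.pb was an LFS pointer, although that is one common way a binary file ends up as a small text file after cloning.

```python
import sys
from pathlib import Path

# First line of a Git LFS pointer file (assumption: the thread does not
# confirm that the ASCII water.pb was an LFS pointer).
LFS_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Bytes treated as "text": printable ASCII plus tab, newline, carriage return.
# This only approximates what the `file` command does.
TEXT_BYTES = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def classify_model_file(path, sample_size=1024):
    """Return 'lfs-pointer', 'text', or 'binary' from the first bytes of path."""
    with Path(path).open("rb") as handle:
        head = handle.read(sample_size)
    if head.startswith(LFS_PREFIX):
        return "lfs-pointer"
    if all(byte in TEXT_BYTES for byte in head):
        return "text"
    return "binary"


if __name__ == "__main__":
    # Classify every path given on the command line.
    for name in sys.argv[1:]:
        print(f"{name}: {classify_model_file(name)}")
```

Saved under a hypothetical name such as check_pb.py, it could be run as `python check_pb.py ../python/OpenMMDeepmdPlugin/data/water.pb`. A result of `binary` is consistent with `file` reporting data; `text` or `lfs-pointer` suggests the copy is not a usable model file.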