FNNDSC / pman

A process management system written in Python
MIT License

Pman is unable to handle concurrent Med2Img Jobs #138

Closed. arnavn101 closed this issue 4 years ago

arnavn101 commented 4 years ago

I was testing pfcon's handling of concurrent jobs with FS and DS plugins, using a Python script that executed pfurl requests to run Med2Img jobs on multiple DICOM images. pman was not able to process all of the jobs and returned an error; however, when I ran the same jobs with a large time gap (30-60 sec) between submissions, pman completed all of them successfully.
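Schematically, the submission pattern in my script was the following (a minimal sketch, not the actual test script; build_pfurl_cmd is a hypothetical stand-in and the real pfurl arguments differ):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def build_pfurl_cmd(job_tag):
        # Hypothetical stand-in: builds the pfurl command line that POSTs
        # one pl-med2img run request to pfcon; the flags shown here are
        # illustrative, not the exact invocation.
        return ['pfurl', '--verb', 'POST', '--http', 'localhost:5005/api/v1/cmd']

    def run_job(job_tag):
        # Each job is an independent pfurl request; nothing waits for the
        # previous job to finish before the next one is submitted.
        return subprocess.run(build_pfurl_cmd(job_tag)).returncode

    jobs = ['job-%d' % i for i in range(5)]          # illustrative job tags
    with ThreadPoolExecutor(max_workers=5) as pool:
        codes = list(pool.map(run_job, jobs))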

These are the steps that I went through:

[First Terminal]

  1. git clone git@github.com:FNNDSC/pfcon.git
  2. cd pfcon
  3. ./unmake.sh ; sudo rm -fr FS ; rm -fr FS ; ./make.sh

[Second Terminal (in the pfcon directory)]

  1. git clone https://github.com/FNNDSC/SAG-anon
  2. export DICOMDIR=$(pwd)/SAG-anon
  3. docker pull fnndsc/pl-med2img
  4. ./swiftCtl.sh -A push -E dcm -D $DICOMDIR -P chris/uploads/DICOM/dataset1

After pushing the DICOM files to Swift, I ran a Python script that executed FS and DS plugins on pfcon.

  1. git clone git@github.com:arnavnidumolu/ChRIS-E2E.git

  2. cd ChRIS-E2E/scale-testing/

  3. I set up my configuration options in config.cfg --> nano config.cfg (edit CHRIS_PATH)

  4. Lastly, I executed the Python script --> python test_pfcon.py

The Python script uses these two bash scripts to run the FS and DS plugins:

  1. FS Plugin Script
  2. DS Plugin (Med2Img) Script

Analysis

The FS plugin job ran successfully and returned a valid response. The first two DS plugin jobs also succeeded, but the subsequent DS plugin jobs never returned a "finishedSuccessfully" response.

I used docker-compose -f docker-compose_dev.yml logs -f pman_service in the pfcon directory to view the pman container logs, which showed this error message:

pman_service_1   | Traceback (most recent call last):
pman_service_1   |   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
pman_service_1   |     self.run()
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 567, in run
pman_service_1   |     self.within.DB_fileIO(cmd = 'save')
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 314, in DB_fileIO
pman_service_1   |     if self.str_fileio   == 'json':     saveToDiskAsJSON(tree_DB)
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 274, in saveToDiskAsJSON
pman_service_1   |     tree_DB.tree_save(
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pfmisc/C_snode.py", line 1326, in tree_save
pman_service_1   |     self.treeExplore(**kwargs)
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pfmisc/C_snode.py", line 1424, in treeExplore
pman_service_1   |     ret = f(str_startPath, **kwargs)
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pfmisc/C_snode.py", line 1140, in node_save
pman_service_1   |     str_pathDiskOrig    = os.getcwd()
pman_service_1   | FileNotFoundError: [Errno 2] No such file or directory

After a few seconds, pman logged another error message:

pman_service_1   | Traceback (most recent call last):
pman_service_1   |   File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
pman_service_1   |     self.run()
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 667, in run
pman_service_1   |     resultFromProcessing    = self.process(request)
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 2165, in process
pman_service_1   |     self.processPOST(   request = d_request,
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 2240, in processPOST
pman_service_1   |     d_done              = eval("self.t_%s_process(request = d_request)" % payload_verb)
pman_service_1   |   File "<string>", line 1, in <module>
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pman/pman.py", line 1071, in t_status_process
pman_service_1   |     self.dp.qprint("------- In status process ------------")
pman_service_1   |   File "/usr/local/lib/python3.8/dist-packages/pfmisc/debug.py", line 131, in qprint
pman_service_1   |     stack = inspect.stack()
pman_service_1   |   File "/usr/lib/python3.8/inspect.py", line 1514, in stack
pman_service_1   |     return getouterframes(sys._getframe(1), context)
pman_service_1   |   File "/usr/lib/python3.8/inspect.py", line 1491, in getouterframes
pman_service_1   |     frameinfo = (frame,) + getframeinfo(frame, context)
pman_service_1   |   File "/usr/lib/python3.8/inspect.py", line 1461, in getframeinfo
pman_service_1   |     filename = getsourcefile(frame) or getfile(frame)
pman_service_1   |   File "/usr/lib/python3.8/inspect.py", line 708, in getsourcefile
pman_service_1   |     if getattr(getmodule(object, filename), '__loader__', None) is not None:
pman_service_1   |   File "/usr/lib/python3.8/inspect.py", line 737, in getmodule
pman_service_1   |     file = getabsfile(object, _filename)
pman_service_1   |   File "/usr/lib/python3.8/inspect.py", line 721, in getabsfile
pman_service_1   |     return os.path.normcase(os.path.abspath(_filename))
pman_service_1   |   File "/usr/lib/python3.8/posixpath.py", line 379, in abspath
pman_service_1   |     cwd = os.getcwd()
pman_service_1   | FileNotFoundError: [Errno 2] No such file or directory
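Both tracebacks bottom out in os.getcwd() raising FileNotFoundError, which Python raises when the process's current working directory has been deleted from the filesystem. A minimal reproduction of just that failure mode, independent of pman:

    import os
    import tempfile

    d = tempfile.mkdtemp()
    os.chdir(d)     # make the scratch directory the process-wide cwd
    os.rmdir(d)     # delete it while it is still the cwd
    os.getcwd()     # FileNotFoundError: [Errno 2] No such file or directory

Since os.chdir() changes the working directory for the whole process rather than for a single thread, my guess (not verified) is that one job's save thread changes or removes the working directory while another job's thread still depends on it, which would explain why spacing the jobs out avoids the crash.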

Even though pman logged an error message, I confirmed that it actually ran the job: the converted jpg files from the DS plugin are present in the FS directory that was created in the pfcon directory.

ls FS/remote/key-16/outgoing/   # 16 refers to the job ID

which returns:

sample16-slice001.jpg  sample16-slice040.jpg  sample16-slice079.jpg  sample16-slice118.jpg  sample16-slice157.jpg ...
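The same check can be scripted; a small sketch, with the path and job ID as above:

    from pathlib import Path

    out = Path('FS/remote/key-16/outgoing')   # 16 is the job ID
    jpgs = sorted(out.glob('*.jpg'))
    print('%d slices converted' % len(jpgs))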

Conclusion

Since the DICOM files were successfully converted and stored in the FS directory, the job should have returned a "finishedSuccessfully" response. Instead, pman logged an error after running the DS plugin and did not report success when the pfurl status command was executed. pman ran the first two jobs successfully and returned valid responses, but it could not handle the jobs that followed. All jobs completed successfully only when there was a time gap between submissions, giving pman time to finish the current job before moving on to the next.
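For completeness, the only submission pattern that made every job pass was serializing the requests with an explicit delay, along these lines (a sketch; run_job and jobs as in the earlier snippet):

    import time

    for job in jobs:
        run_job(job)     # submit one job
        time.sleep(45)   # a 30-60 s gap lets pman finish before the next job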

rudolphpienaar commented 4 years ago

Great issue reporting! I'll see if I can replicate when I get a chance.