FNNDSC / pman

A process management system written in python
MIT License

single worker published file to swift #72

Closed husky-parul closed 6 years ago

husky-parul commented 6 years ago
  1. lockfile is deprecated, but fasteners was granting the lock to every worker, so I reverted to lockfile for now. This still needs work. The fasteners call I tried: fasteners.InterProcessLock('/tmp/tmp_lock_file')

  2. At times the workers terminate with the errors below, but their status stays "complete". In that case, once all workers have reached the complete status, no new workers get created and the result is not published to Swift. Is this expected behavior?

Error 1: while watching workers

('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

Error 2: while connecting to swift

Unable to establish connection to https://kaizen.massopen.cloud:5000/v3/auth/tokens: HTTPSConnectionPool(host='kaizen.massopen.cloud', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fefaf791ef0>: Failed to establish a new connection: [Errno 110] Connection timed out',))
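
For reference on item 1, here is a minimal sketch of how an exclusive, non-blocking file lock behaves. It uses the standard library's fcntl.flock (POSIX only) as a stand-in rather than fasteners.InterProcessLock, and the helper names try_lock and unlock are hypothetical. One possible explanation for every worker getting the lock, which is an assumption and not something confirmed in this thread, is that a lock file only coordinates processes that see the same file, while workers in separate pods each have their own /tmp.

```python
import fcntl
import os
import tempfile


def try_lock(path):
    """Return an open file holding an exclusive lock on path, or None if it is already held."""
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


def unlock(handle):
    """Release the lock and close the underlying file."""
    fcntl.flock(handle, fcntl.LOCK_UN)
    handle.close()


# Both attempts target the same file on the same filesystem.
lock_path = os.path.join(tempfile.mkdtemp(), "tmp_lock_file")
first = try_lock(lock_path)
second = try_lock(lock_path)
print(first is not None, second is None)
unlock(first)
```

Because both attempts open the same file, the second one is refused until the first holder releases the lock. If each caller had its own private copy of the path, every attempt would succeed.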

ravisantoshgudimetla commented 6 years ago

> At times the workers terminate with the errors below, but their status stays "complete". In that case, once all workers have reached the complete status, no new workers get created and the result is not published to Swift. Is this expected behavior?

No, we should let other workers pick this up.

ravisantoshgudimetla commented 6 years ago

@husky-parul This could be a flake. Can you close and re-open the PR to retrigger the test?

husky-parul commented 6 years ago

@danmcp @ravisantoshgudimetla Added an "emptyDir": {} volume to the worker pods. The worker that downloads the data creates a directory, /cache/winner. At upload time, only the worker that has /cache/winner uploads objects to Swift.
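
A minimal sketch of the marker approach described above, under stated assumptions: the function names are hypothetical, temporary directories stand in for each pod's emptyDir volume mounted at /cache, and the actual Swift transfer is left out.

```python
import os
import tempfile

WINNER_DIR = "winner"  # marker directory name taken from the comment above


def mark_winner(cache_dir):
    """Called by the worker that downloaded the data (hypothetical helper)."""
    os.makedirs(os.path.join(cache_dir, WINNER_DIR), exist_ok=True)


def should_upload(cache_dir):
    """True only when this pod's cache holds the marker directory."""
    return os.path.isdir(os.path.join(cache_dir, WINNER_DIR))


# Two temporary directories stand in for the /cache volume of two pods.
pod_a_cache = tempfile.mkdtemp()
pod_b_cache = tempfile.mkdtemp()

mark_winner(pod_a_cache)  # pod A is the one that downloaded the data

uploaders = [
    name
    for name, cache in (("pod-a", pod_a_cache), ("pod-b", pod_b_cache))
    if should_upload(cache)
]
print(uploaders)
```

Since an emptyDir volume is private to its pod, the marker is visible only inside the pod that created it, which is what singles out one uploader.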

husky-parul commented 6 years ago

@danmcp @ravisantoshgudimetla This PR looks done to me. The pod that downloaded from Swift waits for all image processing containers to finish. It then uploads the data to Swift, while the other publish containers exit. At this point openshift/pman-swift-publisher/watch.py looks irrelevant. The publish container runs openshift/pman-swift-publisher/put_data.py. Should I remove watch.py?

Note: I am creating a separate PR for job deletion.
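
The publish flow described in this comment could be sketched as below. This is not the contents of put_data.py. The names publish and wait_until, and the callables passed to them, are hypothetical, and the timeout and polling interval are placeholder values.

```python
import time


def wait_until(predicate, timeout=60.0, interval=0.5):
    """Poll predicate until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()


def publish(is_winner, processing_done, upload):
    """Upload from the winning pod once processing is done; other pods just exit."""
    if not is_winner:
        return "exit"
    if not wait_until(processing_done, timeout=5.0, interval=0.01):
        return "timeout"
    upload()
    return "uploaded"


# Simulated run: processing reports completion on the third poll.
polls = {"count": 0}
uploaded = []


def processing_done():
    polls["count"] += 1
    return polls["count"] >= 3


winner_result = publish(True, processing_done, lambda: uploaded.append("result"))
other_result = publish(False, processing_done, lambda: uploaded.append("result"))
print(winner_result, other_result, uploaded)
```

Passing the completion check and the upload step in as callables keeps the decision logic separate from the Swift and OpenShift calls, so it can be exercised without a cluster.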

danmcp commented 6 years ago

@ravisantoshgudimetla Any comments?

husky-parul commented 6 years ago

@ravisantoshgudimetla @danmcp what about watch.py?

danmcp commented 6 years ago

> what about watch.py?

@husky-parul Are you saying it's no longer needed? If so, I think it can be removed.