Open C4rohan opened 1 year ago
The Docker image and the commands were tested on a test instance and run correctly. I believe the current problem is the same: the commands and bootstrap I load do not execute on the app-created instance. I think the main reason is how the AWS resources linked to deploy_config.json are set up.
Hi @C4rohan, thanks for your effort on this repository. Based on your screenshot, I found the CloudWatch Log on link. The log shows that the serverless pipeline cannot find the masterInstanceId.
In deploy_config.json, you changed the parameter "DLSecurityGroup" to "sg-0f81c232cd437d43f". However, in app.py, I hardcoded "DLSecurityGroup" to "distributed_dl_starly". I believe editing app.py is the right direction to solve all the issues.
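For anyone hitting the same mismatch, here is a minimal sketch of how one might compare the values in deploy_config.json against values hardcoded elsewhere. It assumes the config is a flat JSON object keyed by parameter name, which may not match the real file layout. The `hardcoded` dictionary and both function names are hypothetical and only stand in for whatever app.py actually contains.

```python
import json


def load_config(path):
    """Load a JSON config file (assumed to be a flat object of parameters)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def find_mismatches(config, hardcoded):
    """Return {key: (config_value, hardcoded_value)} for keys whose values differ.

    Keys that are missing from the config are ignored.
    """
    return {
        key: (config[key], value)
        for key, value in hardcoded.items()
        if key in config and config[key] != value
    }


if __name__ == "__main__":
    # Values quoted in this thread; the flat layout is an assumption.
    config = {"DLSecurityGroup": "sg-0f81c232cd437d43f"}
    hardcoded = {"DLSecurityGroup": "distributed_dl_starly"}
    for key, (cfg, code) in find_mismatches(config, hardcoded).items():
        print(f"{key}: deploy_config.json has {cfg!r}, app.py has {code!r}")
```

In practice one would call `load_config("deploy_config.json")` instead of the inline dictionary and fill `hardcoded` by hand from the values found in app.py.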
Also, please try using CloudWatch to debug the commands that run on the app-created instance. I find it helpful every time I create a new AWS serverless application with the app.
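As a rough illustration of the CloudWatch suggestion, the sketch below pulls log messages that contain a given term from a log group using the boto3 `logs` client's `filter_log_events` call. The function names and the idea of searching for "masterInstanceId" are assumptions for illustration, not part of this repository. It reads only the first page of results and needs AWS credentials to run against a real account.

```python
def find_log_lines(logs_client, log_group, pattern):
    """Return messages from `log_group` that match `pattern`.

    `logs_client` is a boto3 CloudWatch Logs client (or any object with a
    compatible `filter_log_events` method). Only the first page of results
    is read; pagination is left out to keep the sketch short.
    """
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        filterPattern=pattern,
    )
    return [event["message"] for event in response.get("events", [])]


def make_client():
    """Create a real CloudWatch Logs client (requires boto3 and credentials)."""
    import boto3  # third-party; imported lazily so the helper above has no dependency

    return boto3.client("logs")
```

A call might look like `find_log_lines(make_client(), "<your Lambda log group>", "masterInstanceId")`, where the log group name has to be taken from your own deployment.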
Thank you for helping with the debugging. I have changed "DLSecurityGroup" to "sg-0f81c232cd437d43f". Thank you again for letting me know about CloudWatch.
Thanks for working on it. I want to add some background. The current example only needs one GPU node. @C4rohan is trying to run Python code in serial with one GPU or in parallel with multiple GPUs on the same node.
@starlyxxx We have run into another error. Here is the link: https://github.com/big-data-lab-umbc/Reproducible_and_portable_app_in_cloud/issues/14
I am trying to create a new AWS serverless application with the app. Here is the AWS serverless app I uploaded to git: Link. Here is the Dockerfile for the app: Link. Here are the config files in examples: Link.
I was able to run the app successfully. There were no errors during resource creation, and the output is similar to the author's output.
The problem is that I do not understand ./AwsServerlessTemplate/NewAppTemplate/lambda or how it works. I managed to work out the settings by looking at other applications.
Steps taken to execute the application
Output observations and questions:
The one-click command is supposed to delete the CloudFormation stack; instead, the deletion takes forever.
The created instance is empty: when I open and monitor it, I see no execution of the [command] or [bootstrap].
I think there should be a step-by-step explanation of how to choose or set the parameters for deploy_config.json and SampleEvent.json.
Are the steps on the git page the exact steps? Nothing is mentioned about deploy_config.json and SampleEvent.json, and I eventually had to edit both.
Please let me know if any other information is needed on this matter.