Peer Review: Lung Cancer Prediction
Reviewer Name: Viswateja Adothi
IMPORTANT: As a reviewer, you will need to clone the group's GitHub repo and attempt to reproduce the whole data science pipeline. Document any issues as you come across them so they can help the group improve later on.
1. Project Structure (25%)
Repository Organization:
The project structure is quite good.
Suggestions for improvement:
The repository could be better organized by moving all the text files into a dedicated folder. Instead of committing the credentials JSON, its required details could be documented in the README, for example as sketched below.
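A minimal sketch of what this could look like; the variable names MONGO_USER, MONGO_PASSWORD, and MONGO_HOST are assumptions for illustration, not taken from the group's repo. The README would list which values the reviewer has to set, and the code would read them from the environment instead of a committed credentials file:

```python
# Hypothetical sketch: read MongoDB credentials from environment variables
# documented in the README instead of committing a credentials JSON file.
# MONGO_USER, MONGO_PASSWORD, and MONGO_HOST are assumed variable names.
import os
from pymongo import MongoClient

uri = "mongodb+srv://{user}:{password}@{host}".format(
    user=os.environ["MONGO_USER"],
    password=os.environ["MONGO_PASSWORD"],
    host=os.environ["MONGO_HOST"],
)
client = MongoClient(uri)
```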
Naming Conventions:
Only a few files follow the snake_case naming format.
Suggestions for improvement:
Follow the snake_case naming format consistently for all files.
README File:
The README is very well formatted, and all the steps are included.
Suggestions for improvement:
There are two README files, which may lead to confusion; consider consolidating them and adding the credentials JSON details to the README. The README could also mention the other platforms used in the project, such as Databricks and MongoDB.
2. Code Quality (25%)
Code Readability and Structure:
Feedback on the clarity, conciseness, and formatting of the code.
Code clarity, formatting, and conciseness are good.
Suggestions for improvement:
Adding notebook files in addition to the script files would provide more clarity about the code.
Modularization:
The organization of the code into distinct scripts and functions is quite good.
Suggestions for improvement:
Consider breaking down larger functions into smaller, reusable ones to enhance modularity.
Version Control:
Feedback on the use of Git version control, including commit frequency and documentation.
The commit frequency is quite good, and the documentation is well structured.
Suggestions for improvement:
Increasing the frequency of commits would help maintain better documentation and track changes more efficiently.
3. Reproducible Environment (25%)
You will need to try to re-run the whole pipeline
Environment Setup:
The README contains all the steps and is easy to understand.
Suggestions for improvement:
None; everything is good.
Reproducibility Test:
Were you able to reproduce the pipeline?
[x] YES
[ ] NO
What issues (if any) you came across when trying to reproduce their pipeline?
None
4. Application of MongoDB and PySpark (25%)
MongoDB Usage:
MongoDB is used to connect to the database, but I did not find any preprocessing or cleaning steps performed with MongoDB.
Suggestions for improvement:
Consider using MongoDB for some of the preprocessing and data cleaning steps, as in the sketch below.
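A minimal sketch of what this could look like with a MongoDB aggregation pipeline; the database, collection, and field names (lung_cancer, patients, AGE, SMOKING, LUNG_CANCER) are assumptions for illustration, not taken from the group's code:

```python
# Hypothetical sketch: clean the raw records with a MongoDB aggregation
# pipeline before modeling. Names below are assumed, not from the repo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
collection = client["lung_cancer"]["patients"]

pipeline = [
    # Drop documents with missing key fields
    {"$match": {"AGE": {"$ne": None}, "SMOKING": {"$ne": None}}},
    # Encode the target label as 0/1
    {"$addFields": {
        "LUNG_CANCER": {"$cond": [{"$eq": ["$LUNG_CANCER", "YES"]}, 1, 0]}
    }},
    # Write the cleaned documents to a new collection for the next stage
    {"$out": "patients_clean"},
]
collection.aggregate(pipeline)
```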
PySpark Usage:
I could not find any PySpark files committed to the repository.
Suggestions for improvement:
Consider using PySpark for some of the preprocessing and transformation steps, for example along the lines of the sketch below.
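A minimal sketch, assuming a CSV input and column names (AGE, SMOKING, LUNG_CANCER) that may differ from the group's actual data:

```python
# Hypothetical sketch: basic cleaning and label encoding with PySpark.
# The file path and column names are assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lung_cancer_preprocessing").getOrCreate()

# Load the raw data (path is assumed)
df = spark.read.csv("data/lung_cancer.csv", header=True, inferSchema=True)

df_clean = (
    df.dropDuplicates()
      .na.drop(subset=["AGE", "SMOKING"])  # drop rows missing key fields
      .withColumn(                         # encode the label as 0/1
          "LUNG_CANCER",
          F.when(F.col("LUNG_CANCER") == "YES", 1).otherwise(0),
      )
)

# Persist the cleaned data for the modeling step
df_clean.write.mode("overwrite").parquet("data/lung_cancer_clean.parquet")
```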
5. Rating (Optional)
Overall rating of the project (0 to 100 points):
[Your Rating Here]
End of Review