Peer Review: Lung Cancer Prediction
Reviewer Name: Viswateja Adothi
IMPORTANT: As a reviewer, you will need to clone the group's GitHub repo and attempt to reproduce the whole data science pipeline. Document any issues as you come across them so they can help the group improve later on.
1. Project Structure (25%)
Repository Organization:
The project structure is quite good.
Suggestions for improvement:
The repository could be better organized by moving all the text files into a dedicated folder. Instead of committing the credentials JSON, its required details could be documented in the README, for example as sketched below.
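A minimal sketch of what this could look like; the variable names MONGO_USER, MONGO_PASSWORD, and MONGO_HOST are assumptions for illustration, not taken from the group's repo. The README would list which values the reviewer has to set, and the code would read them from the environment instead of a committed credentials file:

```python
# Hypothetical sketch: read MongoDB credentials from environment variables
# documented in the README instead of committing a credentials JSON file.
# MONGO_USER, MONGO_PASSWORD, and MONGO_HOST are assumed variable names.
import os
from pymongo import MongoClient

uri = "mongodb+srv://{user}:{password}@{host}".format(
    user=os.environ["MONGO_USER"],
    password=os.environ["MONGO_PASSWORD"],
    host=os.environ["MONGO_HOST"],
)
client = MongoClient(uri)
```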
Naming Conventions:
Only a few files follow the snake_case naming format.
Suggestions for improvement:
Follow the snake_case naming format consistently for all files.
README File:
The README is very well formatted, and all the steps are included.
Suggestions for improvement:
There are two README files, which may lead to confusion; consider consolidating them and adding the credentials JSON details to the README. The README could also mention the other platforms used in the project, such as Databricks and MongoDB.
2. Code Quality (25%)
Code Readability and Structure:
Feedback on the clarity, conciseness, and formatting of the code.
Code clarity, formatting, and conciseness are good.
Suggestions for improvement:
Adding notebook files in addition to the script files would provide more clarity about the code.
Modularization:
The organization of the code into distinct scripts and functions is quite good.
Suggestions for improvement:
Consider breaking down larger functions into smaller, reusable ones to enhance modularity.
Version Control:
Feedback on the use of Git version control, including commit frequency and documentation.
The commit frequency is quite good, and the documentation is well structured.
Suggestions for improvement:
Increasing the frequency of commits would help maintain better documentation and track changes more efficiently.
3. Reproducible Environment (25%)
You will need to try to re-run the whole pipeline
Environment Setup:
The README contains all the steps and is easy to understand.
Suggestions for improvement:
None; everything is good.
Reproducibility Test:
Were you able to reproduce the pipeline?
[x] YES
[ ] NO
What issues (if any) you came across when trying to reproduce their pipeline?
None
4. Application of MongoDB and PySpark (25%)
MongoDB Usage:
MongoDB is used to connect to the database, but I did not find any preprocessing or cleaning steps performed with MongoDB.
Suggestions for improvement:
Consider using MongoDB for some of the preprocessing and data cleaning steps, as in the sketch below.
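A minimal sketch of what this could look like with a MongoDB aggregation pipeline; the database, collection, and field names (lung_cancer, patients, AGE, SMOKING, LUNG_CANCER) are assumptions for illustration, not taken from the group's code:

```python
# Hypothetical sketch: clean the raw records with a MongoDB aggregation
# pipeline before modeling. Names below are assumed, not from the repo.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
collection = client["lung_cancer"]["patients"]

pipeline = [
    # Drop documents with missing key fields
    {"$match": {"AGE": {"$ne": None}, "SMOKING": {"$ne": None}}},
    # Encode the target label as 0/1
    {"$addFields": {
        "LUNG_CANCER": {"$cond": [{"$eq": ["$LUNG_CANCER", "YES"]}, 1, 0]}
    }},
    # Write the cleaned documents to a new collection for the next stage
    {"$out": "patients_clean"},
]
collection.aggregate(pipeline)
```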
PySpark Usage:
I could not find any PySpark files committed to the repository.
Suggestions for improvement:
Consider using PySpark for some of the preprocessing and transformation steps, for example along the lines of the sketch below.
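A minimal sketch, assuming a CSV input and column names (AGE, SMOKING, LUNG_CANCER) that may differ from the group's actual data:

```python
# Hypothetical sketch: basic cleaning and label encoding with PySpark.
# The file path and column names are assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lung_cancer_preprocessing").getOrCreate()

# Load the raw data (path is assumed)
df = spark.read.csv("data/lung_cancer.csv", header=True, inferSchema=True)

df_clean = (
    df.dropDuplicates()
      .na.drop(subset=["AGE", "SMOKING"])  # drop rows missing key fields
      .withColumn(                         # encode the label as 0/1
          "LUNG_CANCER",
          F.when(F.col("LUNG_CANCER") == "YES", 1).otherwise(0),
      )
)

# Persist the cleaned data for the modeling step
df_clean.write.mode("overwrite").parquet("data/lung_cancer_clean.parquet")
```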
5. Rating (Optional)
Overall rating of the project (0 to 100 points):
[Your Rating Here]
End of Review