DSCI-310-2024 / data-analysis-review-2024


Submission: Group 16: Stellar Classification Predictor #16

ttimbers opened this issue 6 months ago

ttimbers commented 6 months ago

Submitting authors: Aron Bahram, Olivia Lam, Lucy Liu, and Viet Ngo

Repository: https://github.com/DSCI-310-2024/DSCI310-Group16-Stellar_Classification/releases/tag/v0.1.5

Abstract/executive summary:

Our project looks towards the skies to classify stars into their spectral types according to their electromagnetic radiation magnitudes. Our goal is to expand our understanding of stars through their five radiation band types, and to explore how data analysis can further our knowledge beyond our galaxy through the study of photometry, the dynamics of celestial bodies, and stellar interactions. Our research draws on a data set of planetary systems from NASA’s Exoplanet Archive. Our simple categorization of stars may seem small, but it contributes to the bigger pursuit of celestial research and perhaps even planetary exploration.

Editor: @ttimbers

Reviewers: Anshnoor Kaur, Oliver Gullery, An Zhou, Xander Dawson

anshnoorkaur commented 6 months ago

Data analysis review checklist

Reviewer: anshnoorkaur

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 2

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. Many of the figure and table references in the rendered PDF report show up as “Figure ??” and “Table ??”. These cross-references need to be fixed so they resolve to the correct figure and table numbers.
  2. I tried to reproduce the analysis using the “Reproducing the results in a docker container” section of the README but was unable to do so. Step 2 of the instructions is missing a “.” at the end, and even after adding it the build still failed on my end. It might just be my machine (macOS), but it is worth looking into. A little more documentation telling the reader to move into the project directory after step 1 would also be useful! Here is a snippet of the error (a possible fix is sketched after this list):

Reading state information... Done

Dockerfile:34

32 | RUN curl -o quarto-linux-amd64.deb -L https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-amd64.deb
33 | RUN apt-get install gdebi-core -y
34 | >>> RUN gdebi quarto-linux-amd64.deb --non-interactive
35 | # install TeX for quarto
36 | RUN quarto install tinytex

ERROR: failed to solve: process "/bin/sh -c gdebi quarto-linux-amd64.deb --non-interactive" did not complete successfully: exit code: 1

  3. Some improvements could also be made to the documentation within the functions and tests to ensure a consistent documentation approach: I saw more than one style of docstring formatting across the function and test files, as well as some missing documentation in those same files (a sketch of one consistent style follows this list).
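
For the Docker failure in point 2, a possible fix, offered only as a sketch (and assuming the gdebi step is failing because it cannot resolve the .deb's dependencies inside the build layer, which is not a confirmed diagnosis), is to refresh the package lists and let apt install the .deb directly, since apt resolves dependencies in the same step:

    # Sketch of a possible Dockerfile fix (untested assumption, not a
    # confirmed diagnosis). The curl step is the project's existing one;
    # only the install line changes.
    RUN curl -o quarto-linux-amd64.deb -L https://github.com/quarto-dev/quarto-cli/releases/download/v${QUARTO_VERSION}/quarto-${QUARTO_VERSION}-linux-amd64.deb
    RUN apt-get update && apt-get install -y ./quarto-linux-amd64.deb
    RUN quarto install tinytex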
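
For the documentation consistency in point 3, one option is to pick a single docstring convention, such as the numpydoc style, and apply it to both the function and test files. A minimal sketch with a hypothetical function (the name and column are illustrative, not taken from the repository):

    import pandas as pd

    def filter_by_spectral_type(df: pd.DataFrame, spectral_type: str) -> pd.DataFrame:
        """Keep only the rows matching a given spectral type.

        Parameters
        ----------
        df : pandas.DataFrame
            Stellar data with a 'spectral_type' column.
        spectral_type : str
            The spectral class to keep (e.g., 'A' or 'G').

        Returns
        -------
        pandas.DataFrame
            The filtered rows.
        """
        # Hypothetical helper for illustration; not taken from the repo.
        return df[df["spectral_type"] == spectral_type]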

These are just a few improvement points. Overall, great job!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

brico12 commented 5 months ago

Data analysis review checklist

Reviewer: brico12

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. For the function documentation, the functions are clearly shown and documented in the code. However, what the analysis's functions can do could be presented more intuitively to users, for example by adding a new section to the README file (a sketch follows this list).
  2. The report is written in concise, comprehensible language and offers the authors' insights into the results of the analysis. However, the structure and content of the analysis do not quite satisfy all the requirements. The discussion section would benefit from more in-depth analysis: as it stands, it leans too heavily on describing the results rather than presenting the authors' own voice. A conclusion section is also missing.
  3. The authors did a fantastic job with code readability. They make it easy for readers to understand the purpose of the analytical code by clearly commenting the code's functions. This is very helpful for achieving a trustworthy workflow.
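
On point 1, that new README section might be as simple as a bulleted overview of what each function does. The sketch below uses purely hypothetical names to show the shape; they are not the repository's actual functions:

    Functions

    - load_data(path): read the raw NASA Exoplanet Archive data set
    - clean_data(df): drop missing values and encode the spectral types
    - plot_radiation_bands(df): produce EDA plots of the five radiation band magnitudes
    - fit_classifier(X, y): train and evaluate the spectral-type classifier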

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

Ollie-Gullery commented 5 months ago

Data analysis review checklist

Reviewer: ollie-gullery

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 1

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

  1. Firstly, the report is really well done; the language is concise and easy to follow, and it breaks the topic down to help readers understand it. The definitions section in particular helped me understand the purpose of the report much better, as I was initially confused by some of the topics.
  2. The formatting of the report is really strong overall. However, some of the numbers and headings in the tables are long enough that they overlap, which makes those tables difficult to follow (Tables 3 and 5 of the PDF specifically). One way to address this, sketched after this list, would be to round every number in the tables to four significant figures; that would improve consistency and reduce the overlapping!
  3. The code was really well written; in particular, the comments above the different sections of the code help significantly with readability. This was done especially well in the data_eda.py file, where docstrings explain the functionality of each function in addition to the section comments. However, most of the test files do not replicate this level of depth. Adding docstrings and comments there would really help the overall readability of the tests and make it clearer why you chose the tests you did (see the pytest sketch after this list)!
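
On the significant-figures suggestion in point 2, here is a minimal sketch of one way to do it, assuming the tables come out of pandas (the column names and values are made up for illustration):

    import pandas as pd

    # Made-up table for illustration; not the report's actual values.
    results = pd.DataFrame({
        "mean_magnitude": [12.3456789, 0.000123456],
        "std_magnitude": [1.23456789, 0.0000654321],
    })

    # '{:.4g}' rounds each float to 4 significant figures, keeping the
    # printed values short and consistent so they are less likely to
    # overlap the column headers.
    print(results.to_string(float_format=lambda x: f"{x:.4g}"))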
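
And on point 3, a hypothetical pytest sketch of what that extra documentation could look like in the test files (the function under test and the data are illustrative, not from the repository):

    import pandas as pd

    # Hypothetical function under test, included so the sketch runs on its own.
    def filter_by_spectral_type(df, spectral_type):
        """Keep only the rows matching a given spectral type."""
        return df[df["spectral_type"] == spectral_type]

    def test_filter_keeps_only_requested_spectral_type():
        """The result should contain the requested class and nothing else."""
        df = pd.DataFrame({"spectral_type": ["A", "G", "A"],
                           "magnitude": [1.0, 2.0, 3.0]})
        result = filter_by_spectral_type(df, "A")
        # Both 'A' rows survive and the 'G' row is dropped.
        assert set(result["spectral_type"]) == {"A"}
        assert len(result) == 2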

Regarding the “N” I gave for automation: the docker build from the instructions on the GitHub page did not work for me and returned an error!

Great job overall though guys!

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.

adaws01 commented 5 months ago

Data analysis review checklist

Reviewer: adaws01

Conflict of interest

Code of Conduct

General checks

Documentation

Code quality

Reproducibility

Analysis report

Estimated hours spent reviewing: 3

Review Comments:

Please provide more detailed feedback here on what was done particularly well, and what could be improved. It is especially important to elaborate on items that you were not able to check off in the list above.

Attribution

This was derived from the JOSE review checklist and the ROpenSci review checklist.