HakaiInstitute / GEM-in-a-box-dataset-repository-template

Creative Commons Attribution 4.0 International

Potential GitHub Actions to add to repo #4

Open JessyBarrette opened 2 months ago

JessyBarrette commented 2 months ago

Ideally, this repo would have a GitHub Action that runs on push to main and performs the following tests:

Create an issue for each failed check.

timvdstap commented 2 months ago

So, the proposed actions / checks to run when merging survey-date-branch into main:

timvdstap commented 1 month ago

Tagging the Science Team to check whether there are any checks you would like to see implemented, or to help prioritize the checks listed above: @jdelbel @hakaidrew @CarrieWeekes @naomiboon7

fostermh commented 1 month ago

A minimal example to check that a date is in the correct format and that the survey station exists in stations.csv. The global variables are poor coding practice.

import unittest

import pandas as pd

# read in csv as pandas dataframes
survey_final_df = pd.read_csv(
    './data/2024-09-16_example_dataset/2024-09-16_survey_final.csv', sep=',')
stations_df = pd.read_csv('./stations.csv', sep=',')

class TestSurveyFinal(unittest.TestCase):
    def test_survey_date(self):
        # Iterate over the rows and assert that each survey date is in the correct format.
        # The first non-matching date fails the test.
        for index, row in survey_final_df.iterrows():
            survey_date = str(row['Survey Date'])
            self.assertRegex(
                survey_date,
                r'^\d{4}-(0?[1-9]|1[012])-(0?[1-9]|[12][0-9]|3[01])$',
                f'Survey Date of {survey_date} does not match YYYY-MM-DD format'
            )
    def test_survey_stations(self):
        # find all survey stations that do not have a match in stations.csv
        joined_df = survey_final_df.merge(stations_df, 'left',
                              left_on='Station', right_on='station_id')
        # Cast to str so the names can be joined in the failure message.
        missing_stations = sorted(
            set(
                joined_df[joined_df['station_id'].isnull()].Station.astype(str)
            )
        )
        count = len(missing_stations)
        self.assertEqual(count, 0, f"Survey stations missing from stations.csv: {', '.join(missing_stations)}")

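# Only run the tests when this file is executed directly, not when it is imported.
if __name__ == '__main__':
    unittest.main()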
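One possible way around the global variables mentioned above is to put each check in a function that takes the DataFrames as arguments, so the same logic can be called from a test, a script, or an action. This is only a sketch: the function names are hypothetical, and the column names (`Survey Date`, `Station`, `station_id`) and the date pattern are taken from the example above.

```python
import pandas as pd

# Same pattern as in the example above: YYYY-MM-DD, leading zeros optional.
DATE_PATTERN = r'^\d{4}-(0?[1-9]|1[012])-(0?[1-9]|[12][0-9]|3[01])$'


def invalid_survey_dates(survey_df):
    """Return the 'Survey Date' values that do not match DATE_PATTERN."""
    dates = survey_df['Survey Date'].astype(str)
    return dates[~dates.str.match(DATE_PATTERN)].tolist()


def missing_stations(survey_df, stations_df):
    """Return the sorted survey stations that have no match in stations_df."""
    unmatched = ~survey_df['Station'].isin(stations_df['station_id'])
    return sorted(survey_df.loc[unmatched, 'Station'].astype(str).unique())


# Usage with small in-memory frames instead of the CSV files.
survey = pd.DataFrame({
    'Survey Date': ['2024-09-16', '16/09/2024'],
    'Station': ['A1', 'Z9'],
})
stations = pd.DataFrame({'station_id': ['A1', 'B2']})

print(invalid_survey_dates(survey))
print(missing_stations(survey, stations))
```

Because both functions return the offending values rather than asserting, a test can check that each list is empty and report everything that is wrong in one run.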