MAPC / myschoolcommute2

Tool to Survey and Analyze Transportation Mode and Behavior
http://masaferoutessurvey.org/
1 stars 3 forks source link

Differential Between R Report and Website Completed Surveys #147

Open mzagaja opened 6 years ago

mzagaja commented 6 years ago

The reported number of completed surveys on the website currently differs from the reported number of completed surveys in the PDF report generated by R. Initially the difference was an undercount. This is due to how the SQL view that we use to calculate the reports in the report generator was discarding surveys that did not have data in the fields for dropoff and pickup which refer to whether folks dropoff or pickup their child on the way to another activity. This question is only relevant and only appears in the user interface if the student is dropped off in a family vehicle. Otherwise, it is always empty. The result was that students choosing other transit modes were not be included in the report. We fix this by updating the SQL for this view in the forthcoming pull request.

The second component is after fixing the SQL is figuring out why there is now an overcount of 7 surveys in the R report versus the website. Further research revealed that 3 reports counted on the website were actually blank. So the real distance was 10. The goal is to figure out where and how we are getting a count of 10 bonus surveys.

One attempted solution was blanking empty string cells with the following SQL:

UPDATE
    TableName
SET
    grade_1 = CASE grade_1 WHEN '' THEN NULL ELSE grade_1 END,
    dropoff_1 = CASE dropoff_1 WHEN '' THEN NULL ELSE dropoff_1 END,
    to_school_1 = CASE to_school_1 WHEN '' THEN NULL ELSE to_school_1 END,
    from_school_1 = CASE from_school_1 WHEN '' THEN NULL ELSE from_school_1 END,
    grade_2 = CASE grade_2 WHEN '' THEN NULL ELSE grade_2 END,
    dropoff_2 = CASE dropoff_2 WHEN '' THEN NULL ELSE dropoff_2 END,
    to_school_2 = CASE to_school_2 WHEN '' THEN NULL ELSE to_school_2 END,
    from_school_2 = CASE from_school_2 WHEN '' THEN NULL ELSE from_school_2 END

This did not resolve the issue with the report count.

The raw number of rows going into the R report after blanking empty cells and deleting the three missing surveys is 132 (previously 141). This is seen by calling the following SQL:

SELECT melted.survey_response_id,
  melted.survey_id,
  melted.distance,
  ST_ASTEXT(melted.geometry),
  melted.created,
  melted.modified,
  melted.shed,
  melted.dropoff,
  melted.from_school,
  melted.grade,
  melted.pickup,
  melted.to_school,
  melted.school_id,
  melted.nr_licenses,
  melted.nr_vehicles,
  melted.schid
 FROM ( SELECT survey_responses.id AS survey_response_id,
          survey_responses.survey_id,
          survey_responses.distance,
          survey_responses.geometry,
          survey_responses.created_at AS created,
          survey_responses.updated_at AS modified,
          survey_responses.shed,
          survey_responses.nr_licenses,
          survey_responses.nr_vehicles,
          unnest(ARRAY[survey_responses.dropoff_0, survey_responses.dropoff_1, survey_responses.dropoff_10, survey_responses.dropoff_11, survey_responses.dropoff_12, survey_responses.dropoff_13, survey_responses.dropoff_14, survey_responses.dropoff_15, survey_responses.dropoff_16, survey_responses.dropoff_17, survey_responses.dropoff_18, survey_responses.dropoff_19, survey_responses.dropoff_2, survey_responses.dropoff_3, survey_responses.dropoff_4, survey_responses.dropoff_5, survey_responses.dropoff_6, survey_responses.dropoff_7, survey_responses.dropoff_8, survey_responses.dropoff_9]) AS dropoff,
          unnest(ARRAY[survey_responses.from_school_0, survey_responses.from_school_1, survey_responses.from_school_10, survey_responses.from_school_11, survey_responses.from_school_12, survey_responses.from_school_13, survey_responses.from_school_14, survey_responses.from_school_15, survey_responses.from_school_16, survey_responses.from_school_17, survey_responses.from_school_18, survey_responses.from_school_19, survey_responses.from_school_2, survey_responses.from_school_3, survey_responses.from_school_4, survey_responses.from_school_5, survey_responses.from_school_6, survey_responses.from_school_7, survey_responses.from_school_8, survey_responses.from_school_9]) AS from_school,
          unnest(ARRAY[survey_responses.grade_0, survey_responses.grade_1, survey_responses.grade_10, survey_responses.grade_11, survey_responses.grade_12, survey_responses.grade_13, survey_responses.grade_14, survey_responses.grade_15, survey_responses.grade_16, survey_responses.grade_17, survey_responses.grade_18, survey_responses.grade_19, survey_responses.grade_2, survey_responses.grade_3, survey_responses.grade_4, survey_responses.grade_5, survey_responses.grade_6, survey_responses.grade_7, survey_responses.grade_8, survey_responses.grade_9]) AS grade,
          unnest(ARRAY[survey_responses.pickup_0, survey_responses.pickup_1, survey_responses.pickup_10, survey_responses.pickup_11, survey_responses.pickup_12, survey_responses.pickup_13, survey_responses.pickup_14, survey_responses.pickup_15, survey_responses.pickup_16, survey_responses.pickup_17, survey_responses.pickup_18, survey_responses.pickup_19, survey_responses.pickup_2, survey_responses.pickup_3, survey_responses.pickup_4, survey_responses.pickup_5, survey_responses.pickup_6, survey_responses.pickup_7, survey_responses.pickup_8, survey_responses.pickup_9]) AS pickup,
          unnest(ARRAY[survey_responses.to_school_0, survey_responses.to_school_1, survey_responses.to_school_10, survey_responses.to_school_11, survey_responses.to_school_12, survey_responses.to_school_13, survey_responses.to_school_14, survey_responses.to_school_15, survey_responses.to_school_16, survey_responses.to_school_17, survey_responses.to_school_18, survey_responses.to_school_19, survey_responses.to_school_2, survey_responses.to_school_3, survey_responses.to_school_4, survey_responses.to_school_5, survey_responses.to_school_6, survey_responses.to_school_7, survey_responses.to_school_8, survey_responses.to_school_9]) AS to_school,
          schools.id AS school_id,
          schools.schid
         FROM survey_responses
         INNER JOIN surveys ON survey_responses.survey_id = surveys.id
         INNER JOIN schools ON surveys.school_id = schools.id
        ORDER BY survey_responses.id) melted
WHERE melted.from_school IS NOT NULL AND melted.grade IS NOT NULL AND melted.to_school IS NOT NULL AND survey_id = 34671;
mzagaja commented 6 years ago

After investigating further I have learned the following:

  1. The generate_report.R script loads the R environment by referencing load(.RData.R) which may or may not contain arbitrary functions and variables that are available to generate_report.R. There are functions in generate_report.R that we could not initially use/debug when loading into R studio like get_dates because they were not otherwise loaded/available in the environment.

  2. Enrollment information is contained in a static CSV file enrollment_15_16.csv which is used for generating the R report. There are also some date specific functions like get_enrollment_df that have a large number of conditionals and tap out after 15_16, effectively creating a Y2K style bug in the app.

A larger scope of work is needed to refactor and debug the R script going forward.