educational-technology-collective / morf

The MOOC Replication Framework (MORF)
MIT License

UnicodeDecodeError in reading coursera_course_dates.csv #66

Open edfincham opened 5 years ago

edfincham commented 5 years ago

My setup uses Pandas 0.24.2, so this might not be a problem in other environments.

I hit the error shown further down when executing my rewritten fetch_start_end_date function (this should probably be a separate issue, but the original function raises a file-not-found error because it doesn't look in the /input/course/session directory):

def fetch_start_end_date(course, session_dir, date_csv="coursera_course_dates.csv"):
    """
    Fetch course start and end dates (so the user does not have to specify them directly).
    :param course: course name.
    :param session_dir: input directory.
    :param date_csv: Path to csv of course start/end dates.
    :return: tuple of datetime objects (course_start, course_end)
    """
    date_df = pd.read_csv(
        "{}{}".format(session_dir, date_csv),
        error_bad_lines=False
    ).set_index("course")

    course_start = datetime.strptime(date_df.loc[course].start_date, "%m/%d/%y")
    course_end = datetime.strptime(date_df.loc[course].end_date, "%m/%d/%y")
    return course_start, course_end

Because the CSV is not valid UTF-8, which is the default encoding for pd.read_csv(), this raises the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 38: invalid start byte

I fixed this locally by changing the pd.read_csv() call to include:

date_df = pd.read_csv(
    "{}{}".format(session_dir, date_csv),
    error_bad_lines=False,
    encoding="ISO-8859-1"
).set_index("course")
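For anyone reproducing this, here is a minimal standard-library sketch of why byte 0x97 trips the UTF-8 decoder and what the two likely single-byte encodings do with it. The sample row is made up (the real contents of coursera_course_dates.csv aren't shown here), and the guess that the file came from a Windows tool is only an assumption. ISO-8859-1 maps every byte to a character, so it never fails, but it turns 0x97 into an invisible control character; cp1252 (Windows-1252) maps the same byte to an em dash, which may be what the file's author intended.

```python
# Hypothetical CSV bytes containing the problematic byte 0x97;
# the real file contents are not known from this issue.
raw = b"course,start_date,end_date\nintro\x97stats,01/05/15,03/01/15\n"

# UTF-8 rejects 0x97 as a start byte, matching the reported error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    bad_byte = raw[err.start]  # the byte the decoder choked on

# ISO-8859-1 accepts any byte; 0x97 becomes the C1 control character U+0097.
latin1_text = raw.decode("ISO-8859-1")

# cp1252 maps 0x97 to U+2014 (em dash).
cp1252_text = raw.decode("cp1252")

print(utf8_ok, hex(bad_byte))
print(repr(latin1_text.splitlines()[1]))
print(repr(cp1252_text.splitlines()[1]))
```

Either encoding name can be passed to the encoding argument of pd.read_csv(). If the course names are used as index keys, it is worth checking which decoding matches the names used elsewhere in the pipeline.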