CredentialEngine / ai-course-crawler

Apache License 2.0

CourseLeaf Round 2 Testing - Missing Credit Values + Incorrect Credit Unit Types #53

Open rvilsack opened 1 month ago

rvilsack commented 1 month ago

I tested 2 CourseLeaf catalogs; this was the second round of testing.

South Texas College

URL: https://catalog.southtexascollege.edu/courses/

Link to output file: https://docs.google.com/spreadsheets/d/1p0EZ23zV-I8qPST0MBv9sU6SK7McXV2C/edit?usp=sharing&ouid=115685232190749733039&rtpof=true&sd=true

Number of courses looks good

Data looks good

ISSUE: missing credit values (~3% of records), incorrect credit value type, and a small number of truncated course descriptions

Example: no credit values included in extract, but listed in catalog (https://catalog.southtexascollege.edu/courses/rbtc/)

image

Example: incorrect credit value type, nothing on page suggests semester (https://catalog.southtexascollege.edu/courses/math/)

image

These issues seem to be isolated to full sets of courses under a heading (RBTC, MATH, etc.)

There are also a few course descriptions that are truncated: image

Here is what appears on the page for these courses: image

Delaware Community College

URL: https://catalog.dccc.edu/courses/course-descriptions/

Link to output file: https://docs.google.com/spreadsheets/d/1mr3Aqjr3hw0p5ScvfV-rX90mnmSeQgoV/edit?usp=sharing&ouid=115685232190749733039&rtpof=true&sd=true

Number of courses looks good

Data looks good

ISSUE: missing credit values (17% of records), incorrect credit value type, and truncated course descriptions. I'm not providing screenshots, since the issues are exactly the same as above, but the output file has some examples highlighted.

If there is a pattern to the missing credit values or truncated descriptions, I haven't found it.
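One way to look for a pattern is to count missing credit values per subject heading, since the South Texas College gaps appeared to cover whole headings (RBTC, MATH, etc.). The following is only a sketch: the field names `course_id` and `credit_value` are hypothetical stand-ins for the output file's columns, and the sample rows are made up for illustration, not taken from either output file.

```python
from collections import defaultdict


def missing_credit_counts_by_subject(rows):
    """Return {subject: (missing, total)} for a list of course dicts.

    Assumes each course ID starts with its subject code, e.g. "RBTC 1000".
    The keys "course_id" and "credit_value" are hypothetical names.
    """
    counts = defaultdict(lambda: [0, 0])
    for row in rows:
        parts = str(row.get("course_id", "")).split()
        subject = parts[0] if parts else ""
        counts[subject][1] += 1
        value = row.get("credit_value")
        # Treat None and blank strings as missing; a literal 0 is kept.
        if value is None or str(value).strip() == "":
            counts[subject][0] += 1
    return {subject: (m, t) for subject, (m, t) in counts.items()}


# Made-up rows, for illustration only.
sample_rows = [
    {"course_id": "RBTC 1000", "credit_value": ""},
    {"course_id": "RBTC 1001", "credit_value": None},
    {"course_id": "MATH 1000", "credit_value": "4"},
]

for subject, (missing, total) in sorted(
    missing_credit_counts_by_subject(sample_rows).items()
):
    print(f"{subject}: {missing}/{total} missing credit values")
```

A subject where every record is missing a credit value would point to a page-level parsing problem, while gaps scattered across subjects would suggest something specific to individual course entries.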

rvilsack commented 1 month ago

@rsaksida Re: your slack question

_Can you help me understand how the credits are supposed to be parsed?_

_CRT HRS:4 LEC HRS:4 LAB HRS:1 OTH HRS:0_

_I assume this means: Credit hrs - 4, Lecture hrs - 4, Lab hrs - 1, Other hrs - 0._

_How would this translate to the 4 columns we're using: Credit Unit Value, Credit Unit Max Value, Credit Unit Type, Credit Unit Type Description?_

All of the MATH courses (rows 547-563 in the download file) were assigned Credit Unit Type = semester hour, which was not the case for other courses. (Nothing on the MATH URL suggests semester, which is what I thought you'd adjusted the crawler to look for.)

See a contrasting example, for instance row 546: image

Here is what the URL for that course displays: image

The data is displayed similarly in these two examples, yet row 546 was parsed correctly:

Credit Unit Value = 4

Credit Unit Max Value = (blank)

Credit Unit Type = (blank)

Credit Unit Type Description = This has credit value, but the type cannot be determined

I expected all of the MATH courses (rows 547-563) to be parsed the same way.
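The expected mapping can be sketched in a few lines. This is not the crawler's actual code; it only illustrates the behavior described above, under two assumptions: the CRT figure is the credit value, and the type is set to semester hour only when the page text explicitly mentions "semester". The function name `parse_credit_columns` and the `page_text` parameter are hypothetical.

```python
import re

# Matches labelled hour figures such as "CRT HRS:4" or "LAB HRS:1".
HOURS_PATTERN = re.compile(r"(CRT|LEC|LAB|OTH)\s+HRS\s*:\s*(\d+(?:\.\d+)?)")

UNDETERMINED = "This has credit value, but the type cannot be determined"


def parse_credit_columns(hours_text, page_text=""):
    """Map a 'CRT HRS:4 LEC HRS:4 ...' string to the four credit columns.

    Assumption: only the CRT (credit hours) figure feeds Credit Unit Value;
    lecture, lab, and other hours are ignored here.
    """
    hours = {label: float(v) for label, v in HOURS_PATTERN.findall(hours_text)}
    credit = hours.get("CRT")
    columns = {
        "Credit Unit Value": credit,
        "Credit Unit Max Value": None,
        "Credit Unit Type": None,
        "Credit Unit Type Description": None,
    }
    if credit is None:
        # No credit figure found, so leave every column empty.
        return columns
    if "semester" in page_text.lower():
        columns["Credit Unit Type"] = "semester hour"
    else:
        # Nothing on the page names the unit, so do not guess a type.
        columns["Credit Unit Type Description"] = UNDETERMINED
    return columns


row = parse_credit_columns("CRT HRS:4 LEC HRS:4 LAB HRS:1 OTH HRS:0")
for name, value in row.items():
    print(f"{name} = {value}")
```

With no mention of "semester" in the page text, this leaves Credit Unit Type empty and fills in the description, which matches how row 546 was parsed.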