CredentialEngine / ai-course-crawler

Apache License 2.0

Unexpected Data in Extract for Forsyth Tech #32

Closed rvilsack closed 2 months ago

rvilsack commented 3 months ago

The Forsyth Tech Bulk Upload file has the following issues.

AI Course Crawler Extract link: https://master.ai-course-crawler.development.c66.me/datasets/courses/16

Extract file in Google Sheets with comparison to expected: https://docs.google.com/spreadsheets/d/1z-8mCAYKzjELVCfZEmTBA4AkBDRxLlmwFTveLIWVh-I/edit?usp=sharing

The information below is applicable to the majority of courses with this data set.

ALL ISSUES

EXPECTED, see highlighted row in the "Comparison to Correct Bulk Upload" tab in this spreadsheet

Rows 2-15 and 25-54: these are not courses listed on the URL provided. (Screenshot 2024-08-23 122927)

Row 18 "Improving Study Skills": all course data is extracted correctly, except that the credit unit type is set to SemesterHour when the URL just lists "Credit".

Here is what was listed for this course in the URL provided to the crawler: (Screenshot 2024-08-23 122639)

Also, prerequisites and corequisites are included on the URL but not in the extract. The following Header Rows could have been extracted:

Condition Profile: Condition Type = Requires
Condition Profile: Name = The following is required before taking this course
Condition Profile: Description = None

Condition Profile: Condition Type = Corequisite
Condition Profile: Name = The following is required to be taken with this course
Condition Profile: Description = None

(This example listed None for requirements and corequisites; other courses had additional data.)
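
As a rough illustration of the expected output, the mapping from a course page's requirement text to Condition Profile header rows might look like the sketch below. This is not the crawler's actual code: the function name `condition_profile_rows` and its parameters are hypothetical, and only the three column names, the two condition types and the two Name strings come from the example above.

```python
from typing import Optional

# Name strings are taken from the expected Header Rows in this issue.
PREREQ_NAME = "The following is required before taking this course"
COREQ_NAME = "The following is required to be taken with this course"


def condition_profile_rows(
    prerequisites: Optional[str], corequisites: Optional[str]
) -> list[dict[str, str]]:
    """Build one Condition Profile row per requirement section found on the page.

    Hypothetical helper: pass None when the page has no such section at all,
    or the literal page text (which may be the string "None") when it does.
    """
    rows = []
    for condition_type, name, text in (
        ("Requires", PREREQ_NAME, prerequisites),
        ("Corequisite", COREQ_NAME, corequisites),
    ):
        if text is None:
            # The section is absent from the page, so no row is emitted.
            continue
        rows.append({
            "Condition Profile: Condition Type": condition_type,
            "Condition Profile: Name": name,
            "Condition Profile: Description": text.strip(),
        })
    return rows
```

For the "Improving Study Skills" example, calling `condition_profile_rows("None", "None")` would produce the two rows listed above, each with a Description of None.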

edgarf commented 2 months ago

@rsaksida - To get an MVP released, let's concentrate on the main fields only. Once we are sure about the rest of the functionality, we can handle prerequisites, corequisites, credit type and other additional fields.

rsaksida commented 2 months ago

Thanks @rvilsack. Addressing the issues one by one:

Unexpected data that is not on URL provided to Course Crawler, including credentials and personal enrichment programs

I believe this was happening because of a problematic recipe that was loading some extra links. I've updated the recipe detection and the new recipe shouldn't pick those up.

That said, with other catalogues in the future the recipe might still extract more than we'd like while being good otherwise. That's because it can be tricky, especially for an LLM, to capture only the course links and nothing else. So we might need to keep that limitation in mind, or come up with strategies to work around it. It's something to think about.
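
One possible workaround, sketched here only as an idea, would be to post-filter whatever links a recipe returns against a per-catalogue URL pattern. Nothing in this sketch exists in the crawler: the function name `filter_course_links`, the regular expression and the example URLs are all hypothetical, and a real catalogue would need its own pattern.

```python
import re


def filter_course_links(links: list[str], course_url_pattern: str) -> list[str]:
    """Keep only links matching the course URL pattern, dropping duplicates.

    Hypothetical helper; the pattern would have to be chosen per catalogue.
    """
    pattern = re.compile(course_url_pattern)
    seen = set()
    kept = []
    for link in links:
        if pattern.search(link) and link not in seen:
            seen.add(link)
            kept.append(link)
    return kept


# Made-up URLs for illustration only.
extracted = [
    "https://catalog.example.edu/courses/ABC-101",
    "https://catalog.example.edu/programs/some-certificate",
    "https://catalog.example.edu/courses/ABC-101",
]
course_links = filter_course_links(extracted, r"/courses/[A-Z]{3}-\d+$")
```

The trade-off is that a pattern that is too strict would silently drop real courses, so any such filter would probably need to log what it discards.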

Missing Prerequisites and Corequisites for course (should be in Condition Profile format)

See #43

Credit Type assumed to be Semester Credit (see https://github.com/CredentialEngine/ai-course-crawler/issues/31)

See #39

Data download has ANSI UTF-8 text encoding issues with export to Excel

See #35