Open rvilsack opened 3 months ago
As with other issues, I was able to improve extraction with some changes to recipe detection.
However, even with a correctly configured recipe, the catalogue URL doesn't yield very good results. The catalogue management system is a little tricky: better course data does exist (see, for example, ACC 2301, which I found using the search page), but an index for that data isn't easy to reach.
Starting from the link I found by searching the website, I managed to find an index here, but I'm not sure whether it's linked to from anywhere.
I added a configuration for the more complete index here and started an extraction, which seems to yield much better results since there's more data to work with.
When that extraction finishes, please check it out - we can keep using this issue.
For the course requisites, see #43.
The Texas A&M International University CSV file has the following issues.
AI Course Crawler Extract link: https://master.ai-course-crawler.development.c66.me/datasets/courses/15
Extract file in Google Sheets with comparison to expected: https://docs.google.com/spreadsheets/d/1T-R2UudsM2KdzZtQHX5Xrvo8y8lr4ON6lz6ZJ7L1sME/edit?usp=sharing
The information below applies to the majority of courses in this data set.
ALL ISSUES
EXPECTED: see the highlighted row in the "Comparison to Correct Bulk Upload" tab in this spreadsheet
Using the ARTS 3330 18th & 19th Century Art course as the example (it's the row highlighted in the Google Sheet linked above)
Expected description: Selected areas of study in the arts of Europe and North America from about 1700 to about 1860. The evolving cultural and economic roles of art, artists, and audiences in the modern era will provide an organizing theme.
Expected URL: https://catalog.tamiu.edu/course-descriptions/arts/
Expected Condition Profile with ConditionProfile: Description: ENGL 1302 or consent of instructor.
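In case it helps when checking the next extraction, here is a rough sketch of how the expected values above could be compared against a row of the extract CSV. The column names (`course_id`, `description`, `url`, `prerequisite`) and the `find_mismatches` helper are assumptions for illustration only; the real extract may use different headers.

```python
import csv
import io

# Expected values for ARTS 3330, copied from this issue.
# The dictionary keys are hypothetical column names, not the crawler's real headers.
EXPECTED = {
    "description": (
        "Selected areas of study in the arts of Europe and North America "
        "from about 1700 to about 1860. The evolving cultural and economic "
        "roles of art, artists, and audiences in the modern era will provide "
        "an organizing theme."
    ),
    "url": "https://catalog.tamiu.edu/course-descriptions/arts/",
    "prerequisite": "ENGL 1302 or consent of instructor.",
}


def find_mismatches(csv_text, course_id, expected):
    """Return {field: (actual, expected)} for every field that differs.

    Raises KeyError if no row matches course_id.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if (row.get("course_id") or "").strip() != course_id:
            continue
        mismatches = {}
        for field, want in expected.items():
            # Missing or empty cells are treated as an empty string.
            actual = (row.get(field) or "").strip()
            if actual != want:
                mismatches[field] = (actual, want)
        return mismatches
    raise KeyError(course_id)
```

Calling `find_mismatches(csv_text, "ARTS 3330", EXPECTED)` on the contents of the extract file would return only the fields that differ from the expected values, so an empty result means the row matches.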
Here is what was listed for this course in the URL provided to the crawler:
Here is what was listed for this course in the URL included as InCatalog for ARTS 3330: