CredentialEngine / ai-course-crawler

Apache License 2.0

Course description does not match website description #36

Closed jeff-grann closed 2 months ago

jeff-grann commented 2 months ago

The description seems to be derived from the actual description rather than copied from it. Example: AGRI 101 course description from the crawler: [Screenshot 2024-08-27 130016]

AGRI 101 course description from the website (https://catalog.ivytech.edu/preview_course_nopop.php?catoid=5&coid=15386): [Screenshot 2024-08-27 130041]

rsaksida commented 2 months ago

Thanks @jeff-grann. I believe this issue was caused by an extraction that went catastrophically wrong, possibly due to recipe issues (fixed in #32 and others) and performance issues (fixed in #42). After redetecting the recipe and running a new extraction, the data appears to be extracted correctly. Let's reopen this issue if we spot the same problem again.

debeverhart commented 1 month ago

@rsaksida how does this fix explain the fact that the language of the extracted course description does not exist anywhere, including the Ivy Tech catalog?

rsaksida commented 1 month ago

Hey @debeverhart. If you check out that particular extraction on development (here), you'll see it downloaded a lot of pages that shouldn't be part of the extraction. I think the LLM was instructed to extract data that wasn't there, and hallucinated a response based on examples in the prompt as well as things it found in the page. In addition, at the time there was a performance issue with the app that caused a lot of weird DB locking problems.

I've run numerous extractions for Ivy Tech and I don't recall seeing anything like that (you can see one here). That said, LLMs hallucinate - it can happen. In a previous call, I suggested we add a feature to check whether the extracted data is fully or partially present in the source document. That's one way to identify this kind of problem.
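
To make the idea concrete, here is a rough sketch of what such a presence check could look like. It is illustrative only and not code from the crawler: the function names (`normalize`, `presence_score`, `classify`), the 4-word shingle size, and the 0.5 threshold are all assumptions picked for the example. It normalizes both texts, then measures what fraction of the extracted text's word sequences appear verbatim in the source page.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every non-alphanumeric run into one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def presence_score(extracted: str, source: str, shingle_size: int = 4) -> float:
    """Fraction of word shingles from `extracted` found verbatim in `source`.

    `shingle_size` is a hypothetical default, not a tuned value.
    """
    words = normalize(extracted).split()
    if not words:
        return 0.0
    # Pad with spaces so shingles only match on whole-word boundaries.
    padded_source = f" {normalize(source)} "
    size = min(shingle_size, len(words))
    shingles = [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]
    hits = sum(1 for shingle in shingles if f" {shingle} " in padded_source)
    return hits / len(shingles)


def classify(extracted: str, source: str, partial_threshold: float = 0.5) -> str:
    """Label extracted text as 'full', 'partial', or 'missing' in the source.

    `partial_threshold` is an assumed cutoff for illustration.
    """
    extracted_norm = normalize(extracted)
    if extracted_norm and f" {extracted_norm} " in f" {normalize(source)} ":
        return "full"  # the whole normalized text appears contiguously
    if presence_score(extracted, source) >= partial_threshold:
        return "partial"
    return "missing"


# Made-up page text for demonstration; not the real catalog entry.
page = "Course: AGRI 101. Introduces students to plant science, soil science, and animal science."
verbatim = "Introduces students to plant science, soil science, and animal science."
altered = "Introduces students to plant science, soil science, and robotics engineering."
invented = "A broad survey of modern farming careers and technology."
```

With this sample text, `classify(verbatim, page)` returns `"full"`, `classify(altered, page)` returns `"partial"`, and `classify(invented, page)` returns `"missing"`. A real implementation would probably need to work on the text extracted from the downloaded HTML and tolerate small formatting differences, so treat the thresholds as placeholders.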

debeverhart commented 1 month ago

Thanks for the useful additional information. What's the status of working on the feature to check the extracted data against the source document? Did we prioritize that? I think that would be a very useful confidence builder for everyone, especially when we introduce the tool to decision-makers like university registrars.

rsaksida commented 1 month ago

@debeverhart We haven't started on it - I believe it hasn't been prioritized (at least it was unclear to me whether I should work on it at all). I agree it would be a great confidence builder.

rsaksida commented 1 month ago

I've added #54 to keep track of the text verification feature.