jeff-grann closed this issue 2 months ago
Thanks @jeff-grann. I believe this issue was caused by an extraction that went catastrophically wrong, possibly due to recipe issues (fixed in #32 and others) and performance issues (fixed in #42). After redetecting the recipe and running a new extraction, the data now appears to be extracted correctly. Let's reopen this issue if we spot the same problem again.
@rsaksida how does this fix explain the fact that the language of the extracted course description does not exist anywhere, including the Ivy Tech catalog?
Hey @debeverhart. If you check out that particular extraction on development (here), you'll see it downloaded a lot of pages that shouldn't be part of the extraction. I think the LLM was instructed to extract data that wasn't there, and hallucinated a response based on examples in the prompt as well as things it found in the page. In addition, at the time there was a performance issue with the app that caused a lot of weird DB locking problems.
I've run numerous extractions for Ivy Tech and I don't recall seeing anything like that (you can see one here). That said, LLMs hallucinate - it can happen. In a previous call, I suggested we add a feature to check whether the extracted data is fully or partially present in the source document. That's one way to identify this kind of problem.
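To make the idea concrete, here is a rough sketch of what such a check could look like. This is not the app's implementation: the function names, the word 3-gram overlap approach, the 0.5 threshold, and the sample strings are all assumptions made up for illustration. It normalizes both texts to lowercase word tokens, then measures what fraction of the extracted text's word 3-grams also appear in the source page.

```python
import re


def _tokens(text):
    """Lowercase word tokens, so whitespace and punctuation differences are ignored."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _shingles(tokens, n):
    """Set of consecutive n-word windows (empty if there are fewer than n tokens)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def presence_ratio(extracted, source, n=3):
    """Fraction of the extracted text's word n-grams that also occur in the source."""
    ext = _tokens(extracted)
    if not ext:
        return 0.0
    n = min(n, len(ext))  # very short extractions fall back to shorter windows
    ext_shingles = _shingles(ext, n)
    src_shingles = _shingles(_tokens(source), n)
    return len(ext_shingles & src_shingles) / len(ext_shingles)


def classify(extracted, source, partial_threshold=0.5):
    """Label an extraction as 'full', 'partial', or 'absent' (threshold is a guess)."""
    ratio = presence_ratio(extracted, source)
    if ratio == 1.0:
        return "full"
    if ratio >= partial_threshold:
        return "partial"
    return "absent"


# Invented sample text, not taken from any real catalog page.
source_page = (
    "AGRI 101 Introduction to Agriculture. "
    "Surveys the agriculture industry and career options."
)

verbatim = "Surveys the agriculture industry and career options."
derived = "Surveys the agriculture industry and modern farming technology."
invented = "Students explore quantum robotics in marine environments."

print(classify(verbatim, source_page))
print(classify(derived, source_page))
print(classify(invented, source_page))
```

An "absent" or low "partial" result would not prove a hallucination, since the model may legitimately paraphrase, but it could flag extractions for manual review. The threshold would need tuning against real extractions.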
Thanks for the useful additional information. What's the status of working on the feature to check the extracted data against the source document? Did we prioritize that? I think that would be a very useful confidence builder for everyone, especially when we introduce the tool to decision-makers like university registrars.
@debeverhart We haven't started on it - I believe it hasn't been prioritized (at least it was unclear to me whether I should work on it at all). I agree it would be a great confidence builder.
I've added #54 to keep track of the text verification feature.
The description seems to be a derived version of the actual description. Example: AGRI 101 course description from crawler
AGRI 101 course description from website https://catalog.ivytech.edu/preview_course_nopop.php?catoid=5&coid=15386