UTDNebula / api-tools

CLI-based tool which facilitates the scraping, parsing, and uploading of data for Nebula Labs' API.
MIT License

Evaluations Data #6

Open AdamMcAdamson opened 1 year ago

AdamMcAdamson commented 1 year ago

We would like to provide evaluation data as part of our API.

To that end, we need to:

kneevin commented 1 year ago

Hey! I'd like to take on this scraper.

AdamMcAdamson commented 1 year ago

That sounds good to me. How about you, @hochladen?

Also, this issue should probably be moved to https://github.com/UTDNebula/api-tools.

jpahm commented 1 year ago

I'm open to it, though I'll note that this could also be implemented as part of the existing coursebook scraper (along with anything else we may need to pull from coursebook).

I'm not entirely opposed to having this as a separate scraper, but it's a matter of whether the separation of tasks would be worth the added clutter.

AdamMcAdamson commented 1 year ago

Right, I had forgotten that we can pull eval data from the coursebook scraper with the speedup.

kneevin commented 1 year ago

So, should I try to refactor the current scraper so it can scrape the eval data as well, or is it okay to have two separate scrapers? I think it'll make more sense to refactor the current scraper; it'll just take me a bit more time to figure it out.

jpahm commented 1 year ago

Adding it to the current scraper would probably be best, though I still need to push that scraper; it's WIP and stored locally on my PC at the moment. It'll probably be a day or two before I do, since I'm currently in the middle of moving back home.
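
For context, here's a rough sketch of what folding eval scraping into the existing coursebook scraper's per-section pass might look like. All names here (`Section`, `Eval`, `scrapeSection`, `scrapeSectionEvals`) are hypothetical illustrations, not functions from the actual api-tools codebase.

```go
// Hypothetical sketch only: these types and functions are not part of
// api-tools; they just illustrate pulling evals in the same pass the
// coursebook scraper already makes over each section.
package main

import "fmt"

type Section struct {
	ID string
}

type Eval struct {
	SectionID string
	Rating    float64
}

// scrapeSectionEvals would fetch and parse the eval page for one section
// (stubbed out here with placeholder data).
func scrapeSectionEvals(sectionID string) ([]Eval, error) {
	return []Eval{{SectionID: sectionID, Rating: 4.2}}, nil
}

// scrapeSection stands in for the existing coursebook scrape of a section;
// when includeEvals is set, it also pulls that section's evals in the same pass.
func scrapeSection(sec Section, includeEvals bool) ([]Eval, error) {
	// ... existing coursebook scraping logic for sec would run here ...
	if !includeEvals {
		return nil, nil
	}
	return scrapeSectionEvals(sec.ID)
}

func main() {
	evals, err := scrapeSection(Section{ID: "example-section"}, true)
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Printf("scraped %d evals\n", len(evals))
}
```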

jpahm commented 11 months ago

Yeah so it's fair to say "a day or two" was a vast underestimate of how long this would take to add; regardless, it should be added soon as part of the existing scraper now that other priorities have been taken care of.

jpahm commented 7 months ago

I've completed the scraper component of this work, but there are some concerns regarding IP rate limits that need to be addressed. A data model and the associated database changes also still need to be completed.
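
For illustration, a hedged sketch of what that eval data model could look like, assuming a MongoDB-backed collection like the API's other document types; the struct fields and tags below are assumptions, not the actual Nebula schema.

```go
// Hypothetical eval data model sketch; field names and bson/json tags are
// assumptions for illustration, not the actual Nebula schema.
package schema

import "go.mongodb.org/mongo-driver/bson/primitive"

// Eval represents one scraped course/instructor evaluation record.
type Eval struct {
	ID               primitive.ObjectID `bson:"_id" json:"_id"`
	SectionReference primitive.ObjectID `bson:"section_reference" json:"section_reference"`
	Enrollment       int                `bson:"enrollment" json:"enrollment"`
	Responses        int                `bson:"responses" json:"responses"`
	// Per-question median ratings keyed by question text.
	Medians map[string]float64 `bson:"medians" json:"medians"`
}
```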

jpahm commented 7 months ago

Upon further investigation, I'm not seeing any immediate, great solutions to the IP rate-limit problem. The same problem occurs when scraping courses, but in a far more manageable fashion; scraping evals triggers a long IP rate limit every 30-40 evals or so, which obviously isn't sustainable for scraping en masse.

A solution I proposed to @iamwood would be to set up an API endpoint for evals that parses and returns specific evals on the fly* rather than parsing them all en masse. I'll discuss this alongside some other things once the semester gets going.

Any thoughts on this issue are welcome!

*Preferably alongside some sort of caching.
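
To make the proposal concrete, here's a minimal sketch of an on-the-fly eval endpoint fronted by a naive in-memory cache. The route, handler, and `scrapeEval` helper are all hypothetical; a real version would presumably live behind the API's existing router and use a more durable cache.

```go
// Hypothetical sketch of an on-demand eval endpoint with a naive in-memory
// cache; none of these names come from the actual api-tools or nebula-api code.
package main

import (
	"encoding/json"
	"net/http"
	"sync"
	"time"
)

type cachedEval struct {
	data      []byte
	fetchedAt time.Time
}

var (
	cacheMu sync.Mutex
	cache   = map[string]cachedEval{}
	ttl     = 24 * time.Hour
)

// scrapeEval stands in for fetching and parsing a single eval on demand
// (stubbed out here with placeholder data).
func scrapeEval(sectionID string) ([]byte, error) {
	return json.Marshal(map[string]string{"section": sectionID})
}

func evalHandler(w http.ResponseWriter, r *http.Request) {
	sectionID := r.URL.Query().Get("section")

	// Serve from cache when possible so repeated requests never touch
	// the upstream site (and never burn through the IP rate limit).
	cacheMu.Lock()
	entry, ok := cache[sectionID]
	cacheMu.Unlock()
	if ok && time.Since(entry.fetchedAt) < ttl {
		w.Write(entry.data)
		return
	}

	// Cache miss: scrape just this one eval on the fly.
	data, err := scrapeEval(sectionID)
	if err != nil {
		http.Error(w, "scrape failed", http.StatusBadGateway)
		return
	}
	cacheMu.Lock()
	cache[sectionID] = cachedEval{data: data, fetchedAt: time.Now()}
	cacheMu.Unlock()
	w.Write(data)
}

func main() {
	http.HandleFunc("/evals", evalHandler)
	http.ListenAndServe(":8080", nil)
}
```

The point of the cache is that only first-time requests hit the upstream site, so the per-IP rate limit is spread across user demand instead of being burned by one bulk scrape.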

democat3457 commented 7 months ago

+1 for caching plus on-the-fly evals; I think it's a good compromise.

jpahm commented 5 months ago

So, after putting together an on-demand scraping endpoint for evals, it seems we're now blocked on this front by evals being locked behind captcha verification. I'm not sure there's any way around this, and I'm out of ideas for the time being, so I'm putting this issue on hold in favor of prioritizing other tasks.