StackOverflowMATLABchat / MATLABfcnscrape

Scrape MATLAB's documentation for all function names and output to JSON files for external use

Investigate Methods for Increasing Performance #7

Closed sco1 closed 3 years ago

sco1 commented 3 years ago

Parsing the dynamically served toolbox documentation (R2018b and newer) takes a long time, on the order of 15-20 minutes depending on the release. Why? Who knows! Let's profile it and find out.

The profiles for the static documentation content, which takes around 15-30 seconds on average, show that ~70% of the time is spent on the web requests and ~25% on parsing the responses with BeautifulSoup. While some of the remaining time could probably be cleaned up, why bother?

So, without having profiled it yet, my guess is that Selenium, and/or the way we're utilizing it, is incurring a huge amount of overhead, so let's figure that out and fix it 😃
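For reference, a minimal sketch of how a scraping call could be profiled with the standard library's `cProfile` and `pstats`. The names `profile_call` and `fake_scrape` are hypothetical stand-ins for illustration, not functions from this repository; the real scraping routine would be passed in their place.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top_n=10, **kwargs):
    """Run func under cProfile and return (result, text report).

    The report lists the top_n entries sorted by cumulative time.
    Hypothetical helper, not part of MATLABfcnscrape.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top_n)
    return result, buffer.getvalue()


def fake_scrape(n_pages):
    # Hypothetical placeholder for the real scraping routine
    return [f"page_{i}" for i in range(n_pages)]


if __name__ == "__main__":
    _, report = profile_call(fake_scrape, 1000)
    print(report)
```

Sorting by cumulative time makes it easy to see whether the bulk of the run sits in the browser automation, the web requests, or the parsing.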

edit: After some digging, there's an API-based approach that allows us to ditch Selenium completely. Yay!