HBCUMobility / datacollection

1 stars 3 forks source link

Generate base process for identifying historical locations of departments that have no captures #16

Open machawk1 opened 2 years ago

machawk1 commented 2 years ago

Unlike #7, some departmental web page return 0 mementos, i.e., an empty TimeMap. These web pages or a precursor may have existed at a different URL; for example, from english.university.edu to university.edu/depts/english. Digging into the data will help to reveal these trends. This process will require consulting the historical captures of the universities' home pages, scraping or parsing out the departmental URLs, and identifying where the web page for the departmental web page once resided -- if at all.

We may encounter an issue of new departments at the university that did not exist in the past.

I anticipate this process to go as follows:

  1. Identify a small set of departmental web pages that have no historical captures.
  2. Using an initially manual process, look at the older mementos for the university's homepage to identify an instance of departments existing elsewhere.
  3. Use a tool like BeautifulSoup to scrape/parse out the historical URL (URI-M) of the department and respective faculty pages.