EricWay1024 / nottCrawlerNew

New Crawler for University of Nottingham Course Catalogue (as of October 2024)

Why is selenium necessary? #1

Closed lucienshawls closed 2 weeks ago

lucienshawls commented 2 weeks ago

I wonder whether module/fetch_modules.py could use requests rather than Selenium, since Selenium is significantly slower and requires complicated configuration.

EricWay1024 commented 2 weeks ago

I considered that option. The situation is this: the URL for a module (taking AMCS2033 American Radicalism as an example) looks like

https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Radicalism&MODULE=AMCS2033&CRSEID=016489&LINKA=&LINKB=&LINKC=UDD-ACS

where the parameter CRSEID=016489 is the one we need to obtain. Unfortunately, this does not seem trivial. The CRSEID is not fetched when the list of modules (like here) is loaded; rather, it is fetched after you click on that module in the list, at which point a POST request is sent with the following data (converted from a cURL command thanks to https://curlconverter.com/):

import requests

cookies = {}  # Omitted
headers = {}  # Omitted

data = {
    'ICAJAX': '1',
    'ICNAVTYPEDROPDOWN': '0',
    'ICType': 'Panel',
    'ICElementNum': '0',
    'ICStateNum': '18',
    'ICAction': 'ADDRESS_LINK$0',
    'ICModelCancel': '0',
    'ICXPos': '0',
    'ICYPos': '0',
    'ResponsetoDiffFrame': '-1',
    'TargetFrameName': 'None',
    'FacetPath': 'None',
    'ICFocus': '',
    'ICSaveWarningFilter': '0',
    'ICChanged': '0',
    'ICSkipPending': '0',
    'ICAutoSave': '0',
    'ICResubmit': '0',
    'ICSID': 'LSvvCKSCxe6C7js/87aD51fizvCE3qaSdjvS3zDAogo=',
    'ICActionPrompt': 'false',
    'ICTypeAheadID': '',
    'ICBcDomData': 'C~UnknownValue~EMPLOYEE~HRMS~UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL~UN_CRS_EXT2_FPG~Curriculum Catalogue~UnknownValue~UnknownValue~https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL=UDD-ACS&LINKA=~UnknownValue',
    'ICDNDSrc': '',
    'ICPanelHelpUrl': '',
    'ICPanelName': '',
    'ICPanelControlStyle': 'pst_side2-hidden pst_panel-mode ',
    'ICFind': '',
    'ICAddCount': '',
    'ICAppClsData': '',
    'win0hdrdivPT_SYSACT_RETLST': 'psc_hidden',
    'win0hdrdivPT_SYSACT_HELP': 'psc_hidden',
}

response = requests.post(
    'https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL',
    cookies=cookies,
    headers=headers,
    data=data,
)

where data is pretty complicated. It is actually possible to obtain the CRSEID of modules belonging to the same school by modifying the data fields ICStateNum and ICAction: the former must be incremented by one with every request, and for the latter the number after ADDRESS_LINK$ is the index of the module of interest in the module list. So far so good.

However, it is not yet clear to me how to obtain the CRSEID for modules of other schools. Naturally, you'd think that you should change ICBcDomData, which (judging from a few such POST requests) seems to record the pages you have visited. Further, you'd think you should change the SCHOOL= field there to trick the server into thinking that your last visited page is the module list of the school of interest. And, being very careful, you'd also modify the SCHOOL parameter in the Referer header. But after all this, the request still doesn't give you the right data; the data for the old school is returned, if anything. I suspect that the server remembers the pages you visited through cookies or other intricate means that I have no intention of figuring out, and thus it appears to me that Selenium is the quickest way to get things done.
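A minimal sketch of the same-school trick described above, assuming a captured "prototype" payload like the data dict shown earlier. The helper name payloads_for_school and the shortened prototype dict are hypothetical, and no request is sent. Whether the first replayed request should reuse the captured ICStateNum or start one higher is not stated here; this sketch starts from the captured value.

```python
from typing import Dict, Iterator


def payloads_for_school(prototype: Dict[str, str], module_count: int) -> Iterator[Dict[str, str]]:
    """Yield one POST payload per module index, derived from a captured prototype.

    Hypothetical helper: it only encodes the two observations from this comment,
    i.e. ICStateNum goes up by one per request and the number after
    ADDRESS_LINK$ is the module's index in the list.
    """
    state = int(prototype["ICStateNum"])
    for index in range(module_count):
        payload = dict(prototype)  # copy so the prototype stays untouched
        payload["ICStateNum"] = str(state + index)
        payload["ICAction"] = f"ADDRESS_LINK${index}"
        yield payload


# Shortened stand-in for the full captured form data (assumption for illustration).
prototype = {
    "ICAJAX": "1",
    "ICStateNum": "18",
    "ICAction": "ADDRESS_LINK$0",
    "ICSID": "<captured value>",
}

for payload in payloads_for_school(prototype, 3):
    print(payload["ICStateNum"], payload["ICAction"])
```

Each yielded dict could then be passed as data= to requests.post, together with the cookies and headers of the captured request.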

EricWay1024 commented 2 weeks ago

That said, even with the knowledge above, it is possible to make the crawler (much?) faster. For each school, we still need to use Selenium to get a 'prototype' request like above by clicking on the first module of the school; but then we can obtain other modules using requests only by modifying the request data in the way stated above. But then the Selenium dependency is still there, so not a perfect solution.

EricWay1024 commented 2 weeks ago

Or maybe we want to use requests.Session() to help us handle the Cookies. That may work but I haven't investigated.

EricWay1024 commented 2 weeks ago

The following code doesn't work; post_data.html reads <title>Campus System Requires Cookies</title>. We do have cookies sent, as the printout shows, and list.html is as expected. I'm not sure why this happens; maybe we need to change the ICSID field as well, but I have no idea where that comes from. Anyway, dropping the Selenium dependency seems to be a hard problem. I'll leave my findings here in case anyone who wants to tackle the issue finds them useful.


import requests
from requests.utils import dict_from_cookiejar
from pprint import pprint

# Example headers to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}

SCHOOL = 'USC-ENGL'
session = requests.Session()

response = session.get(f'https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT2_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL={SCHOOL}&LINKA=&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL={SCHOOL}', headers=headers)

with open("list.html", "w", encoding="utf-8") as f:
    f.write(response.text)

data = {
    'ICAJAX': '1',
    'ICNAVTYPEDROPDOWN': '0',
    'ICType': 'Panel',
    'ICElementNum': '0',
    'ICStateNum': '1',
    'ICAction': 'ADDRESS_LINK$1',
    'ICModelCancel': '0',
    'ICXPos': '0',
    'ICYPos': '0',
    'ResponsetoDiffFrame': '-1',
    'TargetFrameName': 'None',
    'FacetPath': 'None',
    'ICFocus': '',
    'ICSaveWarningFilter': '0',
    'ICChanged': '0',
    'ICSkipPending': '0',
    'ICAutoSave': '0',
    'ICResubmit': '0',
    'ICSID': 'LSvvCKSCxe6C7js/87aD51fizvCE3qaSdjvS3zDAogo=',
    'ICActionPrompt': 'false',
    'ICTypeAheadID': '',
    'ICBcDomData': f'C~UnknownValue~EMPLOYEE~HRMS~UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL~UN_CRS_EXT2_FPG~Curriculum Catalogue~UnknownValue~UnknownValue~https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL={SCHOOL}&LINKA=~UnknownValue',
    'ICDNDSrc': '',
    'ICPanelHelpUrl': '',
    'ICPanelName': '',
    'ICPanelControlStyle': 'pst_side2-hidden pst_panel-mode ',
    'ICFind': '',
    'ICAddCount': '',
    'ICAppClsData': '',
    'win0hdrdivPT_SYSACT_RETLST': 'psc_hidden',
    'win0hdrdivPT_SYSACT_HELP': 'psc_hidden',
}

pprint(dict_from_cookiejar(session.cookies))

response = session.post(
    'https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL',
    data=data,
)

with open("post_data.html", "w", encoding='utf-8') as f:
    f.write(response.text)

lucienshawls commented 2 weeks ago

In list.html, there is an ICSID field, along with some other variables:

<input type='hidden' name='ICSID' id='ICSID' value='DgIumPq0ZvYZTuW2qjmdLIZ+67qu0xEV2gFAe94Zyxc=' />

So I use bs4 to obtain the ICSID from the first request (that gets list.html), and then send the second request with the specified ICSID:

from bs4 import BeautifulSoup

# Read the session-specific ICSID from the hidden input in the list page
soup = BeautifulSoup(response.text, "html.parser")
element = soup.find(id="ICSID")
ICSID = element.get("value")
# print(ICSID)
data["ICSID"] = ICSID  # The rest of data stays the same as in your code

and get something new (not <title>Campus System Requires Cookies</title>). See if it helps.

EricWay1024 commented 2 weeks ago

Very cool! I think we get the right thing.

There is a line in the returned document:

processing_win0(0,3000);]]></GENSCRIPT><GENSCRIPT id='onloadScript'><![CDATA[document.location='/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Processing%20Sentences%20and%20Discourse&MODULE=ENGL4387&CRSEID=035633&LINKA=&LINKB=&LINKC=USC-ENGL';]]></GENSCRIPT>

which contains the link we need. It's just a matter of regex to get it.
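A minimal sketch of that regex step, using the line quoted above as sample input. The names LOCATION_RE and extract_module_link are hypothetical, and the pattern assumes the link itself never contains a single quote, which has not been checked against every module title.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlsplit

# Capture everything between document.location=' and the next single quote.
LOCATION_RE = re.compile(r"document\.location='([^']+)'")


def extract_module_link(document: str) -> Optional[str]:
    """Return the relative module link from the AJAX response, or None."""
    match = LOCATION_RE.search(document)
    return match.group(1) if match else None


sample = (
    "processing_win0(0,3000);]]></GENSCRIPT><GENSCRIPT id='onloadScript'>"
    "<![CDATA[document.location='/psp/csprd_pub/EMPLOYEE/HRMS/c/"
    "UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U"
    "&TYPE=Module&YEAR=2024&TITLE=Processing%20Sentences%20and%20Discourse"
    "&MODULE=ENGL4387&CRSEID=035633&LINKA=&LINKB=&LINKC=USC-ENGL';]]></GENSCRIPT>"
)

link = extract_module_link(sample)
query = parse_qs(urlsplit(link).query)
print(query["MODULE"][0], query["CRSEID"][0])
```

Parsing the query string afterwards gives direct access to MODULE and CRSEID without a second regex.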

Thanks a lot for the help!

lucienshawls commented 2 weeks ago

That is great!

But, just out of curiosity, why am I getting CRSEID=035632 while you got CRSEID=035633? I did not modify anything in the data except ICSID. Did you use different data? If not, something strange is happening.

EricWay1024 commented 2 weeks ago

Sorry for the confusion. I modified the index just to test it and I copied the modified result.

lucienshawls commented 2 weeks ago

That is OK.

I tried to eliminate as many keys in data as possible. Based on your explanations and the tests we have been through, I kept ICStateNum, ICAction and ICSID without trying to remove them, then ran a binary search over the remaining keys. The result is the reduced data in the code below:

import requests
from requests.utils import dict_from_cookiejar
from pprint import pprint
from bs4 import BeautifulSoup

# Example headers to mimic a browser
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

SCHOOL = "USC-ENGL"
session = requests.Session()

response = session.get(
    f"https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT2_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL={SCHOOL}&LINKA=&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL={SCHOOL}",
    headers=headers,
)

with open("list.html", "w", encoding="utf-8") as f:
    f.write(response.text)

soup = BeautifulSoup(response.text, "html.parser")
element = soup.find(id="ICSID")
ICSID = element.get("value")
data = {
    "ICAJAX": "1",
    "ICStateNum": "1",
    "ICAction": "ADDRESS_LINK$1",
    "ICSID": ICSID,
}

pprint(dict_from_cookiejar(session.cookies))

response = session.post(
    "https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL",
    data=data,
)

with open("post_data.html", "w", encoding="utf-8") as f:
    f.write(response.text)

EricWay1024 commented 2 weeks ago

Thanks for the effort, but I don't really mind leaving the fields as-is to make the crawler look more like an actual user.

lucienshawls commented 2 weeks ago

Yes, that is always a good idea. So I assume this issue can be safely closed. :)

EricWay1024 commented 2 weeks ago

Not yet, I'll close it when I update the crawler to remove Selenium. There's still work to do, namely how to use concurrency to get the module links in a more efficient way. Still working... 👨‍🔧

lucienshawls commented 2 weeks ago

Turns out, a session CAN be reused. Once a session is established and the ICSID is acquired, you can get all module URLs with the same session and ICSID. What's more, concurrency works fine.

Oh, by the way, ICStateNum can also be removed. If you insist that it should not be removed, there is actually no need to increase it by one each time. Just set it to "1" and everything is fine.

The example module page that I used: Curriculum Catalogue

Code with concurrency:

import re
import requests
import concurrent.futures
from bs4 import BeautifulSoup

HOST = "https://campus.nottingham.ac.uk"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Connection": "keep-alive",
    "Upgrade-Insecure-Requests": "1",
}

def init_session():
    session = requests.Session()
    url = f"{HOST}/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL"
    params = {
        "PAGE": "UN_CRS_EXT2_FPG",
        "CAMPUS": "U",
        "TYPE": "Module",
        "YEAR": "2024",
        "TITLE": "",
        "Module": "",
        "SCHOOL": "UDD-ACS",
        "LINKA": "",
    }
    # https://campus.nottingham.ac.uk/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT2_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=&Module=&SCHOOL=UDD-ACS&LINKA=
    response = session.get(
        url=url,
        params=params,
        headers=HEADERS,
    )
    soup = BeautifulSoup(response.text, "html.parser")
    icsid_entry = soup.find(id="ICSID")
    icsid = icsid_entry.get("value")

    rows = soup.find_all("tr", class_="ps_grid-row")
    return {"session": session, "ICSID": icsid, "total": len(rows)}

def get_module_url(session: requests.Session, icsid: str, index: int):
    data = {
        "ICAJAX": "1",
        "ICStateNum": "1",
        "ICAction": f"ADDRESS_LINK${str(index)}",
        "ICSID": icsid,
    }
    response = session.post(
        f"{HOST}/psc/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL",
        data=data,
    )
    # Escape the dot and stop at the closing quote so nothing after the URL is captured
    re_obj = re.search(r"document\.location='([^']*)';", response.text)
    if re_obj:
        module_url = f"{HOST}{re_obj.group(1)}"
    else:
        module_url = ""
        # raise RuntimeError("URL not found")
    return module_url

def get_module_list():
    cred = init_session()
    session: requests.Session = cred["session"]
    icsid: str = cred["ICSID"]
    total: int = cred["total"]
    module_list = [None for _ in range(total)]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        future_to_index = {
            executor.submit(
                get_module_url,
                session=session,
                icsid=icsid,
                index=index,
            ): index
            for index in range(cred["total"])
        }
        for future in concurrent.futures.as_completed(future_to_index):
            index = future_to_index[future]
            module_url = future.result()

            print(f"Module {index} URL: {module_url}")

            module_list[index] = module_url
    session.close()
    return module_list

import time

start = time.time()
module_list = get_module_list()
end = time.time()

print(f"Time taken: {end - start:.2f}s")

with open("module_list.txt", "w", encoding="utf-8") as f:
    for module in module_list:
        f.write(f"{module}\n")

Output:

Module 0 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Prohibition%20America%20(UG%2020%20credits)&MODULE=AMCS3024&CRSEID=013585&LINKA=&LINKB=&LINKC=UDD-ACS
Module 5 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Race,%20Power,%20Money%20and%20the%20Making%20of%20North%20America%201607%20-%201900&MODULE=AMCS1001&CRSEID=010244&LINKA=&LINKB=&LINKC=UDD-ACS
Module 11 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Approaches%20to%20Contemporary%20American%20Culture%201:%20An%20Introduction&MODULE=AMCS1030&CRSEID=018020&LINKA=&LINKB=&LINKC=UDD-ACS
Module 3 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Popular%20Music%20Cultures%20and%20Countercultures&MODULE=AMCS3045&CRSEID=015501&LINKA=&LINKB=&LINKC=UDD-ACS
Module 8 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Approaches%20to%20Contemporary%20American%20Culture%202:%20Developing%20Themes%20and%20Perspectives&MODULE=AMCS1031&CRSEID=017127&LINKA=&LINKB=&LINKC=UDD-ACS
Module 4 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Radicalism&MODULE=AMCS2033&CRSEID=016489&LINKA=&LINKB=&LINKC=UDD-ACS
Module 12 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=African%20American%20History%20and%20Culture&MODULE=AMCS2052&CRSEID=018148&LINKA=&LINKB=&LINKC=UDD-ACS
Module 10 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Literature%20and%20Culture%202:%20Since%201940&MODULE=AMCS1011&CRSEID=010270&LINKA=&LINKB=&LINKC=UDD-ACS
Module 1 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Dissertation%20in%20American%20and%20Canadian%20Studies&MODULE=AMCS3004&CRSEID=008334&LINKA=&LINKB=&LINKC=UDD-ACS
Module 2 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Dissertation%20in%20American%20and%20Canadian%20Studies&MODULE=AMCS3006&CRSEID=008532&LINKA=&LINKB=&LINKC=UDD-ACS
Module 7 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Freedom%20Empire,%20Rights,%20and%20Capitalism%20in%20Modern%20US%20History,%201900-Present&MODULE=AMCS1009&CRSEID=010269&LINKA=&LINKB=&LINKC=UDD-ACS
Module 17 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Magazine%20Culture:%20Journalism,%20Advertising%20and%20Fiction%20from%20Independence%20to%20the%20Internet%20Age&MODULE=AMCS3069&CRSEID=031751&LINKA=&LINKB=&LINKC=UDD-ACS
Module 23 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=MRes%20Research%20Skills%201&MODULE=AMCS4082&CRSEID=033599&LINKA=&LINKB=&LINKC=UDD-ACS
Module 20 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Popular%20Music%20Cultures%20&%20Countercultures%20(PGT%2020)&MODULE=AMCS4070&CRSEID=031971&LINKA=&LINKB=&LINKC=UDD-ACS
Module 16 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Contemporary%20North%20American%20Fiction&MODULE=AMCS2056&CRSEID=030744&LINKA=&LINKB=&LINKC=UDD-ACS
Module 26 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=US%20Foreign%20Policy,%201989-Present%20(UG%20-%2020%20credits)&MODULE=AMCS3025&CRSEID=013584&LINKA=&LINKB=&LINKC=UDD-ACS
Module 6 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Literature%20and%20Culture%201:%201830-1940&MODULE=AMCS1005&CRSEID=010246&LINKA=&LINKB=&LINKC=UDD-ACS
Module 15 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Sexuality%20in%20American%20History%20(Level%203)&MODULE=AMCS3061&CRSEID=022814&LINKA=&LINKB=&LINKC=UDD-ACS
Module 13 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=North%20American%20Regions&MODULE=AMCS2054&CRSEID=022802&LINKA=&LINKB=&LINKC=UDD-ACS
Module 19 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Immigration%20and%20Ethnicity%20in%20the%20United%20States&MODULE=AMCS2007&CRSEID=011262&LINKA=&LINKB=&LINKC=UDD-ACS
Module 18 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=North%20American%20Film%20Adaptations%20(Level%203)&MODULE=AMCS3068&CRSEID=031902&LINKA=&LINKB=&LINKC=UDD-ACS
Module 25 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=The%20CIA%20and%20US%20Foreign%20Policy,%201945-2012&MODULE=AMCS2058&CRSEID=034740&LINKA=&LINKB=&LINKC=UDD-ACS
Module 9 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=The%20US%20&%20the%20World%20in%20the%20American%20Century:%20US%20Foreign%20Policy,%201898-2008&MODULE=AMCS2048&CRSEID=017243&LINKA=&LINKB=&LINKC=UDD-ACS
Module 24 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Troubled%20Empire:%20The%20Projection%20of%20American%20Global%20Power%20from%20Pearl%20Harbor%20to%20Covid-19&MODULE=AMCS3074&CRSEID=034232&LINKA=&LINKB=&LINKC=UDD-ACS
Module 14 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Key%20Texts%20in%20American%20Social%20and%20Political%20Thought&MODULE=AMCS2055&CRSEID=022803&LINKA=&LINKB=&LINKC=UDD-ACS
Module 21 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=From%20Landscapes%20to%20Mixtapes:%20Canadian%20Literature,%20Film%20and%20Culture&MODULE=AMCS1008&CRSEID=011286&LINKA=&LINKB=&LINKC=UDD-ACS
Module 22 URL: https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Varieties%20of%20Classic%20American%20Film,%20Television%20and%20Literature%20Since%201950&MODULE=AMCS3071&CRSEID=033030&LINKA=&LINKB=&LINKC=UDD-ACS
Time taken: 8.07s
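
The lines above are printed in completion order, not list order, because as_completed yields futures as they finish. Here is a minimal, network-free sketch of why the collected list still comes out in index order; fake_fetch is a hypothetical stand-in for get_module_url, and the random sleep only simulates varying response times.

```python
import concurrent.futures
import random
import time


def fake_fetch(index: int) -> str:
    """Stand-in for a network call; sleeps a random short time (assumption)."""
    time.sleep(random.uniform(0.0, 0.05))
    return f"module-{index}"


total = 8
results = [None] * total
with concurrent.futures.ThreadPoolExecutor() as executor:
    future_to_index = {executor.submit(fake_fetch, i): i for i in range(total)}
    for future in concurrent.futures.as_completed(future_to_index):
        # Completion order is arbitrary, but each result lands in its own slot.
        results[future_to_index[future]] = future.result()

print(results)
```

Because every future is mapped back to the index it was submitted with, the final list does not depend on which request finishes first.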

Temporary file module_list.txt (correct order):

https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Prohibition%20America%20(UG%2020%20credits)&MODULE=AMCS3024&CRSEID=013585&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Dissertation%20in%20American%20and%20Canadian%20Studies&MODULE=AMCS3004&CRSEID=008334&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Dissertation%20in%20American%20and%20Canadian%20Studies&MODULE=AMCS3006&CRSEID=008532&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Popular%20Music%20Cultures%20and%20Countercultures&MODULE=AMCS3045&CRSEID=015501&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Radicalism&MODULE=AMCS2033&CRSEID=016489&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Race,%20Power,%20Money%20and%20the%20Making%20of%20North%20America%201607%20-%201900&MODULE=AMCS1001&CRSEID=010244&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Literature%20and%20Culture%201:%201830-1940&MODULE=AMCS1005&CRSEID=010246&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Freedom%20Empire,%20Rights,%20and%20Capitalism%20in%20Modern%20US%20History,%201900-Present&MODULE=AMCS1009&CRSEID=010269&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Approaches%20to%20Contemporary%20American%20Culture%202:%20Developing%20Themes%20and%20Perspectives&MODULE=AMCS1031&CRSEID=017127&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=The%20US%20&%20the%20World%20in%20the%20American%20Century:%20US%20Foreign%20Policy,%201898-2008&MODULE=AMCS2048&CRSEID=017243&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Literature%20and%20Culture%202:%20Since%201940&MODULE=AMCS1011&CRSEID=010270&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Approaches%20to%20Contemporary%20American%20Culture%201:%20An%20Introduction&MODULE=AMCS1030&CRSEID=018020&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=African%20American%20History%20and%20Culture&MODULE=AMCS2052&CRSEID=018148&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=North%20American%20Regions&MODULE=AMCS2054&CRSEID=022802&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Key%20Texts%20in%20American%20Social%20and%20Political%20Thought&MODULE=AMCS2055&CRSEID=022803&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Sexuality%20in%20American%20History%20(Level%203)&MODULE=AMCS3061&CRSEID=022814&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Contemporary%20North%20American%20Fiction&MODULE=AMCS2056&CRSEID=030744&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=American%20Magazine%20Culture:%20Journalism,%20Advertising%20and%20Fiction%20from%20Independence%20to%20the%20Internet%20Age&MODULE=AMCS3069&CRSEID=031751&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=North%20American%20Film%20Adaptations%20(Level%203)&MODULE=AMCS3068&CRSEID=031902&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Immigration%20and%20Ethnicity%20in%20the%20United%20States&MODULE=AMCS2007&CRSEID=011262&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Popular%20Music%20Cultures%20&%20Countercultures%20(PGT%2020)&MODULE=AMCS4070&CRSEID=031971&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=From%20Landscapes%20to%20Mixtapes:%20Canadian%20Literature,%20Film%20and%20Culture&MODULE=AMCS1008&CRSEID=011286&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Varieties%20of%20Classic%20American%20Film,%20Television%20and%20Literature%20Since%201950&MODULE=AMCS3071&CRSEID=033030&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=MRes%20Research%20Skills%201&MODULE=AMCS4082&CRSEID=033599&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=Troubled%20Empire:%20The%20Projection%20of%20American%20Global%20Power%20from%20Pearl%20Harbor%20to%20Covid-19&MODULE=AMCS3074&CRSEID=034232&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=The%20CIA%20and%20US%20Foreign%20Policy,%201945-2012&MODULE=AMCS2058&CRSEID=034740&LINKA=&LINKB=&LINKC=UDD-ACS
https://campus.nottingham.ac.uk/psp/csprd_pub/EMPLOYEE/HRMS/c/UN_PROG_AND_MOD_EXTRACT.UN_PLN_EXTRT_FL_CP.GBL?PAGE=UN_CRS_EXT4_FPG&CAMPUS=U&TYPE=Module&YEAR=2024&TITLE=US%20Foreign%20Policy,%201989-Present%20(UG%20-%2020%20credits)&MODULE=AMCS3025&CRSEID=013584&LINKA=&LINKB=&LINKC=UDD-ACS

EricWay1024 commented 2 weeks ago

Thanks for those brilliant ideas, @lucienshawls! The multi-threaded structure is now used in the new module crawler (see here), which is much faster and more stable than the Selenium-based one.