gannonprudhomme / AutoScheduler

Basics for A&M auto course scheduler

Course data request sequencing #10

Open aksiksi opened 4 years ago

aksiksi commented 4 years ago

Hello!

I stumbled across your page while looking for others who have implemented a Banner data scraper that works with the "updated" design.

I maintain a course scheduling web app for my alma mater, UAE University. The app recently broke after the new Banner version was pushed out, so I've been trying to get things working again.

I currently have a working proof-of-concept. Initially, I used Selenium + headless Firefox to get the first set of cookies to use with Requests, but apparently the API doesn't even check them?

Anyways, I can update this issue with a flow of requests once fully tested, if you're interested. Hopefully it will also work for TAMU's Banner instance ;)

gannonprudhomme commented 4 years ago

Hey!

I would definitely be interested! My next task was to figure out how to replicate those cookies that Banner uses, as I can retrieve some data (terms, departments) without them, but more specific data like courses requires them. As such, I'm very eager to see your working solution.

Was the JSESSIONID really the only cookie you needed? I've been able to get all of the courses by copying that cookie into Postman and searching that way, but I haven't found a way to replicate the cookies in my code. As such, I was also planning to run a headless browser to emulate them.

aksiksi commented 4 years ago

Here is what I found.

From the UI, you start a new search by selecting the term/semester (the host below is my university's, but the URL path should be the same): https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/term/termSelection?mode=search

When you submit this form, a POST request is sent out that starts a search. The relevant data sent with the request consists of a "uniqueSessionId" and "term".

I had a look at the client-side JS, and this is basically how they generate the session ID and send the request:

import random
import string
import time

import requests

session = requests.Session()
latest_term = "202010"  # a term code, as selected on the term page

# Generate a random 18-character session ID.
# From the Banner JS: a random 5-letter string + the current timestamp in ms
session_id = "".join(random.sample(string.ascii_lowercase, 5)) + str(int(time.time() * 1000))

# Start a search
data = {
    "uniqueSessionId": session_id,
    "dataType": "json",
    "term": latest_term,
}
resp = session.post("https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/term/search?mode=search", data=data)

The "uniqueSessionId" seems to tie the selected term to any subsequent search.

I believe a GET request to this path will change the term ID for the current session ID, but haven't yet tested it: /StudentRegistrationSsb/ssb/term/saveTerm?mode=search&uniqueSessionId={session_id}&dataType=json&term=202010
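
If it does work, the request would look something like this (untested, reusing the session and session_id from above):

resp = session.get(
    "https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/term/saveTerm"
    f"?mode=search&uniqueSessionId={session_id}&dataType=json&term=202010"
)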

To actually search for courses:

# Term appears to be ignored here
subject = "MATH"
resp = session.get(f"https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/searchResults/searchResults?txt_subject={subject}&txt_term={latest_term}&startDatepicker=&endDatepicker=&pageOffset=0&pageMaxSize=10&sortColumn=subjectDescription&sortDirection=asc&uniqueSessionId={session_id}")

To start a new search for a different subject with the same session ID and term, you need to send a reset request:

resp = session.post("https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/classSearch/resetDataForm", data={"uniqueSessionId": session_id})

So as far as I can tell, the trick lies in:

  1. Generating a session ID and passing it to every subsequent request
  2. Starting a new search (term selection) via POST

As far as I can tell, cookies are not actually checked, at least not in my case.
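
Putting the snippets together, one full search cycle would look roughly like this (my consolidation of the above, not tested end to end):

def fetch_subject(session, session_id, term, subject):
    # 1. Tie the term to this session ID
    session.post(
        "https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/term/search?mode=search",
        data={"uniqueSessionId": session_id, "dataType": "json", "term": term},
    )

    # 2. Run the course search
    resp = session.get(
        "https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/searchResults/searchResults"
        f"?txt_subject={subject}&txt_term={term}&pageOffset=0&pageMaxSize=10"
        f"&sortColumn=subjectDescription&sortDirection=asc&uniqueSessionId={session_id}"
    )

    # 3. Reset so the next search with this session ID returns fresh data
    session.post(
        "https://eservices.uaeu.ac.ae/StudentRegistrationSsb/ssb/classSearch/resetDataForm",
        data={"uniqueSessionId": session_id},
    )

    return resp.json()["data"]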

gannonprudhomme commented 4 years ago

This worked perfectly! You can see my implementation of it here.

For TAMU at least, the StudentRegistrationSsb/ssb/term/saveTerm URL isn't necessary to set the current term, from what I have seen. Simply changing the term parameter in the URLs was all I needed.

Also, in regards to cookies, I believe the cookies that are given (JSESSIONID, etc.) are "attached" to the uniqueSessionId: when I remove the uniqueSessionId from the course search after having searched with it previously, it sends back the response it just gave me (similar to what happens when you try to search twice without resetting). However, when I remove both the uniqueSessionId and the cookies, it just returns empty data. That being said, you're right that cookies aren't "checked", in the sense that you don't have to explicitly provide any.

Let me know if there's anything I can help with; you've been a giant help and I can't thank you enough 😄

aksiksi commented 4 years ago

Great to hear it worked for you! I'll probably implement something similar to grab course data.

I'll update this thread if I hit any issues on my side; please do the same. I'm hoping this will be enough when it comes to course data.

gannonprudhomme commented 4 years ago

Sounds good, will do.

Also, just a thought, but we could make this a unified API/Python package, where when you create the Banner/package object you just pass in the school-specific URL (i.e. eservices.uaeu.ac.ae for UAEU, and compassxe-ssb.tamu.edu for TAMU). This is assuming we wouldn't get, like, a cease and desist from Blackboard haha.

aksiksi commented 4 years ago

That is a good idea actually. What kind of APIs would make sense from your side?

I wrote a basic version that takes one or more terms and, optionally, a list of subjects and fetches all of the data using asyncio and aiohttp. See: https://github.com/aksiksi/jadawil/blob/new-banner/grabber_v2.py

But it seems like you get rate limited fairly quickly. I'm seeing responses like this:

{'success': False, 'totalCount': 111, 'data': [], 'pageOffset': 0, 'pageMaxSize': 500,
 'sectionsFetchedCount': 111, 'pathMode': 'search',
 'searchResultsConfigs': [
    {'config': 'courseTitle', 'display': 'Title', 'title': 'Title', 'width': '9%'},
    {'config': 'subjectDescription', 'display': 'Subject Description', 'title': 'Subject Description', 'width': '5%'},
    {'config': 'courseNumber', 'display': 'Course Number', 'title': 'Course Number', 'width': '3%'},
    {'config': 'sequenceNumber', 'display': 'Section', 'title': 'Section', 'width': '3%'},
    {'config': 'creditHours', 'display': 'Hours', 'title': 'Hours', 'width': '3%'},
    {'config': 'courseReferenceNumber', 'display': 'CRN', 'title': 'CRN', 'width': '3%'},
    {'config': 'term', 'display': 'Term', 'title': 'Term', 'width': '3%'},
    {'config': 'instructor', 'display': 'Instructor', 'title': 'Instructor', 'width': '8%'},
    {'config': 'meetingTime', 'display': 'Meeting Times', 'title': 'Meeting Times', 'width': '15%'},
    {'config': 'campus', 'display': 'Campus', 'title': 'Campus', 'width': '3%'},
    {'config': 'status', 'display': 'Status', 'title': 'Status', 'width': '6%'},
    {'config': 'scheduleType', 'display': 'Schedule Type', 'title': 'Schedule Type', 'width': '5%'},
    {'config': 'attribute', 'display': 'Attribute', 'title': 'Attribute', 'width': '14%'},
 ],
 'ztcEncodedImage': None}

success is False, yet a correct count is returned...

aksiksi commented 4 years ago

OK, so my current theory is that unauthenticated users get rate limited fairly quickly. This might be configurable by the university, so you might not hit this issue.

I'm probably going to use Selenium to login first, then pass the cookies to Python to fetch the data through direct HTTP requests.

edit: Nevermind... looks like pageMaxSize should be less than 50 or so.

gannonprudhomme commented 4 years ago

Yeah, I don't believe A&M imposes those restrictions. I set my pageMaxSize to 1000 and it consistently gets me every course in the department/subject.

aksiksi commented 4 years ago

Good to know. The IT folks at UAEU probably messed up the config as paging doesn't even seem to work reliably through the web UI...

But I think they've set the max results per page to 50, both on the frontend and the backend. My API will therefore need some basic paging support: request 50 results at a time and keep making requests until the data array comes back empty.

FYI, the pageOffset refers to the results index, not the page index. So requests would go like this for the case of 110 results:

pageOffset=0,   pageMaxSize=50 # 50 results returned
pageOffset=50,  pageMaxSize=50 # Next 50
pageOffset=100, pageMaxSize=50 # Last 10
pageOffset=150, pageMaxSize=50 # Empty, but success == True
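
In code, my paging loop would look something like this (a sketch; fetch_page is a hypothetical wrapper around the search GET from earlier):

PAGE_SIZE = 50  # the server-side cap described above

def get_all_results(fetch_page):
    results, offset = [], 0
    while True:
        page = fetch_page(offset, PAGE_SIZE)
        if not page["data"]:  # an empty page means we've passed the last result
            return results
        results += page["data"]
        offset += PAGE_SIZE  # pageOffset indexes results, not pages
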
gannonprudhomme commented 4 years ago

Now that I think about it, how can we run the course searches asynchronously while still ensuring that the requests are sent in the order search1->reset1->search2->reset2, and so on? My implementation ran into the problem where it would do something like search1->search2->reset1->reset2->etc. (not in that literal order, but along those lines), so it frequently returned the incorrect list of courses.

Likewise, I tested your code and changed lines 73-78 in get_courses to:

dept_retrieved = ''
for info in data["data"]:
    dept_retrieved = info['subject']

print(subject + ' retrieved ' + dept_retrieved)

and it printed

AEGD retrieved AEGD
DCED retrieved AEGD
AERO retrieved AEGD
GEOL retrieved AEGD
DPHS retrieved AEGD
GENE retrieved AEGD
EEBL retrieved AEGD
AERS retrieved AEGD
...

Note: I iterated through all of the subjects in TAMU to run the above, and passed them into grabber.fetch where you had ["MATH", "ELEC"]

aksiksi commented 4 years ago

Now that I think about it, how can we run the course searches asynchronously while still ensuring that the requests are sent in the order search1->reset1->search2->reset2, and so on?

Good catch.

I am assuming that these are searches under the same term? Because the session ID seems to be tied to each term.

We would need to sequence the subject/department tasks so that each one finishes before the next starts. The loop would now look like this:

results = []
for task in tasks:
    results += await task

Does the search need to be reset in between as well? If yes, the search start and reset should move into that task.

gannonprudhomme commented 4 years ago

I am assuming that these are searches under the same term?

Yes they were.

Does the search need to be reset in between as well? If yes, the search start and reset should move into that task.

For me, simply resetting after every course search (that is, after each get_courses call, rather than after each search in your code) works.

If we're blocking between each request so they run sequentially, isn't using aiohttp redundant? While implementing the change you mentioned did indeed fix it, the run time of the aiohttp version of my code is identical to the one using requests.

The only way I can think of to get around this is having multiple ClientSession objects (each with their own uniqueSessionId) and balancing the requests between them. That is, each ClientSession object pulls a subject out of some buffer, gets the courses for that subject, resets the search, and repeats until the buffer is empty.
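
A sketch of what I'm picturing (untested, my naming; get_courses and reset_search stand in for the request logic from earlier in the thread, and each worker would create its own uniqueSessionId on startup):

import asyncio

import aiohttp

async def worker(queue: asyncio.Queue, results: list):
    async with aiohttp.ClientSession() as session:
        while True:
            try:
                subject = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # buffer is empty; this worker is done
            results += await get_courses(session, subject)
            await reset_search(session)

async def scrape(subjects, num_workers=10):
    queue = asyncio.Queue()
    for subject in subjects:
        queue.put_nowait(subject)

    results = []
    await asyncio.gather(*(worker(queue, results) for _ in range(num_workers)))
    return results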

aksiksi commented 4 years ago

If we're blocking between each request so they run sequentially, isn't using aiohttp redundant?

Yes, with this change, only the terms are parallelized, so you'll only see a speed gain if grabbing multiple terms at once.

We can try having each search happen in get_courses with a new ClientSession for each course search. So a new uniqueSessionId would be created for each course search operation.

search() would then reduce to something like this:

    async def search(self, term, subjects):
        """Start a new Banner course search for a given term."""
        # Get all subjects for this term
        async with ClientSession() as client:
            if not subjects:
                subjects = [each["code"] for each in
                            await self.get_subjects(term, client)]

        # Get courses for all subjects in the term
        tasks = [self.get_courses(term, subject) for subject in subjects]
        results = []

        for result in await asyncio.gather(*tasks):
            results += result

        return results

I am not sure about the performance cost of creating a session for each subject/dept. If the cost is high, we might need to resort to using a fixed number of ClientSession instances.

EDIT

So I tried this approach out and it seems to work fine. It runs all subjects in parallel and the searches don't interfere with one another. I'm still stuck trying to get paging to work properly, though :/

gannonprudhomme commented 4 years ago

So for me, get_subjects is called in a completely different scenario and all of the subjects are already saved as Django models, so the main focus of this part for me is get_courses. As such, instead of keeping the for result in await asyncio.gather(*tasks) loop for get_courses, I changed my search to only take one subject, which looks like this:

async def search(self, dept: str):
    loop = asyncio.get_running_loop()

    result = []
    async with aiohttp.ClientSession(loop=loop) as session:
        # Creates a fresh uniqueSessionId for this ClientSession
        await self.create_session(session)

        result = await self.get_courses(session, dept)

    return result

Then, I created a search_all function that gathers up a collection of these tasks, just like you have it for get_courses:

async def search_all(self, departments: List[str]):
    loop = asyncio.get_running_loop()
    tasks = [self.search(dept) for dept in departments]
    results = []

    for result in await asyncio.gather(*tasks, loop=loop):
        results.append(result)

    return results

This reduced the scraping of every course/section for a single term from 3-5 minutes to ~45 seconds.

aksiksi commented 4 years ago

Good stuff! Do you scrape subjects/departments elsewhere? Or are you expecting them to remain static?

Edit: Oops, missed your first sentence 😄

gannonprudhomme commented 4 years ago

That's honestly something I hadn't considered; I just assumed that they would remain static (or at the very least, that they would only add new ones, not modify existing ones). Something I'll check into, though.

Also, this may not necessarily pertain to you, since I'm mainly doing it because saving all of the models (Course, Section, Instructor, etc.) to the database in Django takes around 45 seconds, but I may add some form of callback function to my get_courses. I would use it to push the data onto a queue for a Django worker thread to pop off and save to the database while the other Banner requests are still running.
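
Roughly what I'm picturing (save_to_database being a hypothetical stand-in for the actual Django save logic):

import queue
import threading

save_queue = queue.Queue()

def db_worker():
    while True:
        batch = save_queue.get()
        if batch is None:  # sentinel: scraping has finished
            break
        save_to_database(batch)

threading.Thread(target=db_worker, daemon=True).start()

# get_courses would then call save_queue.put(course_data) as each response
# comes back, and put None once all requests have completed.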

aksiksi commented 4 years ago

If you want a more robust scraping + result handling and storage framework, I would suggest looking into Scrapy. Your “scraping” would consist of performing the necessary requests, and then you could use the DB backend to dump results as they come in.

The other approach would be to look for an async DB library so that you can stream your results into the DB as they come in.

Granted, you probably won’t get to use your Django models directly in either case.
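
For the async-DB route, a minimal sketch using, say, aiosqlite (just my example choice of library; the course table is hypothetical):

import aiosqlite

async def save_batch(rows):
    async with aiosqlite.connect("courses.db") as db:
        await db.executemany(
            "INSERT INTO course (crn, subject, title) VALUES (?, ?, ?)", rows
        )
        await db.commit()

Each get_courses task could then await save_batch(...) as soon as its data comes back.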

gannonprudhomme commented 4 years ago

Thanks! That's something I'll look into. I'll let you know what I end up doing.

Also, one thing I've noticed with my BannerRequests class when I attempt to scrape all 201 subjects: after it creates all the session objects and sends the GETs to the course search URL, I frequently get this output:

Unclosed connection
client_connection: Connection<ConnectionKey(host='compassxe-ssb.tamu.edu', port=443, is_ssl=True, ssl=None, proxy=None, proxy_auth=None, proxy_headers_hash=None)>

From what I can tell, the max number of ClientSession objects that can exist at once is around 100. As such, I think aiohttp is automatically closing other session objects in order to open another, so I believe we do need to limit the number of sessions that exist at once. That being said, it does seem to be working correctly regardless. I'll be looking into this further later today.
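
One way I might do that is with a semaphore around each search (untested; 50 is an arbitrary bound below the ~100 limit I'm seeing):

import asyncio

MAX_SESSIONS = 50  # arbitrary bound below the observed ~100 limit

sem = asyncio.Semaphore(MAX_SESSIONS)

async def bounded_search(scraper, dept):
    # Only MAX_SESSIONS searches (and thus open ClientSessions) run at once
    async with sem:
        return await scraper.search(dept)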