Course unit not being scraped

NIAEFEUP / uporto-schedule-scrapper

Python solution to extract the courses schedules from the different faculties of UPorto. Used to feed our timetable selection platform for students, TTS.

GNU General Public License v3.0


Course unit not being scraped #37

Closed: bernardobelchior closed this issue 1 year ago

bernardobelchior commented 6 years ago

This course unit isn't being scraped. I still don't know if the problem is with the scraper or with sigarra itself.

miguelpduarte commented 5 years ago

The updated CSS selector for acronyms ('#conteudoinner > table:first-of-type tr > td:last-child') is matching the value of the "Nível: 100" field (the "100" part) instead of the acronym.

Not sure if this is the problem but it's not correct anyway.

imnotteixeira commented 5 years ago

One possible solution is to select the element by referring to its sibling element, which always has the text "Sigla:". Maybe this can be done with an XPath selector instead of CSS only (perhaps with a regex?).

miguelpduarte commented 5 years ago

@imnotteixeira and I found an XPath selector that works on the linked course unit and also on the previous examples: //div[@id="conteudoinner"]/table[@class="formulario"][1]//td[text()="Sigla:"]/following-sibling::td[1]/text()

The xpath should replace the CSS selector here: https://github.com/NIAEFEUP/uporto-timetable-scrapper/blob/master/scrapper/scrapper/spiders/course_unit_spider.py#L110

Note: do not forget to call extract_first() on the result.

The line would be as such then:

acronym = response.xpath("//div[@id="conteudoinner"]/table[@class="formulario"][1]//td[text()="Sigla:"]/following-sibling::td[1]/text()").extract_first()

I will test this later today and update here with the results.
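The idea of anchoring on the "Sigla:" label rather than on cell position can be illustrated without Scrapy. The sketch below uses only the standard library's html.parser and is a stand-in for the XPath, not the project's code. The sample markup, the values AP1001 and AP, and the helper names are all hypothetical, since the real sigarra page layout is not shown in this thread.

```python
from html.parser import HTMLParser

# Hypothetical markup: an assumption made for illustration only.
SAMPLE_HTML = """
<div id="conteudoinner">
  <table class="formulario">
    <tr>
      <td>Código:</td><td>AP1001</td>
      <td>Sigla:</td><td>AP</td>
      <td>Nível:</td><td>100</td>
    </tr>
  </table>
</div>
"""


class RowCollector(HTMLParser):
    """Collects the text of every <td>, grouped by the <tr> it appears in."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html):
    """Return a list of rows, each a list of stripped cell texts."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]


def value_after_label(html, label):
    """Return the cell that follows the cell equal to `label` in the same row."""
    for row in parse_rows(html):
        for index, text in enumerate(row[:-1]):
            if text == label:
                return row[index + 1]
    return None


def last_cell_of_first_row(html):
    """Positional lookup, similar in spirit to a td:last-child selector."""
    rows = parse_rows(html)
    return rows[0][-1] if rows and rows[0] else None


print(last_cell_of_first_row(SAMPLE_HTML))       # 100
print(value_after_label(SAMPLE_HTML, "Sigla:"))  # AP
```

With this assumed layout, the positional lookup returns the "Nível" value, while the label-based lookup returns the acronym regardless of how many cells follow it.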

miguelpduarte commented 5 years ago

After switching this line out as stated above and running the scraper, it appears that a couple more course units were fetched, and many others now have the correct acronym.

However, although there is a row in the course_units table representing this course unit, with the foreign key pointing to the "Licenciatura em Arquitetura Paisagista" course (one of the courses that this course_unit belongs to), the course unit does not appear as an option in the front-end.

For the course unit to be associated with all the courses it belongs to, the table on the course unit page that lists the courses it is taught in may have to be parsed, as I have not yet found another way to do this.

I am now going to investigate why this option does not appear in the front-end despite the correct row existing in the database.

miguelpduarte commented 5 years ago

Update:

When issuing the correct query to the API: /courses/175/2018/2/units the unit is correctly listed.

~Therefore, this issue is probably on the side of the front-end that is not issuing the correct request.~ I was mistakenly selecting the wrong semester... Whoops! :sweat_smile:

So, the course unit is now being scraped correctly, and the only thing missing is a way to add a course unit to several courses, for units that are taught in more than one course.

imnotteixeira commented 5 years ago

Just letting you guys know that this selector was giving a syntax error because double quotes were used both around the Python string and inside the XPath expression. Instead, it should be:

acronym = response.xpath('//div[@id="conteudoinner"]/table[@class="formulario"][1]//td[text()="Sigla:"]/following-sibling::td[1]/text()').extract_first()
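The quoting problem can be checked without running the spider. The sketch below uses the built-in compile() to test whether each version of the line parses as Python. The `response` object is never needed because nothing is executed, and the helper name `is_valid_python` is hypothetical.

```python
# The XPath expression from this thread, which itself contains double quotes.
XPATH = (
    '//div[@id="conteudoinner"]/table[@class="formulario"][1]'
    '//td[text()="Sigla:"]/following-sibling::td[1]/text()'
)

# Source text of the two candidate lines: one wraps the XPath in double
# quotes (clashing with the quotes inside it), the other in single quotes.
broken_source = 'acronym = response.xpath("' + XPATH + '").extract_first()'
fixed_source = "acronym = response.xpath('" + XPATH + "').extract_first()"


def is_valid_python(source):
    """Return True if `source` parses as Python; nothing is executed."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


print(is_valid_python(broken_source))  # False
print(is_valid_python(fixed_source))   # True
```

Escaping the inner quotes with backslashes or using a triple-quoted string would also parse, but single outer quotes keep the XPath readable.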

On another note, I ran the scraper with this selector and the results were good: most of the "problematic" course_units were fetched, and some more schedules were fetched as well (this might not be related to the selector, though, and may depend only on when they were released in sigarra).

miguelpduarte commented 5 years ago

Good results all around then, presumably

Maybe this warrants making a PR? Any takers?