MyMICDS / MyMICDS-v2

Back-end REST API for MyMICDS.net
https://mymicds.net
GNU Affero General Public License v3.0

Parse the Daily Bulletin for an Assortment of Features! #39

Open michaelgira23 opened 7 years ago

michaelgira23 commented 7 years ago

The Daily Bulletin is an email that the entire Upper School receives every day before school. The bulletin is a PDF that contains the day's schedule, news from around the school, the lunch menu, birthdays, and more. On MyMICDS, we query my email and automatically download the Daily Bulletin to put on the site. While this is useful for people who don't organize their email or don't want to log into it, we could leverage the bulletin even more.

It is possible to parse and extract information from the PDF, which opens up a lot of possibilities. Here are a few:

Possibilities

Special Days

You know why there are so many people who wear blazers after formal dress day? Because that's their punishment for forgetting. The header/title of the bulletin will usually say whether or not it's formal dress. We can then have an email notification system remind people when it's formal dress the next day. The bulletin also lists other special days besides formal dress.
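A minimal sketch of the header check, assuming the header text has already been extracted from the PDF as a string. The function name, the keyword list, and the sample header wording are all hypothetical; real bulletins may phrase this differently.

```typescript
// Hypothetical keyword table: only "formal dress" comes from the issue text.
// Other special days would be added here once we know how the bulletin words them.
interface SpecialDay {
  label: string;
  pattern: RegExp;
}

const SPECIAL_DAYS: SpecialDay[] = [
  { label: 'Formal Dress', pattern: /formal\s+dress/i },
];

// Returns the labels of every special day mentioned in the bulletin header.
function detectSpecialDays(headerText: string): string[] {
  // Collapse whitespace so stray line breaks in the header don't break matching.
  const normalized = headerText.replace(/\s+/g, ' ').trim();
  return SPECIAL_DAYS
    .filter(day => day.pattern.test(normalized))
    .map(day => day.label);
}
```

The notification system would then only need to run this against tomorrow's bulletin header and send the reminder if the result is non-empty.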

Parse Announcements

We can get the announcements from the bulletin and add an announcement module in our upcoming modules system. By extracting the text ourselves, we can style the text and integrate it smoothly into our interface and add our own announcements.

Field Trips / Early Dismissal

We could also separate the field trips and early dismissals from the regular announcements. Bonus points if we can highlight which ones are relevant to the user.

Parse Birthdays

Wish our fellow students (and teachers) a happy birthday. Preferably also change their background to this gif I made a long time ago in v1 for Alexander's birthday. However, this works as well.

Lunch

Lunch isn't too terribly important because we already get data from the school lunch website, but it wouldn't hurt to have a point of redundancy to fall back to in case the lunch website is down or something.

Schedule

This is probably the most ambitious out of all of the possibilities, but if we're successful, it could be one of the most useful. If we're able to parse the schedule in the Daily Bulletin, then we can have more redundancy and rely less on the Portal. Currently, if the Portal goes down, then MyMICDS is screwed when it comes to displaying the schedule, which is one of the main features of the site. Also, the bulletin sometimes has a more detailed schedule (usually special assemblies or activities are just labelled "Advisory").

What makes the schedule so complex is that different grades/classes can have different things at the same time. For example, on Day 1, Science/Art/Math classes have lunch first and class second, while other classes have class first and lunch second. Special schedules are hard in general, and we'd have to parse keywords to determine which demographic each entry in the schedule belongs to.
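Here is a rough sketch of what that keyword matching could look like, assuming each schedule entry is available as one line of text. The grade keywords, the department list, and the `Audience` shape are assumptions for illustration, not something taken from real bulletins.

```typescript
// Who a schedule entry applies to. Empty arrays mean "no restriction found",
// i.e. the entry is treated as applying to everyone.
interface Audience {
  grades: number[];
  departments: string[];
}

// Hypothetical keyword tables; real bulletins may use different wording.
const GRADE_KEYWORDS: { [word: string]: number } = {
  freshman: 9, freshmen: 9,
  sophomore: 10, sophomores: 10,
  junior: 11, juniors: 11,
  senior: 12, seniors: 12,
};
const DEPARTMENTS = ['Science', 'Art', 'Math'];

function audienceFor(entry: string): Audience {
  // Split on anything that isn't a letter or digit, so "Science/Art/Math" works.
  const words = entry.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);

  const grades: number[] = [];
  for (const word of words) {
    if (Object.prototype.hasOwnProperty.call(GRADE_KEYWORDS, word)) {
      const grade = GRADE_KEYWORDS[word];
      if (grades.indexOf(grade) === -1) grades.push(grade);
    }
  }

  const departments = DEPARTMENTS.filter(
    dept => words.indexOf(dept.toLowerCase()) !== -1
  );

  return { grades, departments };
}
```

An entry like "Science/Art/Math: first lunch" would come back restricted to those three departments, and we could then compare that against the classes in a user's schedule.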

Clubs

We can find out which clubs are meeting in which room with which teacher. We could compile a list of all the clubs and have users select any they are in. We can add notifications if their club is meeting, and even insert it into their schedule automatically.

Challenges

While all of these things sound awesome, it ain't easy.

As far as I know, the Daily Bulletin is made manually, by humans. It looks visually similar from day to day, but when attempting this task last year in v1, I noticed several inconsistencies (1 line break separating the announcements instead of 2, etc.). Text is messy, so it's going to be hard to parse. We'd have to expect that anything could change, and compensate for it. It will be important to look at all the previously archived bulletins (since my freshman year!) and make sure every single one of them parses correctly.

Solutions

I attempted to parse the Daily Bulletin back in MyMICDS-v1 (php/parse_bulletin.php) but didn't have much success. I used a PDF -> Text converter, which meant I could only work with a plain string. However, the Daily Bulletin uses different styles like bold/underlined text, center alignment, etc. that can't be represented in a simple string. Headers and titles were very hard to distinguish without that styling. That's why I'd recommend using a library like pdf2json, which gives a lot more data to work with.
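To show why the extra style data helps, here is a sketch that picks out bold lines as likely headers. It assumes the output shape pdf2json documents, where each text block has runs `R` with URI-encoded text `T` and a style array `TS` of [font face id, font size, bold flag, italic flag]. The interfaces below are hand-written for this example and should be checked against whatever version we install; `findBoldLines` is a hypothetical helper.

```typescript
// Subset of the pdf2json text-block shape this sketch relies on (assumed).
interface TextRun {
  T: string;                             // URI-encoded text
  TS: [number, number, number, number];  // [fontFaceId, fontSize, bold, italic]
}
interface TextBlock {
  x: number;
  y: number;
  R: TextRun[];
}

// Returns the decoded text of every block whose runs are all bold.
// In a plain-string conversion this information is lost entirely.
function findBoldLines(texts: TextBlock[]): string[] {
  return texts
    .filter(block => block.R.length > 0 && block.R.every(run => run.TS[2] === 1))
    .map(block => block.R.map(run => decodeURIComponent(run.T)).join('').trim());
}
```

The same idea extends to font size or the `x` position for centered titles, which should make section boundaries easier to find than counting line breaks.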

michaelgira23 commented 6 years ago

If we continue with parsing the bulletin, an anomaly to consider is the bulletin from September 6th, 2018, because of its wacky schedule: https://mymicds.net/daily-bulletin/2018-09-06

nickbclifford commented 5 years ago

So, here's the thing. When Ms. O'Brien took over as assistant for Mr. Calise this year, the Daily Bulletin format was changed completely. Now, some things are inserted as images and not everything is displayed as normal text anymore. We might still be able to do some sort of parsing, but it'll certainly be significantly more difficult. Might be something to consider tackling during a long break, but certainly not a priority.