Hunter-Open-Source-Club / syllabi

Computer Science Syllabi Directory of Hunter College
http://syllabi.hunterosc.org/
MIT License

Feature Request: Create automated system for accepting URLs to non-PDF syllabi and converting to PDF #12

Open joshnatis opened 4 years ago

joshnatis commented 4 years ago

As of right now, we don't support syllabi that are in the form of links to websites. I actually think this is a good thing, because these syllabi are in a format that can change frequently (for example Saad clears his CS150 site every semester). PDF is definitely a superior format for the purposes of our project.

However, this means somebody has to either:

  1. manually convert each live syllabus to PDF and then submit it to us,
  2. just send us the link to the site and have us do the work of converting it, or
  3. accept defeat and not send us their syllabus cause they see we don't support links.

None of these are great alternatives.

So, I suggest we create some kind of process that lets users submit links, which are then automatically converted to PDF and submitted to this repo as a pull request. Maybe we can do this using GitHub Actions?

rvente commented 3 years ago

There are several candidates for us to look into. The clearest path forward that I see is using pandoc with GitHub Actions. I have extensive pandoc experience, but I'll have to do some digging to learn GitHub Actions. I will get back to you.

joshnatis commented 3 years ago

:O Thanks for coming back to this old issue, I completely forgot about it lol

I think that would work perfectly. I've actually used GitHub Actions for something recently so I can provide a basic annotated example (below). Also, here are some pandoc GitHub actions examples: https://github.com/pandoc/pandoc-action-example

name: Make Post

on:
  push: #run whenever something is pushed to the repo
  workflow_dispatch: #adds option to manually run the action from the Actions tab
  schedule: #run every 3 days at 11AM UTC time
    - cron: '0 11 */3 * *'

jobs:
  build:
    runs-on: ubuntu-latest #you have the choice of ubuntu, mac, and windows i think

    steps:
    - name: Checkout Repo
      uses: actions/checkout@v2  #checks out current repo
    - name: Create Post
      run: |-  #everything in here runs on the command line, do whatever you want
        cd programs
        ./post.sh

You basically just need to figure out how to get Pandoc installed in the environment Actions provides us with; besides that, it's all normal stuff.

joshnatis commented 3 years ago

Oh, and to install that (so to say), you place it in .github/workflows/whatever.yml.

joshnatis commented 3 years ago

Update: we can use the wkhtmltopdf command-line tool for this task.

To get it installed in the fresh environment, I think our best option is to download a pre-compiled binary from the wkhtmltopdf packaging releases page. Since (as far as I know) we don't know the type of architecture for the computer we're given, our best option may be to use the macos-latest virtual environment and download the macOS binary (cause there's only one architecture).

We can download the binary with curl:

$ curl -L https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6-2/wkhtmltox-0.12.6-2.macos-cocoa.pkg > wkhtmltopdf.pkg
$ # -L flag to follow redirects

Then we can install it to our home directory with installer:

$ installer -pkg wkhtmltopdf.pkg -target $HOME

Presumably after that we can just do wkhtmltopdf <link> <pdf>
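
As a rough illustration of that last step, here is a minimal Python sketch that wraps the `wkhtmltopdf <link> <pdf>` call. The function names `build_command` and `convert`, the http(s) check, and the 120-second timeout are all assumptions made for this example, not anything decided in this thread.

```python
import shutil
import subprocess


def build_command(url, pdf_path, binary="wkhtmltopdf"):
    """Return the argv list for `wkhtmltopdf <link> <pdf>` (hypothetical helper)."""
    # ASSUMPTION: only http(s) links are accepted as submissions.
    if not url.startswith(("http://", "https://")):
        raise ValueError(f"not an http(s) URL: {url!r}")
    return [binary, url, pdf_path]


def convert(url, pdf_path):
    """Run wkhtmltopdf; raises CalledProcessError on a non-zero exit code."""
    if shutil.which("wkhtmltopdf") is None:
        raise FileNotFoundError("wkhtmltopdf is not on PATH")
    # ASSUMPTION: 120 seconds is an arbitrary upper bound for one page.
    subprocess.run(build_command(url, pdf_path), check=True, timeout=120)
```

Passing the arguments as a list (rather than building a shell string) means a submitted URL is never interpreted by the shell, which seems worth having if the links come from pull requests.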


Question: what should the link-submission process look like for a contributor? What would trigger this action, and how would the contributor communicate the URL and course information (course code, professor, semester) to us? One solution would be:

  1. Create a branch for link submissions, e.g. links
  2. Run the action when a pull request is made to links
  3. Submissions are made by having the contributor make a pull request to links with the required information provided in some pre-determined format (e.g. JSON or other).

So what we have so far is:

name: Link Submission to PDF

on:
  pull_request:
    branches:
      - links

jobs:
  build:
    runs-on: macos-latest

    steps:
    - name: Checkout Repo
      uses: actions/checkout@v2
    - name: Convert Link to PDF
      run: |-
        curl -L https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6-2/wkhtmltox-0.12.6-2.macos-cocoa.pkg > wkhtmltopdf.pkg
        installer -pkg wkhtmltopdf.pkg -target $HOME
        ./linktopdf
    - name: Commit and push if changed
      run: |-
        git pull
        git add .
        git diff
        git config --global user.email "torvalds@linux-foundation.org"
        git config --global user.name "torvalds"
        git commit -m "Uploaded new syllabus" || echo "No changes to commit"
        git push

Here, linktopdf is a script that would read and validate the contributor's submission (presumably in some pre-determined file), call wkhtmltopdf on the link, update _data/map.json, create a new directory in assets/courses/ if necessary, and move the PDF into its appropriate place.
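
The filing part of such a script might look roughly like the Python sketch below. Big caveat: the layout of _data/map.json and the file naming under assets/courses/ are NOT specified in this thread, so the structure used here (course code mapped to a list of entries, files named `<semester>-<professor>.pdf`) is purely a guess for illustration, and `slug` and `file_pdf` are hypothetical names.

```python
import json
import re
import shutil
from pathlib import Path


def slug(text):
    """Lowercase and replace runs of non-alphanumerics with '-'."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def file_pdf(repo, pdf, coursecode, professor, semester):
    """Move pdf under assets/courses/ and record it in _data/map.json."""
    repo = Path(repo)
    # ASSUMPTION: one directory per course, file named <semester>-<professor>.pdf.
    dest_dir = repo / "assets" / "courses" / slug(coursecode)
    dest_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    dest = dest_dir / f"{slug(semester)}-{slug(professor)}.pdf"
    shutil.move(str(pdf), str(dest))

    # ASSUMPTION: map.json is an object mapping course code -> list of entries.
    map_path = repo / "_data" / "map.json"
    mapping = json.loads(map_path.read_text()) if map_path.exists() else {}
    mapping.setdefault(coursecode, []).append({
        "professor": professor,
        "semester": semester,
        "path": dest.relative_to(repo).as_posix(),
    })
    map_path.parent.mkdir(parents=True, exist_ok=True)
    map_path.write_text(json.dumps(mapping, indent=2) + "\n")
    return dest
```

The real script would need to match whatever structure map.json actually has; only the overall flow (make directory, move file, update the map) comes from the description above.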

What is left to be decided is the format of the file submitted by the contributor. It could be a self-explanatory JSON file with fields like url, professor, semester, and coursecode, and then we can write the script in Python (installed by default on macOS) using the json module.
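
To make the idea concrete, a submission check could be sketched like this. The four field names come from the paragraph above; everything else (the `parse_submission` name, requiring every field to be a non-empty string, the http(s) check) is an assumption for the sake of the example.

```python
import json

# Field names taken from the proposal above.
REQUIRED_FIELDS = ("url", "professor", "semester", "coursecode")


def parse_submission(text):
    """Parse a JSON submission and return a dict of the required fields."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("submission must be a JSON object")
    # ASSUMPTION: every field must be present as a non-empty string.
    missing = [
        f for f in REQUIRED_FIELDS
        if not isinstance(data.get(f), str) or not data[f].strip()
    ]
    if missing:
        raise ValueError("missing or empty fields: " + ", ".join(missing))
    # ASSUMPTION: only http(s) links are accepted.
    if not data["url"].strip().startswith(("http://", "https://")):
        raise ValueError("url must start with http:// or https://")
    return {f: data[f].strip() for f in REQUIRED_FIELDS}
```

A contributor's file would then be something like `{"url": "https://example.com/cs150", "professor": "Saad", "semester": "Fall 2021", "coursecode": "CSCI 150"}` (made-up values).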

This is kind of getting to be a lot of work for something that may be harder for a contributor to do than just using an external HTML-to-PDF converter service or built-in browser functionality.

rvente commented 3 years ago

Ah I see. I'll continue thinking about this. If anything comes up that you think I could contribute to, let me know. I'm happy to commit some time to it as the semester comes to a close.

rvente commented 9 months ago

Maybe one potential blocker is that we don't have many links to work with, so it's challenging to evaluate solutions precisely. Put another way, the problem seems a tad underspecified, so starting with link collection might be a good next step; it would at least give us a starting point. If we agree, I could add that to the README. We can start by converting links case by case, and then automate it.