produce a single PDF of course site

djplaner commented 3 years ago

Produce a method to generate a single PDF from a given set of Content Interface pages

Early design

operation

In a single O365 shared folder

Spreadsheet that defines the list of pages (and order) to merge into PDFs
- Could have different sheets for different PDFs - the name of the sheet is the name of the merged PDF
sub-folder - temp holding space for downloaded PDFs
single merged PDFs go into the top level - shared public read only

The python script (or eventually other)

for each sheet in the spreadsheet
- for each page in sheet
  - Visit the page
  - use the download PDF button to save a PDF
- merge those downloaded PDFs into a single PDF
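
The loop above could be sketched roughly as follows. This is only an outline under assumptions: `sheets` stands in for the spreadsheet contents (sheet name mapped to an ordered list of page URLs), and `download_pdf` and `merge_pdfs` are hypothetical callables for the browser step and the merge step, neither of which is settled here. The `temp` sub-folder name is also an assumption.

```python
from pathlib import Path
from typing import Callable, Dict, List


def build_course_pdfs(
    sheets: Dict[str, List[str]],
    folder: Path,
    download_pdf: Callable[[str, Path], None],
    merge_pdfs: Callable[[List[Path], Path], None],
) -> List[Path]:
    """Produce one merged PDF per sheet; return the merged file paths."""
    temp = folder / "temp"  # assumed name for the holding sub-folder
    temp.mkdir(parents=True, exist_ok=True)
    merged_files = []
    for sheet_name, pages in sheets.items():
        downloaded = []
        for index, url in enumerate(pages):
            # Zero-padded index keeps the files in spreadsheet order.
            target = temp / f"{sheet_name}_{index:03d}.pdf"
            download_pdf(url, target)  # visit the page, save its PDF
            downloaded.append(target)
        merged = folder / f"{sheet_name}.pdf"  # sheet name = PDF name
        merge_pdfs(downloaded, merged)
        merged_files.append(merged)
    return merged_files
```

Passing the two steps in as callables keeps the ordering logic separate from whichever browser automation and PDF library end up being used.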

Process

[x] Does Python provide a way to merge separate PDFs?
- [x] test with CMM19
[ ] Implement full spreadsheet spider
[x] Add bookmarks for headings and title
- [ ] Provide some advice on how to view bookmarks
[x] Fix CI print facility - doesn't appear to be updating the Blackboard Content/Menu links
[x] Produce PDFs using Chrome to make sure all links are active

To do

[ ] Can spider be done semi-manually - using listContent.jsp links?

djplaner commented 3 years ago

Technology options

PyPDF2 - PdfFileMerger or this and this
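
A minimal sketch of the merge step with PyPDF2, assuming the downloaded files are already in the desired order. `merge_pdfs` and its `merger_factory` parameter are hypothetical names introduced for illustration. The class is called `PdfFileMerger` in older PyPDF2 releases and `PdfMerger` in later ones, so the sketch tries both. Which one is available depends on the installed version.

```python
from typing import Callable, Iterable, Optional


def merge_pdfs(
    paths: Iterable[str],
    output: str,
    merger_factory: Optional[Callable[[], object]] = None,
) -> None:
    """Append each PDF in order and write the combined file."""
    if merger_factory is None:
        # Imported lazily so the sketch loads without PyPDF2 installed.
        try:
            from PyPDF2 import PdfMerger as merger_factory  # later releases
        except ImportError:
            from PyPDF2 import PdfFileMerger as merger_factory  # older releases
    merger = merger_factory()
    try:
        for path in paths:
            merger.append(path)  # adds every page of this file
        merger.write(output)
    finally:
        merger.close()
```

For example, `merge_pdfs(["topic1.pdf", "topic2.pdf"], "CMM19.pdf")` would write the two files back to back (the file names here are placeholders).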

Reading content

The question is whether we can add additional bookmarks based on headings. To do this we need to be able to read the content of a page and determine whether to add a bookmark for that page.

PyPDF2 - can extract content of page, but not formatting.
PDFMiner.six appears able to extract particular elements, but requires more information about the nature of the elements in order to identify headings
- this page provides some help
this page outlines an approach using PyMuPDF

That last one has a function that parses a PDF document and generates a JSON data structure breaking the PDF up and identifying headings. Including the ones I'm after.

PyMuPDF also has functionality that generates ToC.
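
As a rough illustration of how extracted headings might feed a ToC, the sketch below turns a list of headings into PyMuPDF's `[level, title, page]` entries and writes them with `set_toc` (named `setToC` in older PyMuPDF releases). `build_toc`, `write_toc` and the `(font size, text, page)` tuple shape are assumptions made for this example, not the functions from the linked page. It also assumes larger fonts mean higher-level headings.

```python
from typing import List, Sequence, Tuple

Heading = Tuple[float, str, int]  # (font size, text, 1-based page number)


def build_toc(headings: Sequence[Heading]) -> List[list]:
    """Map font sizes to levels: the largest size becomes level 1."""
    sizes = sorted({size for size, _, _ in headings}, reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(sizes)}
    toc = []
    previous = 0
    for size, text, page in headings:
        # A ToC may not skip levels, so never go more than one deeper.
        level = min(level_of[size], previous + 1)
        toc.append([level, text, page])
        previous = level
    return toc


def write_toc(source: str, target: str, headings: Sequence[Heading]) -> None:
    """Attach the generated ToC to a copy of the PDF."""
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open(source)
    doc.set_toc(build_toc(headings))
    doc.save(target)
    doc.close()
```

The clamp to `previous + 1` is there because PyMuPDF rejects a ToC whose first entry is not level 1 or whose levels jump by more than one.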

djplaner commented 3 years ago

Different pages printing with different CSS.

The problem may be that Chrome has two separate ways to produce PDFs: Adobe and internal. NOPE

Topic 1 - no bookmarks
- h1 12.24 font
- css for print is com14_print.css
- font-size: 120%
Topic 3 - bookmarks
- h1 14.41 font
- css for print is com14_print.css
- font-size: 120%

djplaner commented 3 years ago

Handling long headings

[x] Complete

LHS34 study guide topic 5 has a long heading. The extract headings function is returning multiple headings for it, which throws out the ToC.

Probably because long headings are spread over several lines/spans rather than just one. Currently each line/span is creating its own heading.
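
One possible fix, sketched under assumptions: treat consecutive heading spans on the same page with the same font size as a single wrapped heading. `merge_heading_spans` is a hypothetical helper, and the span dictionaries (with `page`, `size` and `text` keys) are an assumed shape. The input is assumed to be already filtered down to heading spans in reading order.

```python
from typing import Dict, List


def merge_heading_spans(spans: List[Dict]) -> List[Dict]:
    """Join consecutive spans that share a page and font size."""
    merged: List[Dict] = []
    for span in spans:
        text = span["text"].strip()
        if not text:
            continue  # skip whitespace-only spans
        last = merged[-1] if merged else None
        if last and last["page"] == span["page"] and last["size"] == span["size"]:
            last["text"] += " " + text  # continuation of a wrapped heading
        else:
            merged.append({"page": span["page"], "size": span["size"], "text": text})
    return merged
```

A caveat with this approach: two genuinely separate headings of the same size with nothing between them would also be joined, so a check on vertical position may be needed.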

djplaner commented 3 years ago

Still issues with variable font sizes

[x] Complete

The LHS34 assessment PDF is being generated with different font sizes, which means that extract headings is having issues.

Either

[x] Solve the generation of PDFs with different font sizes - Not possible. It appears to happen because different content on the page influences how the "save as pdf" functionality sizes the fonts
[x] extractHeadings auto identifies the top X biggest fonts and uses those

Identifying font sizes

Different PDF files have different font sizes. But extractHeadings works on a completed document. Luckily the font sizes are uniquely strange - lots of decimal places.

This needs to be called on each individual file to make it easier to distinguish. Could even include the content for the chapter

headingFontSizes = extractChapterHeadingFontSizes( doc )
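
A sketch of what "the top X biggest fonts" could look like per file. `top_font_sizes` and `iter_spans` are hypothetical helpers and not the actual `extractChapterHeadingFontSizes` implementation. `iter_spans` assumes a PyMuPDF document, whose `page.get_text("dict")` output nests blocks, lines and spans, with each span carrying `size` and `text`.

```python
from typing import Dict, Iterable, Iterator, List


def iter_spans(doc) -> Iterator[Dict]:
    """Yield every text span in a PyMuPDF document."""
    for page in doc:
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):  # image blocks have no lines
                yield from line["spans"]


def top_font_sizes(spans: Iterable[Dict], count: int = 2) -> List[float]:
    """Return the `count` largest distinct font sizes, biggest first."""
    sizes = {round(span["size"], 2) for span in spans if span["text"].strip()}
    return sorted(sizes, reverse=True)[:count]
```

Called on each chapter file before merging, for example `top_font_sizes(iter_spans(doc))`, this gives the sizes to treat as headings for that file only. It assumes the largest text in a file really is heading text, which may not hold for pages with decorative or banner text.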

djplaner commented 3 years ago

Remove review status on messages

PDFs are generated with edit mode on. This adds some standard messages re: hidden, review status etc. Remove these.

djplaner / Content-Interface-Tweak