Google Docs and Google Slides importer

jkomoros commented 1 year ago

Ideally it would be possible to enumerate some Google Docs and Google Slides you own and have it import the content.

I'd love to have for example https://komoroske.com/gardening-platforms and https://komoroske.com/slime-mold in it.

For slides, it can just select any text runs and also speaker notes.

For docs it should be straightforward.

jkomoros commented 1 year ago

I asked GPT what the code should be:

// Collects the text of every shape, table cell, and speaker-notes box in the
// active presentation, then logs it.
function extractTextFromSlides() {
  var slides = SlidesApp.getActivePresentation().getSlides();
  var text = "";

  for (var i = 0; i < slides.length; i++) {
    var slide = slides[i];
    var elements = slide.getPageElements();

    for (var j = 0; j < elements.length; j++) {
      var element = elements[j];
      var type = element.getPageElementType();

      if (type == SlidesApp.PageElementType.SHAPE) {
        // Shape.getText() returns a TextRange; asString() yields its raw text,
        // which already covers every paragraph and run in the shape.
        text += element.asShape().getText().asString();
      } else if (type == SlidesApp.PageElementType.TABLE) {
        var table = element.asTable();

        // Table has no getRows(); walk the cells by row and column index.
        for (var r = 0; r < table.getNumRows(); r++) {
          for (var c = 0; c < table.getNumColumns(); c++) {
            var cell = table.getCell(r, c);

            // The text of a merged region belongs to its head cell, so skip
            // the cells that were merged into it.
            if (cell.getMergeState() == SlidesApp.CellMergeState.MERGED) {
              continue;
            }
            text += cell.getText().asString();
          }
        }
      }
    }

    // Speaker notes live in a shape on the slide's notes page. The shape's
    // text is a TextRange, so read it with asString() rather than getText().
    text += slide.getNotesPage().getSpeakerNotesShape().getText().asString();
  }

  Logger.log(text);
}
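
The function above only runs inside Apps Script. If the importer instead fetches a presentation as JSON from the Slides REST API, the same walk might look roughly like the sketch below. The names textOf and extractSlidesText are made up for illustration, and the field names (pageElements, textElements, textRun, slideProperties.notesPage, speakerNotesObjectId) are my reading of the presentations.get response shape, so treat them as assumptions to verify.

```javascript
// Sketch only: `presentation` is assumed to be the parsed JSON returned by
// the Slides REST API's presentations.get. Helper names are hypothetical.

// Joins the textRun contents of anything carrying a `text` field
// (shapes and table cells both do in the assumed response shape).
function textOf(holder) {
  const elements = (holder && holder.text && holder.text.textElements) || [];
  return elements
    .map((e) => (e.textRun && e.textRun.content) || "")
    .join("");
}

function extractSlidesText(presentation) {
  let text = "";
  for (const slide of presentation.slides || []) {
    for (const element of slide.pageElements || []) {
      if (element.shape) {
        text += textOf(element.shape);
      } else if (element.table) {
        for (const row of element.table.tableRows || []) {
          for (const cell of row.tableCells || []) {
            text += textOf(cell);
          }
        }
      }
    }

    // Speaker notes: find the shape on the notes page whose objectId matches
    // notesProperties.speakerNotesObjectId.
    const notesPage = slide.slideProperties && slide.slideProperties.notesPage;
    const notesId =
      notesPage &&
      notesPage.notesProperties &&
      notesPage.notesProperties.speakerNotesObjectId;
    if (notesId) {
      for (const element of notesPage.pageElements || []) {
        if (element.objectId === notesId) text += textOf(element.shape);
      }
    }
  }
  return text;
}
```

Images, groups, and other element types are ignored here, which matches the "text runs plus speaker notes" scope described earlier in this issue.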

dglazkov commented 1 year ago

Ain't bad.

dglazkov commented 1 year ago

I wonder if this might be a good approach: https://developers.google.com/docs/api/samples/extract-text#python
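
For reference, the idea in that sample is to walk the document's structural elements and concatenate every text run, recursing into tables and the table of contents. A rough JavaScript sketch of that walk is below. It assumes `doc` is the parsed JSON from the Docs API's documents.get, and the function names readStructuralElements and extractDocText are placeholders I picked, not anything from the linked sample.

```javascript
// Sketch only: flattens the text of a Docs API document (parsed JSON).
// Field names follow the assumed documents.get response shape.
function readStructuralElements(elements) {
  let text = "";
  for (const element of elements || []) {
    if (element.paragraph) {
      // A paragraph is a list of elements; only text runs carry content.
      for (const part of element.paragraph.elements || []) {
        if (part.textRun && part.textRun.content) {
          text += part.textRun.content;
        }
      }
    } else if (element.table) {
      // Table cells hold their own list of structural elements, so recurse.
      for (const row of element.table.tableRows || []) {
        for (const cell of row.tableCells || []) {
          text += readStructuralElements(cell.content);
        }
      }
    } else if (element.tableOfContents) {
      text += readStructuralElements(element.tableOfContents.content);
    }
  }
  return text;
}

function extractDocText(doc) {
  return readStructuralElements(doc.body && doc.body.content);
}
```

Fetching the JSON (auth, the documents.get call itself) is left out on purpose, since that part depends on how the importer handles credentials.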

jkomoros commented 1 year ago

[x] Document the importer in README
[ ] Add a Google Slides importer
[ ] Support importing all files within a Drive folder
[ ] Add an option to output only docs that anyone with the link can view, rather than docs that require an ACL entry (similar to the Medium importer's --medium-include={all, draft, published}). In the future, maybe allow specifying exactly which userID must have an ACL entry for a doc to be output?
[ ] Chunk doc content based on headings. Use the deep-link URL for each heading, so each chunk links directly to the section its content came from
[ ] Is there a way to accept developer information from CLI to create credentials.SECRET.json on first run?
[ ] Move credentials/tokens.SECRET.json to a name specific to Google
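
For the heading-based chunking item above, one possible shape is sketched below. It assumes the Docs API JSON marks headings with paragraphStyle.namedStyleType values starting with "HEADING_" plus a paragraphStyle.headingId, and that appending "#heading=" and that ID to the doc's edit URL gives a working deep link. The name chunkByHeadings and the { url, text } chunk shape are made up for illustration, and tables are skipped to keep it short.

```javascript
// Sketch only: splits a Docs API document (parsed JSON) into chunks that each
// start at a heading and carry a deep link to that heading.
function chunkByHeadings(doc, documentId) {
  const baseUrl = "https://docs.google.com/document/d/" + documentId + "/edit";
  const chunks = [];
  // Content before the first heading links to the doc itself.
  let current = { url: baseUrl, text: "" };

  for (const element of (doc.body && doc.body.content) || []) {
    const paragraph = element.paragraph;
    if (!paragraph) continue; // tables, section breaks, etc. are skipped here

    const style = paragraph.paragraphStyle || {};
    const text = (paragraph.elements || [])
      .map((part) => (part.textRun && part.textRun.content) || "")
      .join("");

    const isHeading =
      (style.namedStyleType || "").startsWith("HEADING_") &&
      Boolean(style.headingId);

    if (isHeading) {
      // Close the previous chunk (if it has any content) and start a new one
      // whose URL points at this heading.
      if (current.text.trim()) chunks.push(current);
      current = { url: baseUrl + "#heading=" + style.headingId, text: "" };
    }
    current.text += text;
  }

  if (current.text.trim()) chunks.push(current);
  return chunks;
}
```

The heading text stays inside its own chunk, so each chunk reads as a titled section when used on its own.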

dglazkov / polymath

Google Docs and Google Slides importer #39