daaronr / dr-rstuff

Helper files and functions David Reinstein uses across various data projects

Improve conversion of PowerPoints to md/Rmd formats and vice versa #5

Open daaronr opened 3 years ago

daaronr commented 3 years ago

try https://github.com/revan/pptx2md#powerpoint-to-markdown-converter?

oskasf commented 3 years ago

@daaronr By powerpoints I'm assuming that you mean the powerpoints in the data_acad_materials repo? Is the aim of the conversion to extract the text or to keep the slide structure? Not sure if you have used the xaringan package, but it seems to be a good way to have Rmd-flavoured slides. This blog post (https://timmastny.rbind.io/blog/embed-slides-knitr-blogdown/) shows a good way to embed such slideshows into blogdown, so I'm sure there would be some way to embed these into a bookdown!

daaronr commented 3 years ago

By powerpoints I'm assuming that you mean the powerpoints in the data_acad_materials repo?

Yes, that is the immediate goal

Is the aim of conversion to extract the text or to keep the slide structure?

First to simply extract the text and convert it to markdown format, to use in the bookdown.

Not sure if you have used the xaringan package but it seems to be a good way to have Rmd flavoured slides. This blog post https://timmastny.rbind.io/blog/embed-slides-knitr-blogdown/ shows a good way to embed such slideshows into blogdown so I'm sure there would be some way to embed these into a bookdown!

I've tried xaringan, will have another look. As you know I've mostly used reveal.js to do markdown-based html slides.

I'm trying to recall what the reason was I abandoned xaringan. Did it use a non-standard markdown syntax perhaps?

gerhardriener commented 3 years ago

I used slidex to convert to xaringan https://rdrr.io/github/datalorax/slidex

Xaringan has improved quite a bit over the last year. Like all presentation packages, it uses its own Markdown flavor, but this seems no worse than the others. Xaringan integrates fairly nicely into the RStudio ecosystem.

daaronr commented 3 years ago

slidex -- awesome!

I love markdown as you know, and Xaringan syntax does look pretty good. I think I had some trouble getting it to display local images; maybe that was the problem, but I guess it's fixed by now.

oskasf commented 3 years ago

I couldn't get slidex to work myself; it would extract all the images from each file but not the text.

Fortunately, there is a Python module for dealing with PowerPoints. I have created a script to do the necessary conversion here: https://github.com/daaronr/data_acad_materials/blob/a911e8fa62da409b922a068f43e87178cc6ee062/code/convert_ppt.py

daaronr commented 3 years ago

Well done Oska, thanks!
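
For readers following along, text extraction with the python-pptx module could look roughly like the sketch below. It is not the convert_ppt.py script linked above; the function names and the output layout (one "## Slide N" heading per slide, paragraphs as nested bullets) are assumptions made for illustration.

```python
def slide_to_markdown(slide, number):
    """Render the text of one slide as a Markdown heading plus nested bullets."""
    lines = [f"## Slide {number}", ""]
    for shape in slide.shapes:
        # Pictures, connectors, etc. carry no text frame and are skipped.
        if not shape.has_text_frame:
            continue
        for paragraph in shape.text_frame.paragraphs:
            text = paragraph.text.strip()
            if text:
                # paragraph.level is the PowerPoint indent level (0 = top level).
                lines.append("  " * paragraph.level + "- " + text)
    return "\n".join(lines)


def pptx_to_markdown(pptx_path):
    """Convert a whole deck; requires the third-party python-pptx package."""
    from pptx import Presentation  # installed with: pip install python-pptx

    deck = Presentation(str(pptx_path))
    return "\n\n".join(
        slide_to_markdown(slide, number)
        for number, slide in enumerate(deck.slides, start=1)
    )
```

Text inside tables or grouped shapes is not visited by this sketch, so some decks would need extra handling.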

daaronr commented 3 years ago

https://github.com/ssine/pptx2md also seems promising ... but I can't get it to work

daaronr commented 3 years ago

@oskasf did you do the conversion with the script? where did you put these?

daaronr commented 3 years ago

@oskasf did you do the conversion with the script? where did you put these?

Wait, I see it now (moving between two repos, sorry)

daaronr commented 3 years ago

More PowerPoints to convert, but I can't get your script to work, @oskasf. How can I run it? Note that I edited it for my file system. Of course, ideally one would use only relative folder references.

$ py convert_ppt.py
bash: py: command not found
$ Python convert_ppt.py
Traceback (most recent call last):
  File "convert_ppt.py", line 1, in <module>
    from pptx import Presentation
ImportError: No module named pptx
$ python convert_ppt.py
Traceback (most recent call last):
  File "convert_ppt.py", line 1, in <module>
    from pptx import Presentation
ImportError: No module named pptx
$

$ convert_ppt.py
bash: convert_ppt.py: command not found
$ python3 convert_ppt.py
Traceback (most recent call last):
  File "convert_ppt.py", line 7, in <module>
    pres = Presentation('other_content_notes/powerpoint/big_data_management.pptx')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/pkgreader.py", line 33, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/phys_pkg.py", line 32, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at 'other_content_notes/powerpoint/big_data_management.pptx'

oskasf commented 3 years ago

@daaronr You need to install the pptx module using pip install python-pptx. Note that your Python installation must be version 3.0 or later. You can check this by running the command Python3 in a terminal window. If it isn't installed, you can use Homebrew to install it.
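
In the session above, python fails on the import while python3 gets past it, which suggests the two commands point at different interpreters with different installed packages. A small check like the following, run with each command in turn, shows which interpreter is in use and whether it can see pptx. The helper name module_available is hypothetical.

```python
import importlib.util
import sys


def module_available(name):
    """Return True if the running interpreter can locate the named top-level module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Version:", ".".join(str(part) for part in sys.version_info[:3]))
    print("pptx importable:", module_available("pptx"))
```

Installing with python3 -m pip install python-pptx puts the package into the same interpreter that python3 runs, which avoids a mismatch between pip and the interpreter.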

daaronr commented 3 years ago

yes, I think I took all these steps; note that I used 'python3' in the code above. Wait, maybe the 'P' needs a capital?

daaronr commented 3 years ago

I don't think the case matters here. See what happened below...

$ pip install python-pptx
Requirement already satisfied: python-pptx in /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages (0.6.18)
Requirement already satisfied: lxml>=3.1.0 in /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages (from python-pptx) (4.6.2)
Requirement already satisfied: XlsxWriter>=0.5.7 in /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages (from python-pptx) (0.9.3)
Requirement already satisfied: Pillow>=3.3.2 in /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages (from python-pptx) (7.0.0)
WARNING: You are using pip version 20.0.1; however, version 20.3.3 is available.
You should consider upgrading via the '/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 -m pip install --upgrade pip' command.

$ Python3
Python 3.6.4 (v3.6.4:d48ecebad5, Dec 18 2017, 21:07:28)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> exit
Use exit() or Ctrl-D (i.e. EOF) to exit
>>> exit(
... )

$ Python3 convert_ppt.py
Traceback (most recent call last):
  File "convert_ppt.py", line 7, in <module>
    pres = Presentation('other_content_notes/powerpoint/big_data_management.pptx')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/pkgreader.py", line 33, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/phys_pkg.py", line 32, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at 'other_content_notes/powerpoint/big_data_management.pptx'

oskasf commented 3 years ago

Ah, I don't think I wrote this correctly for it to be run from the terminal (lack of absolute paths). If you open the file in RStudio you should be able to run it.
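
A relative path such as 'other_content_notes/powerpoint/big_data_management.pptx' is looked up from the current working directory, so the script only finds the file when it is started from the right folder. One common way around this, sketched here with a hypothetical helper resolve_beside, is to build paths from the location of the script itself:

```python
from pathlib import Path


def resolve_beside(script_file, relative_path):
    """Interpret relative_path relative to the folder that contains script_file."""
    return Path(script_file).resolve().parent / relative_path


# Inside convert_ppt.py this could be used as follows (assumed layout):
#   deck = resolve_beside(__file__, "other_content_notes/powerpoint/big_data_management.pptx")
#   if not deck.is_file():
#       raise SystemExit(f"No such file: {deck} (current directory: {Path.cwd()})")
```

With this approach the script can be started from any directory, as long as the PowerPoint folder keeps the same position relative to the script.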

daaronr commented 3 years ago

If you have time to do the conversions for me, that would be great. Otherwise I'll try it in RStudio.

oskasf commented 3 years ago

@daaronr It should be possible to run the file from the terminal now. Simply drag the Python script into the folder with the PowerPoints and execute it; it will create a new folder, conv, to put the converted files in. Here

daaronr commented 3 years ago

It is still throwing errors:

$ Python3 convert_ppt.py
Traceback (most recent call last):
  File "convert_ppt.py", line 19, in <module>
    prs = Presentation(eachfile)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/api.py", line 28, in Presentation
    presentation_part = Package.open(pptx).main_document_part
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/package.py", line 125, in open
    pkg_reader = PackageReader.from_file(pkg_file)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/pkgreader.py", line 33, in from_file
    phys_reader = PhysPkgReader(pkg_file)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pptx/opc/phys_pkg.py", line 32, in __new__
    raise PackageNotFoundError("Package not found at '%s'" % pkg_file)
pptx.exc.PackageNotFoundError: Package not found at '/Users/yosemite/githubs/data_acad_materials_gh_ver/other_content_notes/powerpoint/~$DS AI Training Outline v0.1.pptx'
$ pwd
/Users/yosemite/githubs/data_acad_materials_gh_ver/other_content_notes/powerpoint
$ ls

daaronr commented 3 years ago

@oskasf Did we ever solve this? I still cannot get it to run, same error as above.

daaronr commented 3 years ago

OK it is working for my purposes right now (moved file of interest to its own folder), but I'm still not sure what's going on in the error above.
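
A likely explanation for the error above: the file named in the traceback starts with '~$', the prefix Office uses for the temporary owner (lock) file it creates next to a document while that document is open. Such a file is not a real .pptx package, which is consistent with python-pptx reporting PackageNotFoundError when the script loops over every file in the folder. A minimal sketch of skipping these files, using a hypothetical helper convertible_decks that is not part of convert_ppt.py:

```python
from pathlib import Path


def convertible_decks(folder):
    """List the .pptx files in folder, leaving out Office '~$' lock files."""
    return sorted(
        path
        for path in Path(folder).glob("*.pptx")
        if not path.name.startswith("~$")
    )
```

Closing the presentation in PowerPoint before running the script should also make the lock file disappear.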

daaronr commented 3 years ago

@oskasf Can it be adapted to also incorporate the 'speaker notes' into the .md in some way? Thanks
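
python-pptx does expose speaker notes through slide.has_notes_slide and slide.notes_slide.notes_text_frame, so one possible adaptation is sketched below. The function names and the choice to render notes as a quoted block under each slide are assumptions, not something the existing script does.

```python
def notes_text(slide):
    """Return the speaker notes of a python-pptx slide, or '' if it has none."""
    # Checking has_notes_slide first avoids creating an empty notes slide,
    # which accessing slide.notes_slide would otherwise do.
    if not slide.has_notes_slide:
        return ""
    frame = slide.notes_slide.notes_text_frame
    return frame.text.strip() if frame is not None else ""


def notes_as_markdown(slide):
    """Format the speaker notes as a Markdown block quote, one line per note line."""
    text = notes_text(slide)
    if not text:
        return ""
    return "\n".join("> " + line for line in text.splitlines())
```

The returned block could be appended after each slide's text in the generated .md file.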

daaronr commented 3 years ago

We also seem to lose the images... any way to recover them?
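
Picture shapes in python-pptx carry an image object with the raw bytes (blob) and a file extension (ext), so the images can in principle be written out alongside the Markdown. A rough sketch, with a hypothetical helper save_images and an assumed file-naming scheme:

```python
from pathlib import Path


def save_images(slides, out_dir):
    """Write every picture found on the given slides into out_dir; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for slide_number, slide in enumerate(slides, start=1):
        image_count = 0
        for shape in slide.shapes:
            try:
                image = shape.image  # only picture shapes provide this
            except (AttributeError, ValueError):
                continue  # not a picture, or an empty picture placeholder
            image_count += 1
            target = out_dir / f"slide{slide_number}_img{image_count}.{image.ext}"
            target.write_bytes(image.blob)
            saved.append(target)
    return saved
```

With a real deck, slides would be Presentation(path).slides. Pictures nested inside grouped shapes are not visited here, and the Markdown would still need image links pointing at the saved files.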

oskasf commented 3 years ago

@daaronr Hm, perhaps it would be easier to just use Slidex, as this keeps images and writes a text file containing the speaker notes. I'm not sure it will be possible to fully automate this process, so it is likely best to just use Slidex and check the formatting for each file (I will start doing this now).