UChicago-CCA-2021 / Frequently-Asked-Questions

Repository to ask questions - please use the issues page to ask your questions.

How can we read PDF and Word documents saved locally? #9

Open acozzubo opened 3 years ago

acozzubo commented 3 years ago

Hello,

While doing HW1, I was wondering how we can read PDF and Word docs saved locally.

The examples in the code used online documents, and when I tried replacing the URLs with folder paths, it did not work.

Thanks!

P.S. Is this the right place to ask this question? Should we have a Piazza for this?

bhargavvader commented 3 years ago

Hello @acozzubo, the same packages we used also allow us to extract text from local files. Stack Overflow is your best friend for such questions:

1) Python-docx: https://stackoverflow.com/questions/25228106/how-to-extract-text-from-an-existing-docx-file-using-python-docx

2) pdfminer: https://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python
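A minimal sketch of what those two answers describe, assuming the python-docx package and the pdfminer.six distribution of pdfminer (its `pdfminer.high_level.extract_text` helper) are installed. The function names `read_docx`, `read_pdf`, and `read_local_document` are hypothetical helpers for illustration, not code from the course notebook:

```python
from pathlib import Path


def read_docx(path):
    """Return the text of a local .docx file, one paragraph per line."""
    # Imported here so that only the package you actually need must be installed.
    import docx  # provided by the python-docx package

    document = docx.Document(str(path))
    return "\n".join(paragraph.text for paragraph in document.paragraphs)


def read_pdf(path):
    """Return the text of a local PDF file."""
    # Assumes pdfminer.six, which ships the high-level extract_text helper.
    from pdfminer.high_level import extract_text

    return extract_text(str(path))


def read_local_document(path):
    """Pick a reader based on the file extension (hypothetical helper)."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".docx":
        return read_docx(path)
    if suffix == ".pdf":
        return read_pdf(path)
    raise ValueError(f"Unsupported file type: {suffix!r}")
```

The key difference from the URL-based examples is that no download step is needed: the readers take a path on disk directly, for example `read_local_document("my_folder/report.pdf")`, where the path is a placeholder for wherever your file lives.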

You can also repurpose the existing code in the Jupyter notebook a little bit to get it to work for local files. Hopefully this helps!

(And yes, this is the right place for such questions - we're not using Piazza for this course, all the content is on Canvas + GitHub!)

joshuabsilver commented 3 years ago

Similar question: I cannot get past the error message when using the Shakespeare corpus in HW1. Should we download these files locally and extract from there, or is there a way to pull them directly from the Jupyter notebook? Running this does not work: targetDir = 'Homework-Notebooks/data/Shakespeare'

bhargavvader commented 3 years ago

@joshuabsilver, when you cloned the Homework Notebooks repository, it should have downloaded the data (including the Shakespeare files). If not, you can manually download the data from the repository and try again. I am not sure what you mean by pulling directly from the Jupyter notebook, though. Could you elaborate on that? When you run the code with that data on your local machine, that command should work fine.
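One thing that may be worth checking, as a sketch only: a relative path such as the one above is resolved against the notebook's current working directory, so the same string can work in one location and fail in another. This thread does not confirm that this is the cause of the error. The helper name `describe_target_dir` is hypothetical:

```python
from pathlib import Path


def describe_target_dir(target_dir):
    """Report whether target_dir exists and what absolute path it resolves to."""
    path = Path(target_dir)
    resolved = path.resolve()  # absolute path, relative to the working directory
    if not path.is_dir():
        return f"missing: {resolved}"
    # Count only regular files directly inside the directory.
    file_count = sum(1 for entry in path.iterdir() if entry.is_file())
    return f"found {file_count} files in {resolved}"


# Assumed usage with the path from the question above.
print(describe_target_dir("Homework-Notebooks/data/Shakespeare"))
```

If the output starts with "missing", the printed absolute path shows where Python is actually looking, which can then be compared with where the cloned repository sits on disk.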