Closed: rufuspollock closed this issue 6 years ago
I created a script that extracts the tables on pages 23-46, 67-74 and 78. It requires the PyPDF2, tabula and pandas modules, as well as Java. The script uses PyPDF2 to read the PDF file and split it into one-page documents, then uses tabula to convert the one-page PDFs that contain tables into CSV files. Because the tables in the PDF are not aligned horizontally and vertically, tabula cannot parse them with complete accuracy, so some data wrangling is necessary. I have successfully parsed the important tables except for "Table A1. 2014 CIRP Freshman Survey National Norms Sample and Population" on page 51, which comes out of the tabula conversion very badly formatted. In my opinion this table can't be parsed by this method and probably requires manual copy-paste wrangling.
I published the script in a git repository: https://github.com/Branko-Dj/the-american-freshman
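The split-then-convert workflow described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the script from the repository: the function names and the `page_N.pdf` / `page_N.csv` naming are hypothetical, the `PdfReader` / `PdfWriter` classes assume a recent PyPDF2 release, and tabula (the tabula-py package) needs Java available on the PATH.

```python
from pathlib import Path


def expand_pages(spec):
    """Expand a spec such as "23-46,67-74,78" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = (int(p) for p in part.split("-", 1))
            pages.update(range(first, last + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


def split_pages(pdf_path, pages, out_dir):
    """Write each requested page (1-based) to its own one-page PDF."""
    # Imported lazily so the page-range helper works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter  # assumption: PyPDF2 >= 2.0

    reader = PdfReader(str(pdf_path))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number in pages:
        writer = PdfWriter()
        writer.add_page(reader.pages[number - 1])  # reader.pages is 0-indexed
        target = out_dir / f"page_{number}.pdf"
        with open(target, "wb") as handle:
            writer.write(handle)
        written.append(target)
    return written


def page_to_csv(page_pdf):
    """Convert a one-page PDF to a CSV file with tabula-py (requires Java)."""
    import tabula  # provided by the tabula-py package

    target = Path(page_pdf).with_suffix(".csv")
    tabula.convert_into(str(page_pdf), str(target), output_format="csv", pages="all")
    return target
```

A typical run would call `expand_pages("23-46,67-74,78")`, pass the result to `split_pages(...)` together with the downloaded PDF, and then call `page_to_csv(...)` on each one-page file. The CSV output would still need the manual wrangling mentioned above.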
@Branko-Dj Great work! A few comments to improve:
├── archive
│   ├── name-of-original-file.pdf
│   ├── name-of-another-original-file.pdf
│   └── ...
├── data
│   ├── name-of-your-first-resource.csv
│   ├── name-of-your-second-resource.csv
│   └── ...
├── datapackage.json
├── README.md
└── scripts
    ├── Makefile
    └── process.py
Note: by "resource" I mean a normalized and tidy CSV file.
"The American Freshman" is probably not the best name for the dataset (it says almost nothing to me).
@zelima Thank you for your feedback. I uploaded all the CSV files and a datapackage that I created. A few issues I am fixing at the moment:
The issue that I ran into:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 344: character maps to <undefined>
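One plausible cause (an assumption, not something confirmed in this thread) is that the CSV files are UTF-8 while the reader falls back to a Windows code page such as cp1252, in which byte 0x9d has no mapping. The sketch below reproduces that failure with a made-up sample and then reads the file by trying explicit encodings in order. `read_csv_text` is a hypothetical helper, not part of datapackage-py.

```python
import tempfile
from pathlib import Path


def read_csv_text(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding), trying each candidate encoding in order."""
    raw = Path(path).read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} can decode {path}")


# Made-up sample: a right double quotation mark is the bytes E2 80 9D in UTF-8.
sample = "label,value\n\u201cquoted\u201d,1\n".encode("utf-8")

try:
    sample.decode("cp1252")  # what a cp1252 default would attempt
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True  # byte 0x9d is undefined in cp1252

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "page_38.csv"
    csv_path.write_bytes(sample)
    text, used = read_csv_text(csv_path)
```

If this is indeed the cause, re-saving the affected CSV files as UTF-8, or making sure they are opened with an explicit encoding, may be enough to get past the error.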
@Branko-Dj re: "datapackage-py won't parse csv files page_38 - page_43 when creating a datapackage. Because of this I had to use data-cli tool which encountered no problems to create datapackage.json. This seems to be due to some encoding error."
I'm not sure why you need datapackage-py here at all. Can you provide more info?
@zelima re: "Not sure why do you need datapackage-py here at all. Can you provide more info?"
To create the datapackage I wanted to use the Python module datapackage-py, because it lets me add a description when creating a package. Here's a snippet of the code:
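import glob
from datapackage import Package  # datapackage-py
package = Package()  # assumed setup; not shown in the original snippet
csvPathList = sorted(glob.glob('*.csv'))  # CSV files in a stable order
for csvFilePath in csvPathList:
    package.infer(csvFilePath)  # fails for the page_38 - page_43 files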
But it reports the error described above: it simply won't infer pages 38-43. So instead I created datapackage.json using data-cli and added the description afterwards using the script modifyDatapackage.py, which imports the datapackage-py module. This works for all pages with no errors, but I would like to be able to create a datapackage without resorting to data-cli. It seems to me that it would be simpler and more elegant to create the datapackage with a single script, rather than first creating it with data-cli and then running a modification script.
@Branko-Dj you can add the description manually if that's taking too much of your time.
Note: You do not have to script everything (just the part that extracts data from the PDF). You can create datapackage.json manually at any time. These tools (datapackage-py, data-cli) are only meant to help you with packaging. If they are not helping you, I think it is better to ignore them.
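Since datapackage.json is plain JSON, it can also be written without either tool. The following is a minimal sketch using only the standard library. The `build_descriptor` helper, the placeholder name, title and description, and the choice of fields (`name`, `title`, `description`, `resources`) are illustrative assumptions, not the descriptor actually used for this dataset.

```python
import json
from pathlib import Path


def build_descriptor(name, title, description, csv_paths):
    """Build a minimal Data Package descriptor as a plain dict."""
    resources = []
    for csv_path in sorted(csv_paths):
        path = Path(csv_path)
        resources.append({
            # Resource names use "-" rather than "_" here.
            "name": path.stem.lower().replace("_", "-"),
            "path": path.as_posix(),
            "format": "csv",
        })
    return {
        "name": name,
        "title": title,
        "description": description,
        "resources": resources,
    }


descriptor = build_descriptor(
    name="example-dataset",                             # placeholder
    title="Example Dataset",                            # placeholder
    description="Tables extracted from a PDF report.",  # placeholder
    csv_paths=["data/page_24.csv", "data/page_23.csv"],
)
descriptor_json = json.dumps(descriptor, indent=2)
# To save it: Path("datapackage.json").write_text(descriptor_json, encoding="utf-8")
```

Keeping the descriptor as a plain dict means the description can be set in the same script that extracts the tables, with no second modification pass.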
@zelima Ok here are the changes I made:
I believe that this issue can be closed now.
@Branko-Dj Excellent work! A few more details and we can transfer ownership to the datasets org.
pip install pandas
pip install other-third-party-lib
...
freshmen_survey.csv
"CIRP Freshmen Survey" looks better than "cirp-freshmen-survey" (in datapackage.json)
Use - instead of _: freshmen-survey instead of freshmen_survey
https://github.com/Branko-Dj/cirp-survey-of-freshmen/blob/master/datapackage.json#L8
@zelima I updated everything. Is it good now? https://github.com/Branko-Dj/cirp-survey-of-freshmen
FIXED. The dataset is available on datahub (https://github.com/datasets/cirp-survey-of-freshmen), and I also updated the registry by adding it to core-list.csv.
@Branko-Dj please include datahub links as well in future.
Also available on datahub: https://datahub.io/core/cirp-survey-of-freshmen
Want to extract table from p.24 onwards: https://www.heri.ucla.edu/monographs/TheAmericanFreshman2014.pdf
Ultimately would like time series of years (can't find on website but must be there).
Especially interested in p.44 and what freshmen think are important.
cf http://rufuspollock.com/2008/08/29/money-has-grown-in-importance-to-us-freshmen-since-the-60s/