datasets / awesome-data

Curated list of quality open datasets
https://datahub.io/collections

The American Freshman #210

Closed: rufuspollock closed this issue 6 years ago

rufuspollock commented 7 years ago

Want to extract table from p.24 onwards: https://www.heri.ucla.edu/monographs/TheAmericanFreshman2014.pdf

Ultimately would like a time series across the years (can't find it on the website but it must be there).

Especially interested in p.44 and what freshmen consider important.

cf http://rufuspollock.com/2008/08/29/money-has-grown-in-importance-to-us-freshmen-since-the-60s/

Branko-Dj commented 6 years ago

I created a script that extracts the tables on pages 23-46, 67-74 and 78. The script requires the PyPDF2, tabula and pandas modules, as well as Java. It uses PyPDF2 to read the PDF file and split it into one-page documents, then uses tabula to convert the one-page PDFs that contain tables into csv files. Because the tables in the PDF are not aligned horizontally or vertically, tabula has trouble parsing them with complete accuracy, so some data wrangling is necessary. I have successfully parsed the important tables, except for "Table A1. 2014 CIRP Freshman Survey National Norms Sample and Population" on page 51, which comes out of the tabula conversion very badly formatted. In my opinion this table can't be parsed by this method and probably requires manual copy-paste wrangling.
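
For reference, the core of that pipeline looks roughly like this. This is a minimal sketch, not the published script: the page range and file names are illustrative, and the data-wrangling steps are omitted:

import tabula
from PyPDF2 import PdfFileReader, PdfFileWriter

# split the source PDF into one-page documents (illustrative page range)
reader = PdfFileReader(open("TheAmericanFreshman2014.pdf", "rb"))
for page in range(24, 47):
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page - 1))  # getPage() is 0-based
    with open("page_%d.pdf" % page, "wb") as out:
        writer.write(out)
    # tabula (which needs Java) converts the single-page table to csv
    tabula.convert_into("page_%d.pdf" % page, "page_%d.csv" % page,
                        output_format="csv", pages=1)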

I published the script in a git repository: https://github.com/Branko-Dj/the-american-freshman

zelima commented 6 years ago

@Branko-Dj Great work! A few comments to improve:

Note: by "the resource" I mean a normalized and tidy CSV file

Branko-Dj commented 6 years ago

@zelima Thank you for your feedback. I uploaded all the csv files and the datapackage that I created. A few issues I am fixing at the moment:

  1. Writing good instructions for running the script and for the necessary installations
  2. Updating the README file to provide better information about the data
  3. Trying to figure out a way to simplify running the scripts and to avoid installing third-party packages

The issue that I ran into:

datapackage-py won't parse the csv files page_38 - page_43 when creating a datapackage. Because of this I had to use the data-cli tool, which encountered no problems creating datapackage.json. This seems to be due to some encoding error.

zelima commented 6 years ago

@Branko-Dj

  1. Good (but don't overthink)
  2. Great
  3. This is not a must; if there's no other way, let's keep them. (I would not spend more than 30-60 minutes on that.)

re:

datapackage-py won't parse the csv files page_38 - page_43 when creating a datapackage. Because of this I had to use the data-cli tool, which encountered no problems creating datapackage.json. This seems to be due to some encoding error.

Not sure why you need datapackage-py here at all. Can you provide more info?

Branko-Dj commented 6 years ago

Not sure why you need datapackage-py here at all. Can you provide more info?

In order to create the datapackage I wanted to use the Python module datapackage-py, since it lets me add descriptions when creating the package. Here's a snippet of the code:

import glob
from datapackage import Package

package = Package()

# infer a resource schema for each csv file, in a stable order
csvPathList = sorted(glob.glob('*.csv'))
for csvFilePath in csvPathList:
    package.infer(csvFilePath)

But it reports the error described above: it simply won't infer pages 38-43. So instead I created datapackage.json using data-cli and added the descriptions afterwards with a script, modifyDatapackage.py, which imports the datapackage-py module. This works for all pages with no errors, but I would like to be able to create the datapackage without resorting to data-cli. It would be simpler and more elegant if the datapackage could be created by a single script, instead of first creating it with data-cli and then running a modification script.
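
For what it's worth, the workaround boils down to a load-edit-save step along these lines. This is a minimal sketch of what modifyDatapackage.py does, not the script itself; the description string is illustrative:

import json
from datapackage import Package

# load the datapackage.json that data-cli generated
package = Package('datapackage.json')
# add the metadata that data-cli does not fill in (illustrative value)
package.descriptor['description'] = 'Tables from The American Freshman 2014'
package.commit()  # apply the descriptor change
# write the updated descriptor back to disk
with open('datapackage.json', 'w') as f:
    json.dump(package.descriptor, f, indent=2)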

zelima commented 6 years ago

@Branko-Dj you can add the descriptions manually if that's taking too much of your time.

Note: You do not have to script everything, just the part that extracts the data from the PDF. You can create datapackage.json manually at any time. Those are just tools (datapackage-py, data-cli) that should help you with the packaging; if they are not helping you, it's better to ignore them, I think.

Branko-Dj commented 6 years ago

@zelima Ok, here are the changes I made:

  1. I changed the name of the repository. It is now more informative and its link is https://github.com/Branko-Dj/cirp-survey-of-freshmen
  2. I updated the README.md file. I believe it now describes the data appropriately.
  3. I now have only one script, called process.py, which extracts the data.
  4. I reduced the number of CSV files from 31 to 3 by concatenating the CSVs that had the same column names (roughly as sketched below). This makes things simpler.
  5. I also validated the package using data-cli.
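
The concatenation in point 4 amounts to something like this. A minimal sketch, assuming the per-page csv files share identical headers; the file names are illustrative:

import glob
import pandas as pd

# stack the per-page csv files that share a header into a single table
frames = [pd.read_csv(path) for path in sorted(glob.glob('page_*.csv'))]
pd.concat(frames, ignore_index=True).to_csv('combined.csv', index=False)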

I believe that this issue can be closed now.

zelima commented 6 years ago

@Branko-Dj Excellent work! A few more details and we can transfer ownership to the datasets org.

Branko-Dj commented 6 years ago

@zelima I updated everything. Is it good now? https://github.com/Branko-Dj/cirp-survey-of-freshmen

Branko-Dj commented 6 years ago

FIXED. The dataset is available at https://github.com/datasets/cirp-survey-of-freshmen. I also updated the registry by adding it to core-list.csv.

zelima commented 6 years ago

@Branko-Dj please include datahub links as well in the future.

Also available on datahub: https://datahub.io/core/cirp-survey-of-freshmen