cindyheqy / Political_Polarization

gauge political polarization in the U.S. by analyzing legislative data

Speech data source #2

Open ngbdsb opened 1 year ago

ngbdsb commented 1 year ago

https://www.congress.gov/congressional-record/117th-congress/browse-by-date (PDF)
https://heinonline-org.proxy.uchicago.edu/HOL/Index?index=congrec/crd&collection=congrec

ngbdsb commented 1 year ago

https://api.govinfo.gov/docs/ key: [redacted]
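
For reference, here is a minimal sketch of how a request URL for one daily issue might be built. It assumes the `/packages/{packageId}/summary` endpoint described in the govinfo docs and package IDs of the form `CREC-YYYY-MM-DD` (matching the file names used in this thread). The helper names and the `GOVINFO_API_KEY` environment variable are hypothetical, chosen so the key never has to be pasted into code or comments.

```python
import json
import os
from urllib.parse import urlencode
from urllib.request import urlopen

GOVINFO_BASE = "https://api.govinfo.gov"


def crec_package_id(date_iso):
    """Build a Congressional Record package ID, e.g. CREC-2018-01-04 (assumed format)."""
    return f"CREC-{date_iso}"


def package_url(package_id, api_key, content="summary"):
    """Return the URL for one package; `content` is assumed to be e.g. 'summary' or 'pdf'."""
    query = urlencode({"api_key": api_key})
    return f"{GOVINFO_BASE}/packages/{package_id}/{content}?{query}"


def fetch_summary(date_iso):
    """Fetch the JSON summary for one day (needs network access and a valid key)."""
    api_key = os.environ["GOVINFO_API_KEY"]  # hypothetical variable name
    with urlopen(package_url(crec_package_id(date_iso), api_key)) as resp:
        return json.load(resp)
```

Only `fetch_summary` touches the network; the two URL helpers can be checked offline before running a bulk download.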

ngbdsb commented 1 year ago

Requesting JSON in Python: https://stackoverflow.com/questions/9733638/how-to-post-json-data-with-python-requests

cindyheqy commented 1 year ago

(Senate) CREC-2018-01-04.pdf, CREC-2018-01-04.txt
(House) CREC-2019-02-04.pdf, CREC-2019-02-04.txt

Hi, I tried the API, extracted some PDFs, and converted them to txt. Here are two of them. Could you please take a look at the differences between the two formats and think about these questions?

  1. Do we need more processing/cleaning before analyzing polarity? (e.g., some words are split into individual letters.)
  2. Do I need to save all the pdf files as well as the txt files, or is keeping just the txt files fine? (Keeping the pdfs would help when we need to compare the two formats; the only problem is file size.)
  3. How can we filter the txt files to keep only House debate content related to congressional legislation?
  4. What else should I do before I extract all the txt files?

ngbdsb commented 1 year ago

  1. Preferably yes. Please push all the files and I can take a look.
  2. Keeping both txt and pdf is fine.
  3. Every debate speech starts with a line like "xxx recognizes xxx for xxx minutes".
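
As a rough sketch of point 3, the recognition line could be located with a regular expression. This assumes the time is written in digits (e.g. "for 5 minutes") and that the line may wrap across two lines in the txt file; the sample sentence and the function name are made up for illustration.

```python
import re

# "recognizes <who> for <n> minute(s)"; DOTALL lets <who> span a line break.
RECOGNIZES_RE = re.compile(
    r"recognizes\s+(?P<who>.+?)\s+for\s+(?P<minutes>\d+)\s+minutes?\b",
    re.IGNORECASE | re.DOTALL,
)


def find_recognitions(text):
    """Return (who, minutes) for every recognition line found in the text."""
    results = []
    for match in RECOGNIZES_RE.finditer(text):
        who = " ".join(match.group("who").split())  # collapse wrapped whitespace
        results.append((who, int(match.group("minutes"))))
    return results


sample = (
    "The SPEAKER pro tempore. The Chair recognizes the gentleman from\n"
    "Texas (Mr. Poe) for 5 minutes.\n"
)
print(find_recognitions(sample))
```

Times spelled out in words ("five minutes") would be missed by this pattern and would need a separate rule.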

cindyheqy commented 1 year ago

  1. The pdf and txt files for one year are already too large. I will upload only the txt files for the remaining three years, since we only run text analysis on the txt files. I will keep the code for fetching the pdfs in case we need them for reference.
  2. A speech starts with "Mr. xxx. Mr. Speaker", "Ms. xxx. Mr. Speaker", "Mr. xxx. Madam Speaker", or "Ms. xxx. Madam Speaker", and ends with "THE SPEAKER".
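
The start/end rule in point 2 could be sketched roughly as below. Several things here are assumptions rather than facts about the files: that each speech header begins a line, that the presiding officer appears as "The SPEAKER" or "THE SPEAKER", and that "Mrs." should be accepted alongside "Mr." and "Ms.". The sample text and function name are made up.

```python
import re

# Header such as "Mr. POE of Texas. Mr. Speaker" at the start of a line.
SPEECH_START_RE = re.compile(
    r"^[ \t]*(?P<title>Mr|Ms|Mrs)\.[ \t]+(?!Speaker\b)"
    r"(?P<name>[A-Z][A-Za-z'\- ]*?)\.[ \t]+(?:Mr\.|Madam)[ \t]+[Ss]peaker\b",
    re.MULTILINE,
)
# The presiding officer taking over marks the end of a speech.
SPEECH_END_RE = re.compile(r"\bT(?:he|HE)\s+SPEAKER\b")


def extract_speeches(text):
    """Return (speaker, speech_text) pairs; each speech ends at the next
    'The SPEAKER' marker, the next speech header, or the end of the text."""
    starts = list(SPEECH_START_RE.finditer(text))
    speeches = []
    for i, match in enumerate(starts):
        limit = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        end_match = SPEECH_END_RE.search(text, match.end(), limit)
        end = end_match.start() if end_match else limit
        body = " ".join(text[match.start():end].split())
        speeches.append((f"{match.group('title')}. {match.group('name')}", body))
    return speeches


sample = (
    "  The SPEAKER pro tempore. The Chair recognizes the gentleman from\n"
    "Texas (Mr. Poe) for 5 minutes.\n"
    "  Mr. POE of Texas. Mr. Speaker, I rise today to discuss\n"
    "the budget.\n"
    "  The SPEAKER pro tempore. The Chair recognizes the gentlewoman from\n"
    "Ohio (Ms. Kaptur) for 5 minutes.\n"
    "  Ms. KAPTUR. Madam Speaker, I thank the Chair.\n"
)
for speaker, body in extract_speeches(sample):
    print(speaker, "->", body)
```

A sentence inside a speech that happens to begin a line with something like "Mr. Smith. Mr. Speaker" would be treated as a new header, so the output is worth spot-checking against a few pdfs.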