CityOfNewYork / CROL-Overview

City Record Online parsing libraries and supporting files
Parsing AdditionalDescriptions #21

Closed mattalhonte closed 9 years ago

mattalhonte commented 9 years ago

Extracting values to fill variables 26-30 in the schema: https://docs.google.com/spreadsheets/d/1str6vjjHS5EA_2ww9r4WjHA1t32Z00uLLbviegTc8WI/edit#gid=1430366155

Current next step(s):

cds-amal commented 9 years ago

I created a corpus from all the CSVs in the sample database by scraping the 'additionaldescription' field and removing HTML artifacts. Perhaps it would be helpful.

To use it, extract the archive and append the path of the crow folder to your nltk.data.path, or point a corpus reader straight at the folder as in the session below.

>>> import os
>>> import nltk
>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader

>>> corpusdir = os.path.join(os.getcwd(),'nltk_data', 'crow')
>>> crow = PlaintextCorpusReader(corpusdir, '.*')
>>> crow
<PlaintextCorpusReader in '/Volumes/Sofai/python-projects/nltk3/nltk_data/crow'>
>>> crow.fileids()
["Administration for Children's Services.txt", 'Aging.txt', 'Board Meetings.txt', 'Board of Correction.txt', 'Board of Education Retirement System.txt', 'Board of Standards and Appeals.txt', 'Borough President - Bronx.txt', 'Borough President - Brooklyn.txt', 'Borough President - Manhattan.txt', 'Borough President - Queens.txt', 'Brooklyn Bridge Park.txt', 'Build NYC Resource Corporation.txt', 'Business Integrity Commission.txt', 'Campaign Finance Board.txt', 'Chief Medical Examiner.txt', 'City Council.txt', 'City Planning Commission.txt', 'City Planning.txt', 'City Record.txt', 'City University.txt', 'Citywide Administrative Services.txt', 'Community Boards.txt', 'Comptroller.txt', 'Conflicts of Interest Board.txt', 'Consumer Affairs.txt', 'Correction.txt', 'Design Commission.txt', 'Design and Construction.txt', 'District Attorney - Bronx County.txt', 'District Attorney - New York County.txt', 'Economic Development Corporation.txt', 'Education.txt', "Employees' Retirement System.txt", 'Environmental Control Board.txt', 'Environmental Protection.txt', 'Equal Employment Practices Commission.txt', 'Finance.txt', 'Financial Information Services Agency.txt', 'Fire Department.txt', 'Health and Hospitals Corporation.txt', 'Health and Mental Hygiene.txt', 'Homeless Services.txt', 'Housing Authority.txt', 'Housing Preservation and Development.txt', 'Hudson River Park Trust.txt', 'Human Resources Administration.txt', 'Industrial Development Agency.txt', 'Information Technology and Telecommunications.txt', 'Landmarks Preservation Commission.txt', 'Law Department.txt', 'Loft Board.txt', "Mayor's Fund to Advance New York City.txt", "Mayor's Office of Contract Services.txt", "Mayor's Office of Criminal Justice.txt", "Mayor's Office of Environmental Remediation.txt", 'NYC & Company.txt', 'Office of Emergency Management.txt', 'Office of Labor Relations.txt', 'Office of Management and Budget.txt', 'Office of the Mayor.txt', 'Parks and Recreation.txt', 'Police.txt', 
'Probation.txt', 'Public Library - Queens.txt', 'Sanitation.txt', 'School Construction Authority.txt', 'Supreme Court.txt', 'Taxi and Limousine Commission.txt', 'Transportation.txt', 'Trust for Governors Island.txt', 'Youth and Community Development.txt']
>>> corpusText = nltk.Text(crow.words())
>>> corpusText
<Text: NOTICE IS HEREBY GIVEN that a Public Hearing...>
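Since the corpus is split into one file per agency, a per-file word count is a natural next thing to try. Here is a minimal sketch, assuming the same folder layout as above; `top_words_by_file` is a hypothetical helper name, not something from the repo, and the temporary folder in the demo only stands in for the real crow folder.

```python
import os
import tempfile

import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader


def top_words_by_file(corpus_dir, n=3):
    """Return {fileid: [(word, count), ...]} with the n most common alphabetic words."""
    reader = PlaintextCorpusReader(corpus_dir, r".*\.txt")
    result = {}
    for fileid in reader.fileids():
        # Lowercase and drop punctuation/number tokens before counting.
        words = [w.lower() for w in reader.words(fileid) if w.isalpha()]
        result[fileid] = nltk.FreqDist(words).most_common(n)
    return result


if __name__ == "__main__":
    # Hypothetical stand-in for the crow folder so the sketch runs on its own.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Parks and Recreation.txt")
        with open(path, "w", encoding="utf8") as f:
            f.write("NOTICE IS HEREBY GIVEN that a Public Hearing will be held. "
                    "The hearing is public.")
        print(top_words_by_file(tmp))
```

On the real data you would pass `corpusdir` from the session above instead of the temporary folder.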
mattalhonte commented 9 years ago

Awesome!

mattalhonte commented 9 years ago

So, we've now got a corpus with part-of-speech tags in it. Direct link: https://github.com/mattalhonte/CROL-PDF/blob/master/nltkStuff/taggedCorpus.txt

Here's the process of making one: http://nbviewer.ipython.org/gist/mattalhonte/0a789fb50414be833ae4

Here's how you load it up, after moving to the folder that contains taggedCorpus.txt:

import os

import nltk
from nltk.corpus.reader.tagged import TaggedCorpusReader

# Sentence boundaries: any character followed by a literal period.
# gaps=True means the regex matches are separators, not tokens.
sent_tokenizer = nltk.RegexpTokenizer(r".\.", gaps=True)

# Read taggedCorpus.txt from the current directory as word/TAG tokens.
readTagged = TaggedCorpusReader(os.getcwd(), 'taggedCorpus.txt', sent_tokenizer=sent_tokenizer)
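Once the reader is loaded, a quick sanity check is to count how often each tag shows up. This is a rough sketch, assuming the file stores tokens as word/TAG separated by whitespace (the reader's default `sep='/'`); `tag_counts` is a hypothetical helper, and it uses the reader's default sentence tokenizer rather than the regex above.

```python
from collections import Counter

from nltk.corpus.reader.tagged import TaggedCorpusReader


def tag_counts(corpus_dir, filename):
    """Count how often each part-of-speech tag appears in a word/TAG file."""
    reader = TaggedCorpusReader(corpus_dir, filename)
    # tagged_words() yields (word, tag) tuples across the whole file.
    return Counter(tag for _word, tag in reader.tagged_words())
```

For example, `tag_counts(os.getcwd(), 'taggedCorpus.txt').most_common(10)` would list the ten most frequent tags, which is a cheap way to spot tagging oddities before digging deeper.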

Poked around a little at the tagged data: http://nbviewer.ipython.org/gist/mattalhonte/1a677b4c25081383fae4

Also, made a little guide to some of the built-in methods in NLTK's Text objects: http://nbviewer.ipython.org/gist/mattalhonte/cc76f05c67dbf8e148c7
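For a quick taste of those Text methods without opening the notebook, here is a small sketch on a made-up token list (the sentence is only illustrative, not pulled from the corpus). `token_stats` is a hypothetical helper; `count`, `vocab` and `concordance` are the Text methods being shown.

```python
import nltk


def token_stats(tokens, word, n=3):
    """Return (occurrences of word, n most common tokens) for a token list."""
    text = nltk.Text(tokens)
    # count() is case-sensitive; vocab() returns a FreqDist over the tokens.
    return text.count(word), text.vocab().most_common(n)


if __name__ == "__main__":
    tokens = ("NOTICE IS HEREBY GIVEN that a Public Hearing will be held . "
              "The Public Hearing is open to the public .").split()
    print(token_stats(tokens, "Public"))
    # concordance() prints each occurrence with its surrounding context.
    nltk.Text(tokens).concordance("Hearing", width=60, lines=5)
```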