Closed by mattalhonte 9 years ago
I created a corpus from all the CSVs in the sample database by scraping the 'additionaldescription' field and removing HTML artifacts. Perhaps it would be helpful.
To use it, extract it and append the crow folder path to your nltk.data.path:
>>> import os
>>> import nltk
>>> from nltk.corpus.reader.plaintext import PlaintextCorpusReader
>>> corpusdir = os.path.join(os.getcwd(),'nltk_data', 'crow')
>>> crow = PlaintextCorpusReader(corpusdir, '.*')
>>> crow
<PlaintextCorpusReader in '/Volumes/Sofai/python-projects/nltk3/nltk_data/crow'>
>>> crow.fileids()
["Administration for Children's Services.txt", 'Aging.txt', 'Board Meetings.txt', 'Board of Correction.txt', 'Board of Education Retirement System.txt', 'Board of Standards and Appeals.txt', 'Borough President - Bronx.txt', 'Borough President - Brooklyn.txt', 'Borough President - Manhattan.txt', 'Borough President - Queens.txt', 'Brooklyn Bridge Park.txt', 'Build NYC Resource Corporation.txt', 'Business Integrity Commission.txt', 'Campaign Finance Board.txt', 'Chief Medical Examiner.txt', 'City Council.txt', 'City Planning Commission.txt', 'City Planning.txt', 'City Record.txt', 'City University.txt', 'Citywide Administrative Services.txt', 'Community Boards.txt', 'Comptroller.txt', 'Conflicts of Interest Board.txt', 'Consumer Affairs.txt', 'Correction.txt', 'Design Commission.txt', 'Design and Construction.txt', 'District Attorney - Bronx County.txt', 'District Attorney - New York County.txt', 'Economic Development Corporation.txt', 'Education.txt', "Employees' Retirement System.txt", 'Environmental Control Board.txt', 'Environmental Protection.txt', 'Equal Employment Practices Commission.txt', 'Finance.txt', 'Financial Information Services Agency.txt', 'Fire Department.txt', 'Health and Hospitals Corporation.txt', 'Health and Mental Hygiene.txt', 'Homeless Services.txt', 'Housing Authority.txt', 'Housing Preservation and Development.txt', 'Hudson River Park Trust.txt', 'Human Resources Administration.txt', 'Industrial Development Agency.txt', 'Information Technology and Telecommunications.txt', 'Landmarks Preservation Commission.txt', 'Law Department.txt', 'Loft Board.txt', "Mayor's Fund to Advance New York City.txt", "Mayor's Office of Contract Services.txt", "Mayor's Office of Criminal Justice.txt", "Mayor's Office of Environmental Remediation.txt", 'NYC & Company.txt', 'Office of Emergency Management.txt', 'Office of Labor Relations.txt', 'Office of Management and Budget.txt', 'Office of the Mayor.txt', 'Parks and Recreation.txt', 'Police.txt', 
'Probation.txt', 'Public Library - Queens.txt', 'Sanitation.txt', 'School Construction Authority.txt', 'Supreme Court.txt', 'Taxi and Limousine Commission.txt', 'Transportation.txt', 'Trust for Governors Island.txt', 'Youth and Community Development.txt']
>>> corpusText = nltk.Text(crow.words())
>>> corpusText
<Text: NOTICE IS HEREBY GIVEN that a Public Hearing...>
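For anyone who wants to rebuild a corpus like this from the CSVs, here is a rough sketch of the scraping step described above. It is not the script that produced the crow folder. The 'additionaldescription' field name comes from this comment, but the `agencyname` column, the `clean_html` and `build_corpus` helpers, and the one-file-per-agency layout are assumptions made for illustration, and the regex-based tag stripping is deliberately naive.

```python
import csv
import html
import re
from collections import defaultdict
from pathlib import Path

TAG_RE = re.compile(r"<[^>]+>")  # naive tag matcher; good enough for simple markup


def clean_html(raw):
    """Drop tags, decode entities such as &amp;, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return " ".join(text.split())


def build_corpus(csv_dir, out_dir, agency_col="agencyname",
                 text_col="additionaldescription"):
    """Write one plain-text file per agency from every CSV in csv_dir.

    agency_col is a hypothetical column name; text_col is the field
    mentioned in the comment above.
    """
    by_agency = defaultdict(list)
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                cleaned = clean_html(row.get(text_col) or "")
                if cleaned:  # skip rows with no description text
                    by_agency[row.get(agency_col) or "Unknown"].append(cleaned)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for agency, texts in by_agency.items():
        # Assumes agency names contain no path separators.
        (out / f"{agency}.txt").write_text("\n".join(texts) + "\n", encoding="utf-8")
    return sorted(by_agency)
```

Calling `build_corpus("csvs", os.path.join("nltk_data", "crow"))` would then give a folder that `PlaintextCorpusReader` can read the same way as in the session above.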
Awesome!
So, we've now got a corpus with part-of-speech data in it. Direct link: https://github.com/mattalhonte/CROL-PDF/blob/master/nltkStuff/taggedCorpus.txt
Here's the process of making one: http://nbviewer.ipython.org/gist/mattalhonte/0a789fb50414be833ae4
Here's how you load it up. After moving to the relevant folder...
import pandas as pd
import nltk
from nltk.tag.util import tuple2str
import os
import os.path
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.corpus.reader.tagged import TaggedCorpusReader
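# Raw string avoids an invalid-escape warning; sentences are split wherever any character is followed by a literal period.
readTagged = TaggedCorpusReader(os.getcwd(), 'taggedCorpus.txt', sent_tokenizer=nltk.RegexpTokenizer(r""".\.""", gaps=True))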
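To make the file format concrete, here is a small self-contained sketch of the round trip: writing `word/TAG` tokens with `tuple2str` and reading them back with `TaggedCorpusReader`. The sentences are hand-tagged toy data rather than output from the real tagging run, `toyTagged.txt` is a made-up file name, and the reader here uses its default one-sentence-per-line splitting instead of the regex tokenizer above.

```python
import os
import tempfile

from nltk.corpus.reader.tagged import TaggedCorpusReader
from nltk.tag.util import tuple2str

# Hand-tagged toy sentences standing in for real tagger output.
sentences = [
    [("NOTICE", "NN"), ("IS", "VBZ"), ("HEREBY", "RB"), ("GIVEN", "VBN")],
    [("A", "DT"), ("hearing", "NN"), ("will", "MD"), ("be", "VB"), ("held", "VBN")],
]

with tempfile.TemporaryDirectory() as corpus_dir:
    # One sentence per line, each token written as word/TAG.
    with open(os.path.join(corpus_dir, "toyTagged.txt"), "w", encoding="utf-8") as handle:
        for sent in sentences:
            handle.write(" ".join(tuple2str(pair) for pair in sent) + "\n")

    # Default settings: whitespace-separated tokens, newline-separated sentences.
    reader = TaggedCorpusReader(corpus_dir, "toyTagged.txt")
    tagged = list(reader.tagged_words())
    sentence_count = len(list(reader.tagged_sents()))

# Filter on the tag half of each (word, tag) pair.
nouns = [word for word, tag in tagged if tag == "NN"]
print(sentence_count, nouns)
```

The same filtering works on `readTagged.tagged_words()` once the real corpus is loaded.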
Poked around a little at the tagged data: http://nbviewer.ipython.org/gist/mattalhonte/1a677b4c25081383fae4
Also, made a little guide to some of the built-in methods in NLTK's Text objects: http://nbviewer.ipython.org/gist/mattalhonte/cc76f05c67dbf8e148c7
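For a quick taste without opening the notebook, here is a minimal sketch of a few `Text` methods. The token list is a toy sentence that only borrows the opening words of the corpus, so the numbers say nothing about the real data.

```python
import nltk

# Toy tokens; only the first eight words come from the actual corpus.
tokens = ("NOTICE IS HEREBY GIVEN that a Public Hearing will be held . "
          "The Public Hearing is open to the public .").split()
text = nltk.Text(tokens)

hearing_count = text.count("Hearing")   # exact, case-sensitive token count
first_position = text.index("Hearing")  # position of the first occurrence
frequencies = text.vocab()              # FreqDist over all tokens

# Prints each match with surrounding context; matching ignores case.
text.concordance("hearing", width=60, lines=5)
```

With the real corpus, the same calls work on the `corpusText` object built earlier in this thread.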
Extracting stuff to put into variables 26-30 in the schema: https://docs.google.com/spreadsheets/d/1str6vjjHS5EA_2ww9r4WjHA1t32Z00uLLbviegTc8WI/edit#gid=1430366155
Current next step(s):