chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

read multiple text files from a folder #232

Closed sojohan closed 5 years ago

sojohan commented 5 years ago

Hi experts

I have a folder of txt files that I would like to turn into a corpus and then use for topic modeling.

I have tried the following:

    records = textacy.io.read_text('./DoctorsNotes', lines=True, mode='rt')
    for record in records:
        doc1 = textacy.Doc(record, lang=da)
        print(doc1)

I then get a 'permission denied' error. If I pass a single file instead, it works:

    records = textacy.io.read_text('./DoctorsNotes/text_sample.txt', lines=True, mode='rt')
    for record in records:
        doc1 = textacy.Doc(record, lang=da)
        print(doc1)

    Doc(35 tokens; "Ved fremstilling af net til fixation i mould gj...")
    Doc(34 tokens; "Patienten klarede CT scanning uden problemer. D...")
    Doc(25 tokens; "Constraints til parotis (dxt+sin) kan ikke over...")
    Doc(36 tokens; "Ved plantjek vurderes begge planoplæg. Som note...")
    Doc(33 tokens; "Svært match ved nyopstiling. Patienten lå noget...")
    Doc(32 tokens; "Igen svært match ved nyopstilling. Fysiker igen...")
    Doc(90 tokens; "Konferencepatient: Vurderet på konference grund...")

But how do I get multiple files into a corpus?

Thanks,

bdewilde commented 5 years ago

Hi @sojohan, is DoctorsNotes the directory in which your text files are stored? The textacy.io.read_text() function only accepts the path to a single file on disk, not a directory. If you want to iterate over the files in a given directory, you can use textacy.io.get_filenames():

>>> for fname in textacy.io.get_filenames("./DoctorsNotes", extension=".txt"):
...     record = textacy.io.read_text(fname)
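
For anyone who wants to see the same directory-iteration idea without textacy, here is a minimal standard-library sketch. It assumes the files are UTF-8 encoded and sit directly inside the folder (no recursion). `iter_text_files` is a hypothetical helper name chosen for illustration, not a textacy function.

```python
import tempfile
from pathlib import Path


def iter_text_files(dirpath, extension=".txt", encoding="utf-8"):
    """Yield (filename, text) for each matching file directly inside dirpath.

    Hypothetical helper for illustration; the UTF-8 default is an assumption.
    """
    for path in sorted(Path(dirpath).iterdir()):
        # Skip subdirectories and files with a different extension, so that
        # only regular files are ever opened for reading.
        if path.is_file() and path.suffix == extension:
            yield path.name, path.read_text(encoding=encoding)


if __name__ == "__main__":
    # Demo on a throwaway directory so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "note_a.txt").write_text("First note.", encoding="utf-8")
        Path(tmp, "note_b.txt").write_text("Second note.", encoding="utf-8")
        for name, text in iter_text_files(tmp):
            print(name, len(text))
```

Each yielded text could then be turned into a doc the same way as in the single-file example from the question, and the docs collected into a corpus; the exact textacy calls for that step depend on the installed version.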