Open herrtao opened 8 years ago
@herrtao Wow, somehow this completely slipped by me -- apologies for not responding.
Tethne is primarily designed for cases where you are starting with bibliographic metadata (e.g. from Web of Science, JSTOR, Zotero). If you're just working with a bunch of plain-text files, then there are potentially simpler approaches.
As a starting point, you might take a look at the notebooks in this project. There are several different workflows: the topic modeling sections include notebooks that demonstrate LDA with Tethne/MALLET and with gensim. In particular, this notebook demonstrates LDA with gensim. If you don't have metadata, you can just skip or comment out those parts.
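For a folder of plain-text files with no metadata, the gensim route might look roughly like the sketch below. It is not taken from the notebooks: the helper names (`load_texts`, `tokenize`, `fit_lda`), the simple regex tokenizer, and the parameter values are all assumptions made for illustration.

```python
import re
from pathlib import Path


def load_texts(directory, pattern="*.txt"):
    """Read every matching file in `directory` and return {filename: text}."""
    return {
        path.name: path.read_text(encoding="utf-8", errors="ignore")
        for path in sorted(Path(directory).glob(pattern))
    }


def tokenize(text, min_length=3):
    """Lowercase the text and keep alphabetic tokens of at least `min_length` letters."""
    return [tok for tok in re.findall(r"[a-z]+", text.lower()) if len(tok) >= min_length]


def fit_lda(token_lists, num_topics=5, passes=10):
    """Fit a gensim LDA model on a list of token lists (values here are arbitrary)."""
    # Imported lazily so the helpers above can be used without gensim installed.
    from gensim import corpora, models

    dictionary = corpora.Dictionary(token_lists)
    bow = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = models.LdaModel(corpus=bow, id2word=dictionary,
                          num_topics=num_topics, passes=passes)
    return lda, dictionary


# Example usage (the path is a placeholder):
# texts = load_texts("/path/to/directory/with/texts")
# lda, dictionary = fit_lda([tokenize(t) for t in texts.values()])
# for topic_id, words in lda.print_topics(num_words=10):
#     print(topic_id, words)
```

In practice you would probably also want to remove stop words and very rare terms before fitting, which this sketch leaves out.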
I hope that helps! Let me know if you have any other questions. We can also discuss further off-channel if you'd prefer (erick.peirson@asu.edu).
This will be TETHNE-131
@herrtao Take a look at this thread for a related discussion. It's not exactly what you asked about, but it may be helpful.
@herrtao Ok, as of v0.8.1.dev5 this is now a feature! Since this is a pre-release version you'll have to upgrade Tethne with the --pre flag.
pip install -U tethne --pre
Here's an example. Please let me know what you think. If you run into issues, or have other requests, please check out our new Q/A group.
>>> from tethne.readers.plain_text import read
>>> corpus = read('/path/to/directory/with/texts')
To use the corpus for topic modeling, you could then do:
>>> from tethne import LDAModel  # assumed import path; adjust if your version differs
>>> model = LDAModel(corpus, featureset_name='plain_text')
>>> model.fit(Z=5, max_iter=200)
More documentation will be forthcoming, but here's the docstring for now:
"""
Generate a :class:`.Corpus` from a collection of plain-text files.

Plain-text content will be available as a feature set called "plain_text".
Uses :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader`\.

Parameters
----------
path : str
    Path to a directory containing plain text files.
pattern : str
    (default: '.+\.txt') A RegEx pattern used to select texts for inclusion
    in the corpus. By default will select any file ending in `.txt`.
extractor : function
    This function can be used to parse the name of each file for additional
    metadata. It should accept a single string (the filename), and return
    a dictionary of fields and values. These fields will be added to the
    resulting :class:`.Paper` instance.
index_by : str
    (default: 'fileid') Field on :class:`.Paper` to use as the primary
    index.
structured : bool
    (default: True) If True, the contents of the document collection will be
    represented by a :class:`.StructuredFeatureSet`\. If False, a
    :class:`.FeatureSet` will be used instead. Setting ``structured=False``
    is appropriate if word-order does not matter (e.g. topic modeling).
corpus : bool
    (default: True) If False, will return a list of :class:`.Paper`
    instances rather than a :class:`.Corpus`\.
kwargs : kwargs
    Any additional kwargs will be passed to the
    :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader` constructor.
    Refer to the `NLTK documentation
    <http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader>`_
    for details.

Returns
-------
:class:`.Corpus`
"""
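To make the `pattern` and `extractor` parameters more concrete, here is a small sketch of what an extractor could look like. The file-naming convention (`author_year_title.txt`), the function name `filename_extractor`, and the field names `author`, `date`, and `title` are hypothetical. The sketch also assumes the extractor receives the bare file name and that the pattern is matched against the whole name.

```python
import re

# Default selection pattern quoted in the docstring above.
PATTERN = r'.+\.txt'

# Hypothetical naming convention: author_year_title.txt
NAME_RE = re.compile(r'(?P<author>[^_]+)_(?P<year>\d{4})_(?P<title>.+)\.txt$')


def filename_extractor(filename):
    """Return a dict of metadata parsed from the file name, or {} if it does not fit."""
    match = NAME_RE.match(filename)
    if match is None:
        return {}  # no extra metadata for names outside the convention
    return {
        'author': match.group('author'),
        'date': int(match.group('year')),
        'title': match.group('title').replace('-', ' '),
    }


# Which names the default pattern would select (assuming whole-name matching):
# re.fullmatch(PATTERN, 'notes.txt') matches, re.fullmatch(PATTERN, 'notes.md') does not.

# Going by the docstring, the function would then be passed along like this:
# corpus = read('/path/to/directory/with/texts',
#               extractor=filename_extractor, structured=False)
```

Returning an empty dictionary for names that don't fit the convention keeps those files in the corpus without extra fields, rather than failing on them.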
Thanks for the reply!
Can I use Tethne to do topic modeling on my own .txt files (about 700 different files)?