topic modeling

herrtao commented 8 years ago

Can I use Tethne to do topic modeling on my own txt files? I have about 700 different files.

erickpeirson commented 8 years ago

@herrtao Wow, somehow this completely slipped by me -- apologies for not responding.

Tethne is primarily designed for cases where you are starting with bibliographic metadata (e.g. from Web of Science, JSTOR, Zotero). If you're just working with a bunch of plain-text files, then there are potentially simpler approaches.

As a starting point, you might take a look at the notebooks in this project. There are several different workflows -- in the topic modeling sections, there are notebooks that demonstrate LDA with Tethne/MALLET and with gensim. In particular, this notebook demonstrates LDA with gensim -- if you don't have metadata, you can just skip or comment out those parts.
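For a rough idea of what the gensim route looks like when all you have is a directory of plain-text files, here is a minimal sketch. It is not taken from the notebooks: the helper names (`load_texts`, `tokenize`, `fit_lda`) are made up for illustration, the tokenizer is deliberately crude, and the filtering thresholds and topic count are arbitrary values you would want to tune.

```python
import re
from pathlib import Path


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of three or more letters."""
    return re.findall(r"[a-z]{3,}", text.lower())


def load_texts(directory, pattern="*.txt"):
    """Return (filenames, tokenized documents) for every matching file."""
    paths = sorted(Path(directory).glob(pattern))
    names = [p.name for p in paths]
    docs = [tokenize(p.read_text(encoding="utf-8", errors="ignore")) for p in paths]
    return names, docs


def fit_lda(docs, num_topics=5, passes=10):
    """Fit a gensim LDA model on tokenized documents (requires gensim)."""
    # Imported here so the helpers above work even without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(docs)
    # Drop very rare and very common terms; these thresholds are arbitrary.
    dictionary.filter_extremes(no_below=2, no_above=0.5)
    bow = [dictionary.doc2bow(doc) for doc in docs]
    model = LdaModel(corpus=bow, id2word=dictionary,
                     num_topics=num_topics, passes=passes)
    return model, dictionary, bow
```

You would call `names, docs = load_texts('/path/to/directory/with/texts')`, then `model, dictionary, bow = fit_lda(docs)`, and inspect the result with `model.print_topics()`.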

I hope that helps! Let me know if you have any other questions. We can also discuss further off-channel if you'd prefer (erick.peirson@asu.edu).

erickpeirson commented 8 years ago

This will be TETHNE-131

erickpeirson commented 7 years ago

@herrtao Take a look at this thread for a related discussion. It's not exactly what you asked, but it may be helpful.

erickpeirson commented 7 years ago

@herrtao Ok, as of v0.8.1.dev5 this is now a feature! Since this is a pre-release version you'll have to upgrade Tethne with the --pre flag.

pip install -U tethne --pre

Here's an example. Please let me know what you think. If you run into issues, or have other requests, please check out our new Q/A group.

>>> from tethne.readers.plain_text import read
>>> corpus = read('/path/to/directory/with/texts')

To use the corpus for topic modeling, you could then do:

>>> from tethne.model.corpus.mallet import LDAModel
>>> model = LDAModel(corpus, featureset_name='plain_text')
>>> model.fit(Z=5, max_iter=200)

More documentation will be forthcoming, but here's the docstring for now:

Generate a :class:`.Corpus` from a collection of plain-text files.

Plain-text content will be available as a feature set called "plain_text".

Uses :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader`\.

Parameters
----------
path : str
    Path to a directory containing plain text files.
pattern : str
    (default: '.+\.txt') A RegEx pattern used to select texts for inclusion
    in the corpus. By default will select any file ending in `.txt`.
extractor : function
    This function can be used to parse the name of each file for additional
    metadata. It should accept a single string (the filename), and return
    a dictionary of fields and values. These fields will be added to the
    resulting :class:`.Paper` instance.
index_by : str
    (default: 'fileid') Field on :class:`.Paper` to use as the primary
    index.
structured : bool
    (default: True) If True, the contents of the document collection will be
    represented by a :class:`.StructuredFeatureSet`\. If False, a
    :class:`.FeatureSet` will be used instead. Setting ``structured=False``
    is appropriate if word-order does not matter (e.g. topic modeling).
corpus : bool
    (default: True) If False, will return a list of :class:`.Paper`
    instances rather than a :class:`.Corpus`\.
kwargs : kwargs
    Any additional kwargs will be passed to the
    :class:`nltk.corpus.reader.plaintext.PlaintextCorpusReader` constructor.
    Refer to the `NLTK documentation
    <http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader>`_
    for details.

Returns
-------
:class:`.Corpus`

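To illustrate the `extractor` parameter described in the docstring, here is a sketch of a function that pulls metadata out of filenames. The naming scheme (`<author>_<year>_<title>.txt`) and the field names `author`, `date`, and `title` are assumptions invented for this example; adapt both to however your own files are named.

```python
import os
import re

# Hypothetical naming scheme: <author>_<year>_<title>.txt,
# e.g. "darwin_1859_origin-of-species.txt".
FILENAME_PATTERN = re.compile(
    r"^(?P<author>[^_]+)_(?P<year>\d{4})_(?P<title>.+)\.txt$"
)


def filename_extractor(fileid):
    """Accept a filename and return a dictionary of fields and values."""
    match = FILENAME_PATTERN.match(os.path.basename(fileid))
    if match is None:
        # Filenames that do not follow the scheme contribute no extra metadata.
        return {}
    return {
        "author": match.group("author"),
        "date": int(match.group("year")),
        "title": match.group("title").replace("-", " "),
    }
```

Going by the docstring, this would be passed as `read('/path/to/directory/with/texts', extractor=filename_extractor)`. The docstring also notes that `structured=False` is appropriate when word order does not matter, as in topic modeling.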

herrtao commented 7 years ago

thanks for the reply!

diging / tethne

topic modeling #122