diging / tethne

Python module for bibliographic network analysis.
http://diging.github.io/tethne/
GNU General Public License v3.0

Transform FeatureSet #146

Closed: ghost closed this issue 8 years ago

ghost commented 8 years ago

Hi,

I am referring to your notebook "6. Words and topic modeling.ipynb".

I tried to follow your code and use it for my own WoS corpus. You create a FeatureSet for the abstract field, then apply a filter to it with transform(). The filter is meant to remove the stopwords in stoplist and keep only words whose document frequency is strictly between 2 and 400.

print 'There are {0} features in the abstract featureset.'.format(len(wosCorpus.features['abstract'].index))

filter = lambda f, v, c, dc: f not in stoplist and 2 < dc < 400
wosCorpus.features['abstract_filtered'] = wosCorpus.features['abstract'].transform(filter)

print 'There are {0} features in the abstract featureset.'.format(len(wosCorpus.features['abstract_filtered'].index))

But both FeatureSets have the same length. It looks like no tokens were removed. Is there an error in the filter, or am I missing something?

I had already removed the stopwords beforehand, but the document frequency filtering doesn't seem to work either.

Also, could you explain the abstract_to_features() method mentioned in the notebook? I can't seem to find it.

Thank you.

erickpeirson commented 8 years ago

@epipremnum Some of those notebooks are hopelessly out of date -- I'll take a look soon and see what's going on. We are in bad need of an update to the documentation and examples. :-P

ghost commented 8 years ago

Hi Erick,

Thanks for your reply. I figured it out.

If I only have word counts for the abstract field, do I need a StructuredFeatureSet like in the example? It works for me if I set structured=False.

Thank you.

erickpeirson commented 8 years ago

@epipremnum If you're just topic modeling, structured=False is the way to go.

But it is odd that transform() is not working on the StructuredFeatureSet. I have created TETHNE-126 for this. When you have a moment, can you update this thread with the version of Tethne that you are using? (i.e. pip show tethne)

erickpeirson commented 8 years ago

Ok, I think that I see what happened here. In StructuredFeatureSet, the transform() method was checking explicitly for None to decide what to exclude, so a False return value slipped through and the feature was kept. In FeatureSet, transform() just checks for falsiness, so both None and False exclude the feature. Fix forthcoming.
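For readers following along, here is a minimal sketch of the difference between the two checks. It does not use Tethne's code: the function names, the two-argument filter, and the plain dict of counts are all hypothetical stand-ins chosen only to show why a filter that returns True/False removes nothing under an explicit None check.

```python
# Hypothetical stand-ins for illustration only; this is NOT Tethne's implementation.

def transform_none_check(features, func):
    """Drop a feature only when func returns None (the stricter check)."""
    kept = {}
    for feature, value in features.items():
        new_value = func(feature, value)
        if new_value is not None:   # False is not None, so it is kept
            kept[feature] = new_value
    return kept


def transform_falsy_check(features, func):
    """Drop a feature whenever func returns any falsy value (None, False, 0)."""
    kept = {}
    for feature, value in features.items():
        new_value = func(feature, value)
        if new_value:               # False, None and 0 are all excluded
            kept[feature] = new_value
    return kept


# Made-up example data: word -> count.
stoplist = {'the', 'of'}
counts = {'the': 120, 'network': 7, 'of': 95, 'citation': 4}

# A boolean filter in the style of the one above: returns True or False, never None.
keep = lambda f, v: f not in stoplist

none_result = transform_none_check(counts, keep)
falsy_result = transform_falsy_check(counts, keep)

print(sorted(none_result))   # all four words survive
print(sorted(falsy_result))  # only the non-stopwords survive
```

With the None check every word survives because the filter never returns None, which matches the symptom of both FeatureSets having the same length.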

erickpeirson commented 8 years ago

Ok, this is fixed in 1b70d10, and will make it into v0.8.1-beta.

@epipremnum If you're still using Tethne, it would be great to get your help building our new Q/A group (here). I'm hoping that this can help make up for my slow pace on documentation. Thanks!