aws / amazon-sagemaker-examples

Example 📓 Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using 🧠 Amazon SageMaker.
https://sagemaker-examples.readthedocs.io
Apache License 2.0

SageMaker LDA topic model: how to access the params of the trained model? Also, is there a simple way to capture coherence? #651

Open dusvyat opened 5 years ago

dusvyat commented 5 years ago

I'm new to SageMaker and am running some tests to measure the performance of NTM and LDA on AWS compared with LDA Mallet and the native Gensim LDA model.

I want to inspect the trained models on SageMaker and look at things like which words contribute most to each topic, and also to get a measure of model coherence.

I have been able to get the words with the highest contribution for each topic for NTM on SageMaker by downloading the output file, untarring it, and unzipping it to expose three files: params, symbol.json, and meta.json.
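For reference, the extraction I used for the NTM artifacts looks roughly like this (the local paths are placeholders, so adjust to wherever you downloaded the output):

import os
import tarfile
import zipfile

# extract the NTM output tarball downloaded from S3 (local path is a placeholder)
with tarfile.open('ntm_model.tar.gz') as tar:
    tar.extractall('ntm_model')

# the extracted file is itself a zip archive; unzip it to expose
# params, symbol.json and meta.json
for fname in os.listdir('ntm_model'):
    path = os.path.join('ntm_model', fname)
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            zf.extractall('ntm_model')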

However, when I try to do the same process for LDA, the untarred output file cannot be unzipped.

Maybe I'm missing something, or the process is different for LDA than for NTM, but I have not been able to find any documentation on this. Also, has anyone found a simple way to calculate model coherence?

Any assistance would be greatly appreciated!

cswiercz commented 5 years ago

Duplicated from this Stack Overflow response.

This SageMaker notebook, which dives into the scientific details of LDA, also demonstrates how to inspect the model artifacts, specifically how to obtain the estimates of the Dirichlet prior alpha and the topic-word distribution matrix beta. You can find the instructions in the section titled "Inspecting the Trained Model". For convenience, I will reproduce the relevant code here:

import os
import tarfile
import mxnet as mx

# extract the tarball
tarfile_fname = FILENAME_PREFIX + 'model.tar.gz'  # wherever the tarball is located
with tarfile.open(tarfile_fname) as tar:
    tar.extractall(path=FILENAME_PREFIX)

# obtain the model file (should be the only file starting with "model_")
model_list = [
    fname
    for fname in os.listdir(FILENAME_PREFIX)
    if fname.startswith('model_')
]
model_fname = model_list[0]

# load the contents of the model file into MXNet arrays
alpha, beta = mx.ndarray.load(os.path.join(FILENAME_PREFIX, model_fname))

That should get you the model data. Note that the topics, which are stored as rows of beta, are not presented in any particular order.
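If you also want the top words per topic and a coherence score, here is a minimal sketch building on the alpha/beta arrays above. It assumes you have a vocab list mapping the columns of beta to tokens and the tokenized training documents (tokenized_docs); neither is produced by the snippet above, so substitute whatever you used to build the LDA input. The coherence part uses Gensim's CoherenceModel.

import numpy as np
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# beta is an MXNet NDArray of shape (num_topics, vocab_size); convert to NumPy
beta_np = beta.asnumpy()

# vocab (assumed): list of tokens, where vocab[j] is the word for column j of beta
top_k = 10
top_words_per_topic = [
    [vocab[j] for j in np.argsort(row)[::-1][:top_k]]
    for row in beta_np
]

# tokenized_docs (assumed): the tokenized documents used to train the model
dictionary = Dictionary(tokenized_docs)
coherence_model = CoherenceModel(
    topics=top_words_per_topic,
    texts=tokenized_docs,
    dictionary=dictionary,
    coherence='c_v',
)
print(coherence_model.get_coherence())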