ericproffitt / TopicModelsVB.jl

A Julia package for variational Bayesian topic modeling.

How to prepare docfile and lexfile? #6

Closed grassdew closed 7 years ago

grassdew commented 7 years ago

Hi,

Thanks for developing this amazing package. I have gone through the examples and have a quick question about data preparation. Is there a function within the package that can help users quickly prepare the docfile and lexfile for their own documents? Thanks in advance.

ericproffitt-coopergenomics commented 7 years ago

If you mean from raw text files, then no. It would be very difficult to design a function that could anticipate all the possible formatting idiosyncrasies of raw text.

grassdew commented 7 years ago

Hi ericproffitt,

Thanks. I'm new to Julia and am having trouble creating these files. Can you give me some suggestions? Also, do the terms in the lexfile need to be sorted in alphabetical order for model fitting? Thank you.

ericproffitt-coopergenomics commented 7 years ago

There really isn't any standard way to do it, other than to incrementally build a dictionary mapping integer keys to words from the words in your text file. As for the lexfile: no, it's not required to be sorted in alphabetical order.
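That dictionary-building approach might look like the minimal sketch below. It is an illustration only, not a function from the package: the sample documents, the whitespace tokenization, and the names `lexicon`, `doclines` and `lexlines` are assumptions, and the code targets Julia 1.x. The comma-delimited terms line followed by a counts line mirrors the docfile posted further down this thread, and the key-tab-word layout mirrors the lexfile format described there.

```julia
# Hypothetical raw documents; in practice these would come from your own text files.
docs = ["patient denies tobacco alcohol",
        "history includes tobacco alcohol tobacco"]

lexicon = Dict{String,Int}()   # word => integer key
doclines = String[]            # per document: a line of term keys, then a line of counts

for doc in docs
    counts = Dict{Int,Int}()   # key => number of occurrences in this document
    order = Int[]              # keys in order of first appearance
    for word in split(lowercase(doc))
        # Reuse the existing key, or assign the next unused integer to a new word.
        key = get!(lexicon, String(word), length(lexicon) + 1)
        if !haskey(counts, key)
            push!(order, key)
        end
        counts[key] = get(counts, key, 0) + 1
    end
    push!(doclines, join(order, ","))
    push!(doclines, join([counts[k] for k in order], ","))
end

# One "key<TAB>word" pair per line, ordered by key.
lexlines = [string(key, '\t', word) for (word, key) in sort(collect(lexicon), by = last)]
```

Writing `doclines` and `lexlines` to two text files, one entry per line, would then give a docfile and lexfile pair in this layout. Real text would also need punctuation and stop-word handling, which this sketch leaves out.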

grassdew commented 7 years ago

Thank you.

grassdew commented 7 years ago

Hi ericproffitt,

I have created a docfile for my own dataset and tried to fit the LDA model on the corpus. When I first read in the corpus, it contained 5 docs, but it became empty after I ran the command fixcorp!(corpus). How can I deal with this issue?

ericproffitt-coopergenomics commented 7 years ago

This is difficult to troubleshoot without seeing your docfile. Is it small enough to copy and paste into a comment?

grassdew commented 7 years ago

My docfile looks like this:

13,11,6
1,1,1
17,11,6
1,1,1
12,4,5,10,14,7,6,21
1,1,1,1,1,1,1,1
2,3,8,19,1,9,15,20,18,16
1,2,2,2,1,2,2,1,1,1
5,4
1,1

ericproffitt-coopergenomics commented 7 years ago

Because you don't include a lexfile when you read your corpus, you need to run padcorp!(corp) before you run fixcorp!(corp). The padcorp! function basically produces a generic lexicon based on your docfile (this is discussed in the tutorial).

Just be aware that your output won't be interpretable, since you never tell the model what words actually correspond to these keys.

grassdew commented 7 years ago

Actually, I have my lexfile, which looks like this:

1 capped
2 resides
3 extended
4 smoking
5 history
6 alcohol
7 significant
8 care
9 tracheostomy
10 believe
11 tobacco
12 patient
13 includes
14 heavy
15 tube
16 admission
17 denies
18 discharge
19 facility
20 unfortunately
21 past

But I got an error message after I ran the following command.

corp = readcorp(;docfile="topic_modeling/lda/ldadocs.txt", lexfile="topic_modeling/lda/ldalex.txt", delim=',', counts=true, readers=false, ratings=false, stamps=false)

The error message shows: attempt to access 21*1 Array at index [Colon(),2].

ericproffitt-coopergenomics commented 7 years ago

The lexfile (and userfile) needs to be tab-delimited (I probably should have mentioned this in the tutorial). Replace the spaces with tabs and you should be good just running:

readcorp(docfile=docfile_path, lexfile=lexfile_path)
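The space-to-tab replacement can also be done in Julia rather than by hand. The following is a minimal sketch, assuming Julia 1.x and a lexfile whose lines each start with an integer key followed by whitespace and then the word; `spaces_to_tabs` is a hypothetical helper name, not part of the package.

```julia
# Rewrite "key word" lines (any whitespace between key and word) as "key<TAB>word" lines.
function spaces_to_tabs(text::AbstractString)
    out = String[]
    for line in split(text, '\n')
        stripped = strip(line)
        isempty(stripped) && continue                 # skip blank lines
        m = match(r"^(\d+)\s+(\S.*)$", stripped)      # integer key, whitespace, rest of line
        m === nothing && error("unexpected lexfile line: $(repr(line))")
        push!(out, string(m[1], '\t', m[2]))
    end
    return join(out, '\n') * "\n"
end

fixed = spaces_to_tabs("1 capped\n2 resides\n3 extended\n")
```

For a file on disk, something like `write(lexfile_path, spaces_to_tabs(read(lexfile_path, String)))` would apply the same conversion in place. Lines that are already tab-delimited pass through unchanged.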

grassdew commented 7 years ago

It would be helpful to have this information in the tutorial. By the way, when I run the command train!(lda, iter=150, tol=0.0), I get an error: ArgumentError: Dirichlet: alpha must be a positive vector. It seems to have something to do with the number of topics; it works after I change the number of topics.

grassdew commented 7 years ago

My script for LDA is as follows:

lda = LDA(corp, 3)
train!(lda, iter=50, tol=0.0)

I get the output:

topic 1 topic 2 topic 3
capped 2 capped 2 capped 2

Is there anything wrong with my parameter specification, or is it simply because my sample is too small?

ericproffitt-coopergenomics commented 7 years ago

That's strange that you're getting an error. I just ran the following and I didn't get an error:

corp = readcorp(docfile=docfile_path, lexfile=lexfile_path, counts=true)
lda = LDA(corp, 3)
train!(lda, iter=50, tol=0.0)

These are the topics I got:

topic 1          topic 2          topic 3
tobacco          facility         history
alcohol          tube             smoking
denies           tracheostomy     past
includes         care             heavy
past             extended         patient
unfortunately    unfortunately    believe
facility         discharge        significant
discharge        admission        alcohol
admission        resides          unfortunately
tube             capped           facility
heavy            past             discharge
patient          denies           denies
believe          heavy            admission
tracheostomy     includes         tube
care             patient          includes

Can you post the traceback so I can see where the error is originating?

grassdew commented 7 years ago

I got an error when I ran: lda = LDA(corp,3). Below is the error message.

BoundsError: attempt to access 3×1 Array{Float64,2} at index [Colon(),[13,11,6]]
 in throw_boundserror(::Array{Float64,2}, ::Tuple{Colon,Array{Int64,1}}) at ./abstractarray.jl:363
 in checkbounds at ./abstractarray.jl:292 [inlined]
 in _getindex at ./multidimensional.jl:272 [inlined]
 in getindex at ./abstractarray.jl:760 [inlined]
 in Elogpw(::TopicModelsVB.LDA, ::Int64) at /Users/emily/.julia/v0.5/TopicModelsVB/src/LDA.jl:62
 in updateNewELBO!(::TopicModelsVB.LDA, ::Int64) at /Users/emily/.julia/v0.5/TopicModelsVB/src/LDA.jl:84
 in TopicModelsVB.LDA(::TopicModelsVB.Corpus, ::Int64) at /Users/emily/.julia/v0.5/TopicModelsVB/src/LDA.jl:41
 in include_string(::String, ::String) at ./loading.jl:441
 in include_string(::String, ::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in eval(::Module, ::Any) at ./boot.jl:234
 in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in (::Atom.##67#70)() at /Users/emily/.julia/v0.5/Atom/src/eval.jl:40
 in withpath(::Atom.##67#70, ::Void) at /Users/emily/.julia/v0.5/CodeTools/src/utils.jl:30
 in withpath(::Function, ::Void) at /Users/emily/.julia/v0.5/Atom/src/eval.jl:46
 in macro expansion at /Users/emily/.julia/v0.5/Atom/src/eval.jl:109 [inlined]
 in (::Atom.##66#69)() at ./task.jl:60

ericproffitt commented 7 years ago

When you enter:

corp

does it say you have 5 docs and 21 lex?

Also try entering

lda.beta

what is the dimension of that matrix?

grassdew commented 7 years ago

When I enter corp, it returns 5 docs and 1 lex. I guess that's why there is only one word in each topic. Here is my lexfile; it's tab-delimited now:

1	capped
2	resides
3	extended
4	smoking
5	history
6	alcohol
7	significant
8	care
9	tracheostomy
10	believe
11	tobacco
12	patient
13	includes
14	heavy
15	tube
16	admission
17	denies
18	discharge
19	facility
20	unfortunately
21	past

ericproffitt commented 7 years ago

There should be 21 lex, so make sure your lexfile is formatted correctly. It should be plaintext, with a tab separating each key from its word and one key/word pair per line.
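A quick way to check a file against that description is to test every line against the pattern "integer, tab, word". The sketch below assumes Julia 1.x; `bad_lex_lines` is a hypothetical helper, and allowing spaces inside the word is an assumption rather than something the package documents.

```julia
# Return (line number, content) for every line that is not "integer<TAB>word".
function bad_lex_lines(text::AbstractString)
    problems = Tuple{Int,String}[]
    for (n, line) in enumerate(split(text, '\n'))
        isempty(line) && continue                       # ignore the empty piece after the final "\n"
        # \A and \z anchor to the whole line, so a stray "\r" at the end is reported too.
        if !occursin(r"\A\d+\t[^\t\r\n]+\z", line)
            push!(problems, (n, String(line)))
        end
    end
    return problems
end

well_formed = "1\tcapped\n2\tresides\n"
malformed   = "1 capped\n2\tresides\r\n"   # a space instead of a tab, then a Windows line ending
```

Calling `bad_lex_lines(read(lexfile_path, String))` on a real file lists the offending lines. An empty result means every line matches the expected shape.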

grassdew commented 7 years ago

I created my lexfile in Excel and saved it as tab-delimited text. ldalex.txt

ericproffitt commented 7 years ago

Yeah, it's definitely Excel. Excel adds all sorts of hidden formatting to files. Just open a plaintext editor, type your lexfile in there, and it should work.

grassdew commented 7 years ago

I typed the lexfile in a text editor, but it still doesn't work. The error appears right after I read the corpus.

BoundsError: attempt to access 1×1 Array{Any,2} at index [Colon(),2]
 in throw_boundserror(::Array{Any,2}, ::Tuple{Colon,Int64}) at ./abstractarray.jl:363
 in checkbounds at ./abstractarray.jl:292 [inlined]
 in _getindex at ./multidimensional.jl:272 [inlined]
 in getindex(::Array{Any,2}, ::Colon, ::Int64) at ./abstractarray.jl:760
 in #readcorp#17(::String, ::String, ::String, ::String, ::Char, ::Bool, ::Bool, ::Bool, ::Bool, ::TopicModelsVB.#readcorp) at /Users/emily/.julia/v0.5/TopicModelsVB/src/Corpus.jl:137
 in (::TopicModelsVB.#kw##readcorp)(::Array{Any,1}, ::TopicModelsVB.#readcorp) at ./:0
 in include_string(::String, ::String) at ./loading.jl:441
 in include_string(::String, ::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in eval(::Module, ::Any) at ./boot.jl:234
 in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
 in (::Atom.##67#70)() at /Users/emily/.julia/v0.5/Atom/src/eval.jl:40
 in withpath(::Atom.##67#70, ::Void) at /Users/emily/.julia/v0.5/CodeTools/src/utils.jl:30
 in withpath(::Function, ::Void) at /Users/emily/.julia/v0.5/Atom/src/eval.jl:46
 in macro expansion at /Users/emily/.julia/v0.5/Atom/src/eval.jl:109 [inlined]
 in (::Atom.##66#69)() at ./task.jl:60

ericproffitt commented 7 years ago

Then it's still a formatting issue with the file, possibly involving hidden characters. Each line should be of the form:

(integer key)\t(word)\n

Are you on a Windows machine or Linux/macOS?

grassdew commented 7 years ago

I'm using Atom (a text editor) on a Mac.

ericproffitt commented 7 years ago

Then it's still a formatting issue with the file, not the package. Take a look at your ldalex file in Julia:

ldalex = readdlm(ldalex_path)
ldalex[1]

You get:

"\U3cbbe1c0c\0a\0p\0p\0e\0d\0"

So clearly there's some funky formatting going on.
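The `\0` characters between the letters in that output are what one would expect from a two-byte encoding such as UTF-16, although the thread does not confirm the encoding. One way to look for such problems is to inspect the raw bytes directly. The sketch below assumes Julia 1.x; `describe_bytes` is a hypothetical helper and the sample bytes are made up for illustration.

```julia
# List byte-level signs that a file is not plain UTF-8/ASCII text with "\n" line endings.
function describe_bytes(bytes::Vector{UInt8})
    findings = String[]
    if length(bytes) >= 2 && (bytes[1:2] == [0xff, 0xfe] || bytes[1:2] == [0xfe, 0xff])
        push!(findings, "UTF-16 byte order mark")
    end
    if length(bytes) >= 3 && bytes[1:3] == [0xef, 0xbb, 0xbf]
        push!(findings, "UTF-8 byte order mark")
    end
    0x00 in bytes && push!(findings, "NUL bytes (typical of UTF-16 text)")
    0x0d in bytes && push!(findings, "carriage returns (\\r)")
    !(0x0a in bytes) && push!(findings, "no line feeds (\\n)")
    return findings
end

# Made-up sample: the UTF-16LE encoding of "1\tc\r", preceded by a byte order mark.
sample = UInt8[0xff, 0xfe, 0x31, 0x00, 0x09, 0x00, 0x63, 0x00, 0x0d, 0x00]
findings = describe_bytes(sample)
```

For a real file, `describe_bytes(read(ldalex_path))` performs the same checks, since `read` with only a path returns the file's bytes. An empty result does not prove the file is well formed; it only rules out these particular problems.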

grassdew commented 7 years ago

I got the following.

ldalex = readdlm(ldalex_path)
1×43 Array{Any,2}:
 1 "capped" 2 "resides" 3 … 20 "unfortunately" 21 "past" ""

ldalex[1]
1

ericproffitt commented 7 years ago

The format must have changed after I re-exported it from Excel. But as you can see with the one you posted, it's not right: the array should be 21×2 rather than 1×43. Clearly the file is not formatted correctly; you're missing the newlines for some reason.

Open up TextEdit, choose the Make Plain Text option in the Format menu, type it in there, and it should work.

ericproffitt commented 7 years ago

Oops, I guess Julia shows them with quotes, so it's just the newline characters (\n) which you're missing.
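One possible reason for missing `\n` characters (an assumption, since the thread never identifies the cause) is a file saved with carriage-return line endings, either `\r` alone or `\r\n`. A minimal sketch of normalizing them in Julia 1.x, with `normalize_newlines` as a hypothetical helper name:

```julia
# Convert Windows ("\r\n") and CR-only ("\r") line endings to "\n".
# The "\r\n" pairs are replaced first so they do not turn into two newlines.
normalize_newlines(text::AbstractString) =
    replace(replace(text, "\r\n" => "\n"), "\r" => "\n")

cleaned = normalize_newlines("1\tcapped\r2\tresides\r")
```

Applied to a file, `write(path, normalize_newlines(read(path, String)))` would rewrite it with `\n` line endings, provided the file is already valid UTF-8 text.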

grassdew commented 7 years ago

When I typed them up, I used the Enter key.

grassdew commented 7 years ago

Finally it works!

ericproffitt commented 7 years ago

Yeah, when writing up text from scratch to be used programmatically, unless you are going for a specific file format, I think you should always make it plain Unicode text (I'm not an expert on this). Otherwise you get all sorts of hidden formatting in the text file.
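Another way to avoid editor formatting altogether is to write the file from Julia itself. This is a minimal sketch, assuming Julia 1.x, where `readdlm` lives in the `DelimitedFiles` standard library; the temporary path and the three-word list are placeholders for illustration.

```julia
using DelimitedFiles   # provides readdlm in Julia 1.x

words = ["capped", "resides", "extended"]    # first three words of the lexicon in this thread

path = joinpath(mktempdir(), "ldalex.txt")   # temporary location, for the example only
open(path, "w") do io
    for (key, word) in enumerate(words)
        print(io, key, '\t', word, '\n')     # integer key, tab, word, newline
    end
end

# Read the file back the same way it was inspected earlier in the thread.
lex = readdlm(path, '\t')
dims = size(lex)                             # one row per word, two columns
```

A file written this way contains only the keys, tabs, words and `\n` characters that the code printed, so reading it back should give one row per key/word pair.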

grassdew commented 7 years ago

Thanks for your help. I will take your advice.