grassdew closed this issue 7 years ago
If you mean from the raw text files then no. It would be very difficult to design such a function which could anticipate all the possible formatting idiosyncrasies of raw text.
Hi ericproffitt,
Thanks. I'm new to Julia and have trouble creating these files. Can you give me some suggestions on this? Also, are the terms in the lexfile required to be sorted in alphabetic order for the model fitting? Thank you.
There really isn't any standard way to do it, except to slowly build a dictionary of keys to words based on those words in your text file. As to the lexfile, no it's not required to be sorted in alphabetic order.
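To make "slowly build a dictionary of keys to words" concrete, here is a minimal sketch. The function name `build_corpus_files` and the lowercase, letters-only tokenizer are assumptions made for illustration; they are not part of the package. It emits docfile lines in the layout used later in this thread (a line of keys followed by a line of counts) and tab-separated lexfile lines.

```julia
# Hypothetical helper (not part of TopicModelsVB): assign each distinct word
# an integer key and produce docfile and lexfile lines.
function build_corpus_files(docs::Vector{String})
    lex = Dict{String,Int}()            # word => integer key
    doclines = String[]
    for doc in docs
        counts = Dict{Int,Int}()        # key => count within this document
        order = Int[]                   # keys in order of first appearance
        # Assumed tokenizer: lowercase runs of ASCII letters only.
        for m in eachmatch(r"[a-z]+", lowercase(doc))
            word = String(m.match)
            key = get!(lex, word, length(lex) + 1)
            haskey(counts, key) || push!(order, key)
            counts[key] = get(counts, key, 0) + 1
        end
        push!(doclines, join(order, ","))                        # keys line
        push!(doclines, join([counts[k] for k in order], ","))   # counts line
    end
    # One "key<TAB>word" pair per line, ordered by key.
    lexlines = ["$(key)\t$(word)" for (word, key) in sort(collect(lex), by = last)]
    return doclines, lexlines
end
```

Each list can then be written with something like `write("ldadocs.txt", join(doclines, "\n") * "\n")`. Real text usually needs a more careful tokenizer (stop words, punctuation, stemming), which is exactly the part that is hard to generalize.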
Thank you.
Hi ericproffitt,
I have created a docfile for my own dataset and tried to fit the LDA model on the corpus. When I first read in the corpus it included 5 docs, but it became empty after I ran the command fixcorp!(corpus). How can I deal with this issue?
This is difficult to troubleshoot without seeing your docfile, is it small enough to copy-paste it in a comment?
My docfile looks like this:
13,11,6
1,1,1
17,11,6
1,1,1
12,4,5,10,14,7,6,21
1,1,1,1,1,1,1,1
2,3,8,19,1,9,15,20,18,16
1,2,2,2,1,2,2,1,1,1
5,4
1,1
Because you don't include a lexfile when you read your corpus, you need to run padcorp!(corp) before you run fixcorp!(corp). The padcorp! function basically produces a generic lexicon based on your docfile (this is discussed in the tutorial).
Just be aware that your output won't be interpretable, since you never tell the model what words actually correspond to these keys.
Actually, I have my lexfile, which looks like this:
1 capped
2 resides
3 extended
4 smoking
5 history
6 alcohol
7 significant
8 care
9 tracheostomy
10 believe
11 tobacco
12 patient
13 includes
14 heavy
15 tube
16 admission
17 denies
18 discharge
19 facility
20 unfortunately
21 past
But I got an error message when I ran the following command:
corp = readcorp(;docfile="topic_modeling/lda/ldadocs.txt", lexfile="topic_modeling/lda/ldalex.txt", delim=',',counts=true,readers=false, ratings=false, stamps=false)
The error message shows: attempt to access 21*1 Array at index [Colon(),2].
The lexfile (and userfile) needs to be tab-delimited (I probably should have mentioned this in the tutorial). Replace the spaces with tabs and you should be good just running:
readcorp(docfile=docfile_path, lexfile=lexfile_path)
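As a rough sketch of the "replace the spaces with tabs" step, the snippet below rewrites space-separated pairs as tab-separated ones. The helper name `tabify_lexicon` is hypothetical, and it assumes each line holds exactly one key and one single-word term.

```julia
# Hypothetical helper: turn "key word" lines into "key<TAB>word" lines.
function tabify_lexicon(text::AbstractString)
    lines = String[]
    # Accept Unix, Windows, or bare-CR line endings on input.
    for line in split(text, r"\r\n|\r|\n")
        parts = split(strip(line))      # split on any run of whitespace
        isempty(parts) && continue      # skip blank lines
        length(parts) == 2 || error("expected 'key word', got: $(repr(line))")
        push!(lines, string(parts[1], '\t', parts[2]))
    end
    return join(lines, "\n") * "\n"     # one pair per line, trailing newline
end
```

For example, `write(lexfile_path, tabify_lexicon(read(lexfile_path, String)))` would convert a file in place; keeping a backup first seems wise.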
It would be helpful to have this information in the tutorial. By the way, after I run the command: train!(lda, iter=150, tol=0.0), I get an error: ArgumentError: Dirichlet: alpha must be a positive vector. It seems that this has something to do with the number of topics. It works well after I change the number of topics.
My script for LDA is as follows:
lda = LDA(corp, 3)
train!(lda, iter=50, tol=0.0)
I get the output:
topic 1 topic 2 topic 3
capped 2 capped 2 capped 2
Is there anything wrong with my specification of parameters, or is it simply because my sample size is too small?
That's strange that you're getting an error. I just ran the following and I didn't get an error:
corp = readcorp(docfile=docfile_path, lexfile=lexfile_path, counts=true)
lda = LDA(corp, 3)
train!(lda, iter=50, tol=0.0)
These are the topics I got:
topic 1 topic 2 topic 3
tobacco facility history
alcohol tube smoking
denies tracheostomy past
includes care heavy
past extended patient
unfortunately unfortunately believe
facility discharge significant
discharge admission alcohol
admission resides unfortunately
tube capped facility
heavy past discharge
patient denies denies
believe heavy admission
tracheostomy includes tube
care patient includes
Can you post the traceback so I can see where the error is originating?
I got an error when I ran: lda = LDA(corp,3). Below is the error message.
BoundsError: attempt to access 3×1 Array{Float64,2} at index [Colon(),[13,11,6]]
in throw_boundserror(::Array{Float64,2}, ::Tuple{Colon,Array{Int64,1}}) at ./abstractarray.jl:363
in checkbounds at ./abstractarray.jl:292 [inlined]
in _getindex at ./multidimensional.jl:272 [inlined]
in getindex at ./abstractarray.jl:760 [inlined]
in Elogpw(::TopicModelsVB.LDA, ::Int64) at /Users/emily/.julia/v0.5/TopicModelsVB/src/LDA.jl:62
in updateNewELBO!(::TopicModelsVB.LDA, ::Int64) at /Users/emily/.julia/v0.5/TopicModelsVB/src/LDA.jl:84
in TopicModelsVB.LDA(::TopicModelsVB.Corpus, ::Int64) at /Users/emily/.julia/v0.5/TopicModelsVB/src/LDA.jl:41
in include_string(::String, ::String) at ./loading.jl:441
in include_string(::String, ::String) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in eval(::Module, ::Any) at ./boot.jl:234
in eval(::Module, ::Any) at /Applications/Julia-0.5.app/Contents/Resources/julia/lib/julia/sys.dylib:?
in (::Atom.##67#70)() at /Users/emily/.julia/v0.5/Atom/src/eval.jl:40
in withpath(::Atom.##67#70, ::Void) at /Users/emily/.julia/v0.5/CodeTools/src/utils.jl:30
in withpath(::Function, ::Void) at /Users/emily/.julia/v0.5/Atom/src/eval.jl:46
in macro expansion at /Users/emily/.julia/v0.5/Atom/src/eval.jl:109 [inlined]
in (::Atom.##66#69)() at ./task.jl:60
When you enter:
corp
does it say you have 5 docs and 21 lex?
Also try entering
lda.beta
What are the dimensions of that matrix?
When I enter corp, it returns 5 docs and 1 lex. Guess that's why there is only one word in each topic. Here is my lexfile, it's tab delimited now.
1 capped
2 resides
3 extended
4 smoking
5 history
6 alcohol
7 significant
8 care
9 tracheostomy
10 believe
11 tobacco
12 patient
13 includes
14 heavy
15 tube
16 admission
17 denies
18 discharge
19 facility
20 unfortunately
21 past
There should be 21 lex, make sure your lex file is formatted correctly. It should be plaintext, with tabs separating the keys from the words, with one key/word pair per line.
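One way to check that format before calling readcorp is a small line-by-line validator. This is only a sketch: `lexfile_problems` is a hypothetical name, and the pattern assumes integer keys and single-word terms with no embedded whitespace.

```julia
# Hypothetical helper: report every line that is not "integer<TAB>word".
function lexfile_problems(text::AbstractString)
    problems = String[]
    for (i, line) in enumerate(split(text, '\n'))
        isempty(line) && continue       # skip empty lines, e.g. after a final newline
        if !occursin(r"^\d+\t\S+$", line)
            # repr makes hidden characters such as \r or \0 visible.
            push!(problems, "line $i: $(repr(line))")
        end
    end
    return problems
end
```

Calling `lexfile_problems(read(lexfile_path, String))` should return an empty list for a well-formed file; anything it reports is shown with escapes, so stray characters stand out.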
I created my lexfile in Excel and saved it as tab-delimited text: ldalex.txt
Yeah, it's definitely Excel. Excel adds all sorts of weird hidden formatting to files. Just open up a plaintext editor and type your lexfile in there and it should work.
I typed the lexfile in a text editor, but it still doesn't work. The error appears right after I read the corpus.
BoundsError: attempt to access 1×1 Array{Any,2} at index [Colon(),2]
in throw_boundserror(::Array{Any,2}, ::Tuple{Colon,Int64}) at ./abstractarray.jl:363
in checkbounds at ./abstractarray.jl:292 [inlined]
in _getindex at ./multidimensional.jl:272 [inlined]
in getindex(::Array{Any,2}, ::Colon, ::Int64) at ./abstractarray.jl:760
in #readcorp#17(::String, ::String, ::String, ::String, ::Char, ::Bool, ::Bool, ::Bool, ::Bool, ::TopicModelsVB.#readcorp) at /Users/emily/.julia/v0.5/TopicModelsVB/src/Corpus.jl:137
in (::TopicModelsVB.#kw##readcorp)(::Array{Any,1}, ::TopicModelsVB.#readcorp) at ./
It's still a formatting issue with the file then, possibly including hidden characters. Each line should be of the form:
(integer key)\t(word)\n
Are you on a windows machine or linux/macos?
I'm using Atom(text editor) on a mac.
It's still a formatting issue with the file then, it's not the package. Take a look at your ldalex file in Julia:
ldalex = readdlm(ldalex_path)
ldalex[1]
You get:
"\U3cbbe1c0c\0a\0p\0p\0e\0d\0"
So clearly there's some funky formatting going on.
I got the following.
ldalex = readdlm(ldalex_path)
1×43 Array{Any,2}:
1 "capped" 2 "resides" 3 … 20 "unfortunately" 21 "past" ""
ldalex[1]
1
The format must have changed after you re-exported it from Excel. But as you can see with the one you posted, it's not right. The array should be 21×2 rather than 1×43. Clearly the file is not formatted correctly; you're missing the newlines for some reason.
Open up textedit, choose the Make Plain Text option in the Format menu, and then type it in there, and it should work.
Oops, I guess Julia shows them with quotes, so it's just the newline characters (\n) which you're missing.
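A quick programmatic version of this check is to compare the parsed shape with the expected one. The sketch below also normalizes line endings first, on the assumption (not confirmed anywhere in this thread) that the export ended lines with bare carriage returns rather than \n. Both helper names are hypothetical. On current Julia versions readdlm lives in the DelimitedFiles standard library; on Julia 0.5 it was available without the import.

```julia
using DelimitedFiles

# Hypothetical helper: convert Windows (\r\n) and bare-CR (\r) endings to \n.
normalize_newlines(text::AbstractString) = replace(text, r"\r\n|\r" => "\n")

# Hypothetical helper: shape of the tab-delimited lexicon as readdlm parses it.
# A correct lexicon with n terms should come back as (n, 2).
lexicon_shape(text::AbstractString) = size(readdlm(IOBuffer(text), '\t'))
```

If `lexicon_shape(read(ldalex_path, String))` is not (21, 2) here but `lexicon_shape(normalize_newlines(read(ldalex_path, String)))` is, the line endings were the culprit and writing the normalized text back to the file should fix it.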
When I typed them up, I used the Enter key.
Finally it works!
Yeah, when writing up text from scratch to be used programmatically, unless you are going for a specific file format, I think you should always make it plain Unicode text (I'm not an expert on this). Otherwise you get all sorts of crazy hidden formatting in the text file.
Thanks for your help. I will take your advice.
Hi,
Thanks for developing this amazing package. I have gone through the examples and have a quick question about data preparation. Is there a function within the package that can help users quickly prepare the docfile and lexfile for their own documents? Thanks in advance.