repeat lines in nsf corpus

cjprybol commented 7 years ago

I think lines 1 and 43 in nsfdocs.txt might be malformed. They contain repeat term identifiers and are the only two documents that do so out of the 128K documents.

julia> using TopicModelsVB

julia> nsfcorp = readcorp(:nsf)
Corpus with:
 * 128804 docs
 * 25319 lex
 * 0 users

julia> find(d -> length(nsfcorp.docs[d].terms) != length(unique(nsfcorp.docs[d].terms)), 1:length(nsfcorp.docs))
2-element Array{Int64,1}:
  1
 14

julia> nsfcorp.docs[1]
Document with:
 * 152 terms
 * 0 readers
 * 1.9910801e7 stamp

julia> nsfcorp.docs[14]
Document with:
 * 121 terms
 * 0 readers
 * 1.9900912e7 stamp

julia> length(unique(nsfcorp.docs[1].terms))
76

julia> length(unique(nsfcorp.docs[14].terms))
118

I found this while writing some tests to make sure I understood how the models were structured. I checked the file locally and on GitHub, and both show the same thing: document 1 has 76 unique entries, each repeated once, for both IDs and counts. I didn't check document 14, but I presume something similar. Any chance you have access to a copy of the original file, to see whether this is a mistake in the repository or whether it's present in the original NSF corpus?
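For anyone who wants to repeat the check without loading the package, here is a minimal sketch of the same idea on plain vectors. It assumes Julia 0.7 or later, where `findall` replaces the `find` call used above. `docs_terms`, `has_repeats`, and `id_counts` are hypothetical names for illustration, not part of TopicModelsVB, and the sample data is made up.

```julia
# Hypothetical stand-in for the `terms` field of each document:
# one vector of integer term IDs per document (made-up data).
docs_terms = [
    [3, 7, 9, 3, 7, 9],   # every ID appears twice
    [1, 2, 3],            # no repeats
    [4, 4, 5],            # a single repeated ID
]

# A document is suspect when at least one term ID occurs more than once.
has_repeats(terms) = !allunique(terms)

# Indices of the suspect documents (`findall` on Julia 0.7+, `find` on 0.6).
suspect = findall(has_repeats, docs_terms)

# Count how many times each ID occurs within one document.
function id_counts(terms)
    counts = Dict{Int,Int}()
    for t in terms
        counts[t] = get(counts, t, 0) + 1
    end
    return counts
end

# For each suspect document, list the IDs that occur more than once.
repeated_ids = Dict(
    d => sort([t for (t, n) in id_counts(docs_terms[d]) if n > 1])
    for d in suspect
)
```

Listing the repeated IDs per document should make it easier to tell a fully doubled document from one with only a few stray repeats.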

Great package by the way!

ericproffitt commented 7 years ago

Hi cjprybol,

Thanks for bringing this to my attention!

You can find the original files for the dataset on the repo in the datasets folder. This dataset was taken from UCI's open-source ML datasets collection. It's definitely a mistake in the dataset I formatted, but there's also a chance it's a mistake in their original corpus on UCI's site.

When I get some time I'll fix this and check to see if the error comes from the original corpus. I've been meaning to give this package a much-needed update, as there are several other outstanding issues that need to be addressed. Hopefully I can get to it in the near future.

As an aside, I see you're currently a genetics PhD student at Stanford. I'm a bioinformatics scientist working for a genomics company in the NYC/NJ region, and we're currently growing several of our R&D teams. If you're interested, you can find my name on the paper in my CopyNumberVariation repo and message me on LinkedIn!

baggepinnen commented 7 years ago

Line number 111425 in .julia/v0.6/TopicModelsVB/datasets/nsf/nsftitles.txt contains an invalid character, which causes an error when reading the corpus:

julia> nsfcorp = readcorp(:nsf)
ERROR: at row 111425, column 0 : UnicodeError: invalid character index)
Stacktrace:
 [1] dlm_parse(::String, ::Char, ::Char, ::Char, ::Char, ::Bool, ::Bool, ::Bool, ::Int64, ::Bool, ::Base.DataFmt.DLMOffsets) at ./datafmt.jl:610
 [2] readdlm_string(::String, ::Char, ::Type, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at ./datafmt.jl:343
 [3] #readdlm_auto#11(::Array{Any,1}, ::Function, ::String, ::Char, ::Type{T} where T, ::Char, ::Bool) at ./datafmt.jl:132
 [4] #readdlm#8 at ./datafmt.jl:114 [inlined]
 [5] readdlm(::String, ::Char, ::Type{T} where T, ::Char) at ./datafmt.jl:114
 [6] #readdlm#4 at ./datafmt.jl:54 [inlined]
 [7] readdlm(::String, ::Char, ::Type{T} where T) at ./datafmt.jl:54
 [8] #readcorp#17(::String, ::String, ::String, ::String, ::Char, ::Bool, ::Bool, ::Bool, ::Bool, ::TopicModelsVB.#readcorp) at /local/home/fredrikb/.julia/v0.6/TopicModelsVB/src/Corpus.jl:151
 [9] (::TopicModelsVB.#kw##readcorp)(::Array{Any,1}, ::TopicModelsVB.#readcorp) at ./<missing>:0
 [10] readcorp(::Symbol) at /local/home/fredrikb/.julia/v0.6/TopicModelsVB/src/Corpus.jl:431

The function call succeeds if the invalid character is removed.
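In case it helps locate similar problems, below is a rough sketch of how lines with invalid UTF-8 could be found and cleaned. It assumes Julia 1.x, where a `String` may hold invalid bytes and `isvalid` reports them; this differs from the v0.6 session above. `invalid_utf8_lines` and `strip_invalid` are hypothetical helper names, and the demonstration uses a temporary file rather than the real nsftitles.txt, since the actual offending byte is not shown in this thread.

```julia
# Return the 1-based numbers of the lines in `path` that are not valid UTF-8.
function invalid_utf8_lines(path::AbstractString)
    bad = Int[]
    for (i, line) in enumerate(eachline(path))
        isvalid(line) || push!(bad, i)
    end
    return bad
end

# Drop malformed characters from one line, keeping everything else.
strip_invalid(line::AbstractString) = filter(isvalid, line)

# Demonstration on a temporary file containing one stray 0xff byte.
path, io = mktemp()
write(io, "first title\n")
write(io, "second ", UInt8[0xff], " title\n")
write(io, "third title\n")
close(io)

bad_lines = invalid_utf8_lines(path)                   # lines needing attention
cleaned = [strip_invalid(l) for l in eachline(path)]   # sanitized copy of every line
rm(path)
```

Dropping the malformed character is only one option; replacing it with the intended character would be better if the original title can be recovered from the source dataset.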

ericproffitt commented 7 years ago

There are a number of breaking changes in v0.6 that cause problems. I should be releasing the v0.6-compatible version in the next month or so.

ericproffitt commented 7 years ago

OK, both of these issues should be fixed now. Please open a new issue if they aren't.

ericproffitt / TopicModelsVB.jl, issue #4