Closed ComputerMaestro closed 5 years ago
I have made some changes. Please check the .travis.yml.
Some more adjustments are needed; I will do them shortly.
I will write tests for IMDB and Twitter as soon as possible. For SST-2 I will make another PR.
I have added the tests as well.
Thanks. Tests seem to be failing, could you take a look at that?
Sure @oxinabox, I will look at it and let you know when it is rectified.
ERROR: LoadError: InitError: could not open file /home/travis/build/JuliaText/CorpusLoaders.jl/test/WikiCorpus_DataDeps.jl
I am not able to figure out the cause of this error in Travis. The problem is that the `WikiCorpus_DataDeps.jl` file is inside the `src` folder and is included inside the `init` function of the `CorpusLoaders` module; `CorpusLoaders` is then referred to inside the `test` folder, and since that folder doesn't contain `WikiCorpus_DataDeps.jl`, it cannot be opened inside the `init` function.
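One common way to make such an include path independent of the working directory (so that `Pkg.test`, which runs from `test/`, can still resolve a file that lives in `src/`) is to build the path from `@__DIR__`. A minimal sketch of the idea; the file name matches the one discussed above, but whether this is the actual fix used in the PR is an assumption:

```julia
# Sketch: resolve the data-deps file relative to the current source file,
# not the process working directory. @__DIR__ expands to the directory
# containing the file in which it appears, so the same path works whether
# the code is loaded from src/ or exercised from test/ via `Pkg.test`.
const DATADEPS_FILE = joinpath(@__DIR__, "WikiCorpus_DataDeps.jl")
@assert isabspath(DATADEPS_FILE)  # always absolute, regardless of cwd
# Inside the module one would then write, e.g.:
#     include(DATADEPS_FILE)
println(DATADEPS_FILE)
```

With a relative path like `include("WikiCorpus_DataDeps.jl")` inside a function, resolution can depend on where Julia was started, which matches the "could not open file" error seen on Travis.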
@Ayushk4 how is https://github.com/JuliaText/WordTokenizers.jl/pull/13 going? it would be good to be using that here.
I am pretty sure we do not need the DataFrames.jl dependency, only the CSV one. CSV has a dependency on DataFrames, but we don't ever call any methods from the DataFrames namespace.
I have removed `DataFrames.jl`.
I tried making a separate file for `init`, but the problem is still the same. Please check whether it is done correctly or not. I will try something else and let you know.
There is some precompilation problem with the `CSV` module as well.
Sorry for the silly question, but can you tell me how to set `DATADEPS_ALWAYS_ACCEPT` to accept without a prompt? Is it by directly setting this to true, or something else?
`ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"` within Julia, or something like `export DATADEPS_ALWAYS_ACCEPT="true"` in bash before you start Julia.
For Travis CI, see their docs: https://docs.travis-ci.com/user/environment-variables/
Thank you
Now the `CSV` problem is solved. But one of your tests in `test_wikicorpus.jl` is failing due to an anomalous word: `StringIndexError("Chögyam", 4)`. So for now I have commented out that test on line 28; please suggest what to do with it.
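For context, this error comes from Julia's byte-based string indexing: `ö` occupies two bytes in UTF-8, so byte index 4 falls in the middle of a character. A small illustration with this exact word:

```julia
s = "Chögyam"                 # 'ö' is a two-byte UTF-8 character at bytes 3–4
@assert s[3] == 'ö'           # 3 is a valid character index
# s[4]                        # would throw StringIndexError: byte 4 is mid-character
@assert collect(eachindex(s)) == [1, 2, 3, 5, 6, 7, 8]  # valid indices skip 4
@assert nextind(s, 3) == 5    # the safe way to advance past 'ö'
```

Iterating with `for c in s`, or stepping with `eachindex`/`nextind` instead of hard-coded byte offsets, may be a less fragile fix than commenting the test out.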
@oxinabox, I saw the Travis and AppVeyor results: there, Julia nightly and Julia 1.0.3 passed all the tests, but Julia 0.7.0 was not able to pass the SemCor tests. I tested it on my machine, and there the SemCor and Senseval3 tests are not passing. I think the problem exists inside `Base.Iterators`, because the codes are exactly the same; the problem comes when we check the types of the outputs.
I compared results from Julia 1.0.3 and Julia 0.7; the problem was in the `flatten_levels` function from the `MultiResolutionIterators` module. In both versions I am using v0.5 of `MultiResolutionIterators`, so I checked the code of `flatten_levels` in the `MultiResolutionIterators` repo. There the `apply_at_level` function is used inside the `flatten_levels` definition, and one of the arguments of `apply_at_level` is `Base.Iterators.flatten`.
When I use `flatten_levels`, the output type is the same in both Julia versions, but when the output of `flatten_levels` is iterated in a for loop, it behaves differently in the two versions. The behavior in Julia 1.0.3 is correct (expected), but in Julia 0.7.0 it is incorrect for the tests.
I used this in both versions:
julia> words = consolidate(flatten_levels(docs, (!lvls)(SemCor, :word)))
Julia 0.7: the type of `words` is this, which is not what is expected:
julia> typeof(words)
Array{Base.Iterators.Flatten,1}
This is what `words` is in Julia 0.7:
julia> words
21459-element Array{Base.Iterators.Flatten,1}:
Base.Iterators.Flatten{Base.Generator{Base.Iterators.Zip1{PosTaggedWord},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}}(Base.Generator{Base.Iterators.Zip1{PosTaggedWord},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}(getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}(getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}(5, Core.Box(getfield(MultiResolutionIterators, Symbol("#inner#3")){getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}},Int64}(getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}}(Base.Iterators.flatten, [1, 2, 3, 5]), 5, Core.Box(#= circular reference @-2 =#))))), Base.Iterators.Zip1{PosTaggedWord}(PosTaggedWord("DT", "The"))))
⋮
I used the same `words` declaration in Julia 1.0.3 and got the expected type:
julia> typeof(words)
Array{TaggedWord,1}
This is `words` in Julia 1.0.3 (and this is the expected form):
julia> words
21459-element Array{TaggedWord,1}:
PosTaggedWord("DT", "The")
SenseAnnotatedWord("NNP", "group", 1, "1:03:00::", "Fulton_County_Grand_Jury")
SenseAnnotatedWord("VB", "say", 1, "2:32:00::", "said")
SenseAnnotatedWord("NN", "friday", 1, "1:28:00::", "Friday")
PosTaggedWord("DT", "an")
SenseAnnotatedWord("NN", "investigation", 1, "1:09:00::", "investigation")
PosTaggedWord("IN", "of")
SenseAnnotatedWord("NN", "atlanta", 1, "1:15:00::", "Atlanta")
PosTaggedWord("POS", "'s")
SenseAnnotatedWord("JJ", "recent", 2, "5:00:00:past:00", "recent")
SenseAnnotatedWord("NN", "primary_election", 1, "1:04:00::", "primary_election")
SenseAnnotatedWord("VB", "produce", 4, "2:39:01::", "produced")
⋮
But when I used the line below in both Julia versions, I got the same output:
julia> flatten_levels(docs, (!lvls)(SemCor, :word))
Base.Iterators.Flatten{Base.Generator{Base.Iterators.Zip1{Array{Document{Array{Array{Array{TaggedWord,1},1},1},String},1}},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}}(Base.Generator{Base.Iterators.Zip1{Array{Document{Array{Array{Array{TaggedWord,1},1},1},String},1}},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}(getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}(getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}(1, Core.Box(getfield(MultiResolutionIterators, Symbol("#inner#3")){getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}},Int64}(getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}}(Base.Iterators.flatten, [1, 2, 3, 5]), 5, Core.Box(#= circular reference @-2 =#))))), Base.Iterators.Zip1{Array{Document{Array{Array{Array{TaggedWord,1},1},1},String},1}}(Document{Array{Array{Array{TaggedWord,1},1},1},String}[Document{Array{Array{Array{TaggedWord,1},1},1},String}("br-a01", Array{Array{TaggedWord,1},1}[[[PosTaggedWord("DT", "The"), SenseAnnotatedWord("NNP", "group", 1, "1:03:00::", "Fulton_County_Grand_Jury"), SenseAnnotatedWord .......
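A minimal, self-contained illustration of the behaviour being compared, using plain nested vectors instead of the SemCor data (so `flatten_levels` itself is not involved): materialising a lazy `Base.Iterators.Flatten` with `collect` should yield a vector of the leaf element type, which is what Julia 1.0.3 produces above and Julia 0.7.0 apparently does not.

```julia
# Two "sentences" of "words"; flattening one level should give a flat word list.
nested = [["The", "group"], ["said", "Friday"]]

flat = Iterators.flatten(nested)   # lazy: a Base.Iterators.Flatten value
words = collect(flat)              # materialise it

@assert words == ["The", "group", "said", "Friday"]
@assert eltype(words) == String    # leaf element type, not a Flatten iterator
```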
This has passed all the tests because this does not contain Julia 0.7. But `CorpusLoaders.jl` is working for Julia 1.0.3.
I have made the suggested changes. Please check them out. And sorry for the silly mistakes.
I tried to set `DATADEPS_ALWAYS_ACCEPT` to `true` in the "Changes made" commit, but the failing tests in Travis and AppVeyor are showing that `DATADEPS_ALWAYS_ACCEPT` is not set to `true`.
No need to be sorry. This is what code review is for. This is the system working as intended.
I am really busy this week and might not get back to reviewing this until next week.
Okay @oxinabox, till then I will be working on TextAnalysis, and if I get any idea how to solve the Julia 0.7.0 issue I will fix that as well.
For now I have only added the sentiment part of the Stanford sentiment dataset.
Some of the AppVeyor tests are failing because of a timeout while downloading the WikiCorpus dataset; otherwise all tests have passed.
I still have not done a full review, and am still really busy, but here is something to work on.
A general thing: I do not think strings are a suitable type for sentiment, ever. I think an Enum might be suitable, especially if we map the labels to good numbers.
For labels, are you talking about `neg`, `pos`?
For using an `enum` here, can I map `neg` to -1, `pos` to 1, and `neutral` to 0?
@ComputerMaestro that sounds good to me
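The mapping agreed on above can be expressed directly with Julia's `@enum`, which lets each label carry its numeric value. A sketch (the type and label names follow the discussion, but the final names in the PR may differ):

```julia
# Sentiment labels with the numeric mapping discussed:
# neg => -1, neutral => 0, pos => 1
@enum Sentiment neg=-1 neutral=0 pos=1

@assert Int(neg) == -1
@assert Int(neutral) == 0
@assert Int(pos) == 1
@assert Sentiment(-1) == neg   # round-trips from the integer value
```

Compared to strings, this gives type safety (a `Sentiment` field can only hold the three labels) and a free conversion to the "good numbers" mentioned above.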
@oxinabox, I have added enums instead of strings for the sentiment datasets. Please check whether it is the right way or not.
We should get rid of DelimitedFiles and just use CSV.jl. It's better and should be able to do all the same things.
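As a rough sketch of what the replacement might look like (the column names and tab-separated layout here are assumptions, not the dataset's actual schema): `CSV.File` iterates rows with named columns, where `DelimitedFiles.readdlm` returns a bare matrix. Parsing from an in-memory buffer keeps the sketch self-contained:

```julia
using CSV

# Hypothetical data in the shape of a sentiment file: tab-separated, with header.
data = IOBuffer("phrase\tsentiment\ngood movie\t1\nbad movie\t-1\n")
rows = CSV.File(data; delim='\t')

# Columns are accessed by name instead of by matrix index.
labels = [row.sentiment for row in rows]
@assert labels == [1, -1]
```

In the package itself, `CSV.File(path; delim='\t')` would be used on the downloaded file instead of an `IOBuffer`.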
@oxinabox, I have structured the Stanford sentiment dataset in a different way. I have removed `DelimitedFiles`.
@oxinabox, is this fine now, or are any changes left?
I'm pretty busy and am hoping @aviks can take a look.
I have three more datasets: WikiText-103, WikiText-2, and the Yelp dataset. Should we add them as well?
In a separate PR. Small PRs are much easier to review and can be merged faster.
Is this fine? Or are any changes left?
Sorry this is taking so long. The overall quality of the code, in terms of cleanliness and idiomatic programming, is still not great.
I think after the comments I made are addressed it will be acceptable.
I checked that all the above tests are failing due to some problem in the toktok tokenizer in WordTokenizers v0.5.2; they pass when used with WordTokenizers v0.4.0. I will ask Ayush.
For record keeping purposes - this also addresses #13 in a way.
Thanks! Let's get this in.
I have added just the datadeps. I will be adding the files, and once it is done I will let you know, @oxinabox.