JuliaText / CorpusLoaders.jl

A variety of loaders for various NLP corpora.

Adding IMDB, Twitter and Stanford Sentiment datasets #18

Closed ComputerMaestro closed 5 years ago

ComputerMaestro commented 5 years ago

I have added just the datadeps. I will be adding the files, and once that is done I will let you know, @oxinabox.
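
For reference, a DataDeps.jl registration generally looks something like the sketch below; the name, URL, and checksum here are placeholders for illustration, not the actual values added in this PR.

using DataDeps

register(DataDep(
    "IMDB movie reviews",                      # placeholder name
    "Large Movie Review Dataset; please cite the original authors.",
    "https://example.com/aclImdb_v1.tar.gz",   # placeholder URL
    "0000000000000000000000000000000000000000000000000000000000000000";  # placeholder SHA256
    post_fetch_method = unpack,                # extract the archive after download
))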

ComputerMaestro commented 5 years ago

I have made some changes. Please check the travis.yml.

ComputerMaestro commented 5 years ago

Some more adjustments are needed; I will do them shortly.

ComputerMaestro commented 5 years ago

I will write tests for IMDB and Twitter as soon as possible. For SST-2 I will make another PR.

ComputerMaestro commented 5 years ago

I have added the tests as well.

oxinabox commented 5 years ago

Thanks. The tests seem to be failing; could you take a look at that?

ComputerMaestro commented 5 years ago

Sure @oxinabox, I will look at it and let you know when it is rectified.

ComputerMaestro commented 5 years ago

ERROR: LoadError: InitError: could not open file /home/travis/build/JuliaText/CorpusLoaders.jl/test/WikiCorpus_DataDeps.jl

I am not able to work out the cause of this error on Travis. The problem is that the WikiCorpus_DataDeps.jl file is inside the src folder and is included inside the __init__ function of the CorpusLoaders module. CorpusLoaders is then loaded from the test folder, and since that folder doesn't contain WikiCorpus_DataDeps.jl, the file cannot be opened inside __init__.
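
One way to avoid this (a sketch of a common pattern, not necessarily how it was resolved in this PR) is to include the file with a path anchored to the module's own source directory, so it resolves correctly regardless of where Julia is started from:

module CorpusLoaders

using DataDeps

# @__DIR__ expands to the directory of this source file (src/), so the include
# does not depend on the current working directory or on the test folder.
include(joinpath(@__DIR__, "WikiCorpus_DataDeps.jl"))

function __init__()
    # Only the registration call runs here; the file itself is included above.
    init_wikicorpus_datadep()   # hypothetical function defined in WikiCorpus_DataDeps.jl
end

end # module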

oxinabox commented 5 years ago

@Ayushk4 how is https://github.com/JuliaText/WordTokenizers.jl/pull/13 going? It would be good to be using that here.

oxinabox commented 5 years ago

I am pretty sure we do not need the DataFrames.jl dependency, only the CSV one. CSV has a dependency on DataFrames, but we don't ever call any methods from the DataFrames namespace.

ComputerMaestro commented 5 years ago

I have removed DataFrames.jl.

ComputerMaestro commented 5 years ago

I tried making a separate file for __init__, but the problem is still the same. Please check whether it is done correctly or not.

ComputerMaestro commented 5 years ago

I will try doing something else and let you know.

ComputerMaestro commented 5 years ago

There is some precompilation problem with the CSV module as well.

ComputerMaestro commented 5 years ago

Sorry for the silly question, but can you tell me how to set DATADEPS_ALWAYS_ACCEPT so that downloads are accepted without a prompt? Is it done by directly setting it to true, or something else?

oxinabox commented 5 years ago

ENV["DATADEPS_ALWAYS_ACCEPT"] = "true" within julia or something like `export DATADEPS_ALWAYS_ACCEPT="true" in bash before you start julia

For Travis CI, see their docs: https://docs.travis-ci.com/user/environment-variables/
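
For example, a common pattern (just a sketch) is to set it at the top of test/runtests.jl before the package is loaded:

# test/runtests.jl
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"   # accept dataset downloads without prompting

using CorpusLoaders
using Test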

ComputerMaestro commented 5 years ago

Thank you.

ComputerMaestro commented 5 years ago

The CSV problem is solved now, but one of your tests in test_wikicorpus.jl is failing due to an anomalous word: StringIndexError("Chögyam", 4). For now I have commented out that test on line 28; please suggest what to do with it.
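
For reference, this error comes from UTF-8 string indexing in Julia: in "Chögyam" the character 'ö' occupies bytes 3 and 4, so byte index 4 is not the start of a character. A minimal illustration, independent of this PR's code:

julia> s = "Chögyam";

julia> s[4]                  # byte 4 falls inside the multi-byte 'ö'
ERROR: StringIndexError("Chögyam", 4)

julia> s[nextind(s, 3)]      # step to the next valid character index instead
'g': ASCII/Unicode U+0067 (category Ll: Letter, lowercase)

julia> collect(eachindex(s)) # the valid byte indices of the string
7-element Array{Int64,1}:
 1
 2
 3
 5
 6
 7
 8

So any code that walks a string byte-by-byte needs to use eachindex or nextind rather than incrementing the index by 1.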

ComputerMaestro commented 5 years ago

@oxinabox, I looked at the Travis and AppVeyor results: Julia nightly and Julia 1.0.3 passed all the tests, but Julia 0.7.0 could not pass the SemCor tests. I tested it on my machine, and there the SemCor and Senseval3 tests do not pass. I think the problem lies inside Base.Iterators, because the code is exactly the same in both versions; it shows up when we check the types of the outputs.

Comparing the results from Julia 1.0.3 and Julia 0.7.0, the difference is in the flatten_levels function from the MultiResolutionIterators package (I am using v0.5 of MultiResolutionIterators in both versions). I checked the code of flatten_levels in the MultiResolutionIterators repo: the apply_at_level function is used inside the flatten_levels definition, and one of apply_at_level's arguments is Base.Iterators.flatten. When I call flatten_levels, the output type is the same in both Julia versions, but when that output is iterated in a for loop it behaves differently: the behaviour in Julia 1.0.3 is correct (what the tests expect), but in Julia 0.7.0 it is not.

I ran this in both versions:

julia> words = consolidate(flatten_levels(docs, (!lvls)(SemCor, :word)))

In Julia 0.7, the type of words is the following, which is not what is expected:

julia> typeof(words)
Array{Base.Iterators.Flatten,1}

This is what words looks like in Julia 0.7:

julia> words
21459-element Array{Base.Iterators.Flatten,1}:
 Base.Iterators.Flatten{Base.Generator{Base.Iterators.Zip1{PosTaggedWord},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}}(Base.Generator{Base.Iterators.Zip1{PosTaggedWord},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}(getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}(getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}(5, Core.Box(getfield(MultiResolutionIterators, Symbol("#inner#3")){getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}},Int64}(getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}}(Base.Iterators.flatten, [1, 2, 3, 5]), 5, Core.Box(#= circular reference @-2 =#))))), Base.Iterators.Zip1{PosTaggedWord}(PosTaggedWord("DT", "The"))))
 ⋮

I used the same words declaration in Julia 1.0.3 and got the expected type:

julia> typeof(words)
Array{TaggedWord,1}

This is words in Julia 1.0.3 (the expected form):

julia> words
21459-element Array{TaggedWord,1}:
 PosTaggedWord("DT", "The")
 SenseAnnotatedWord("NNP", "group", 1, "1:03:00::", "Fulton_County_Grand_Jury")
 SenseAnnotatedWord("VB", "say", 1, "2:32:00::", "said")
 SenseAnnotatedWord("NN", "friday", 1, "1:28:00::", "Friday")
 PosTaggedWord("DT", "an")
 SenseAnnotatedWord("NN", "investigation", 1, "1:09:00::", "investigation")
 PosTaggedWord("IN", "of")
 SenseAnnotatedWord("NN", "atlanta", 1, "1:15:00::", "Atlanta")
 PosTaggedWord("POS", "'s")
 SenseAnnotatedWord("JJ", "recent", 2, "5:00:00:past:00", "recent")
 SenseAnnotatedWord("NN", "primary_election", 1, "1:04:00::", "primary_election")
 SenseAnnotatedWord("VB", "produce", 4, "2:39:01::", "produced")
 ⋮

But when I ran the line below in both Julia versions, I got the same output:

julia> flatten_levels(docs, (!lvls)(SemCor, :word))
Base.Iterators.Flatten{Base.Generator{Base.Iterators.Zip1{Array{Document{Array{Array{Array{TaggedWord,1},1},1},String},1}},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}}(Base.Generator{Base.Iterators.Zip1{Array{Document{Array{Array{Array{TaggedWord,1},1},1},String},1}},getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}}(getfield(IterTools, Symbol("##1#2")){getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}}(getfield(MultiResolutionIterators, Symbol("##2#5")){Int64}(1, Core.Box(getfield(MultiResolutionIterators, Symbol("#inner#3")){getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}},Int64}(getfield(MultiResolutionIterators, Symbol("##6#7")){typeof(flatten),Array{Int64,1}}(Base.Iterators.flatten, [1, 2, 3, 5]), 5, Core.Box(#= circular reference @-2 =#))))), Base.Iterators.Zip1{Array{Document{Array{Array{Array{TaggedWord,1},1},1},String},1}}(Document{Array{Array{Array{TaggedWord,1},1},1},String}[Document{Array{Array{Array{TaggedWord,1},1},1},String}("br-a01", Array{Array{TaggedWord,1},1}[[[PosTaggedWord("DT", "The"), SenseAnnotatedWord("NNP", "group", 1, "1:03:00::", "Fulton_County_Grand_Jury"), SenseAnnotatedWord .......

ComputerMaestro commented 5 years ago

This run has passed all the tests because it does not include Julia 0.7.0. CorpusLoaders.jl is working on Julia 1.0.3.

ComputerMaestro commented 5 years ago

I have made the suggested changes. Please check them out. And sorry for the silly mistakes.

ComputerMaestro commented 5 years ago

I tried to set DATADEPS_ALWAYS_ACCEPT to true in the "Changes made" commit, but the failing Travis and AppVeyor tests show that DATADEPS_ALWAYS_ACCEPT is not being set to true.

oxinabox commented 5 years ago

And sorry for the silly mistakes.

No need to be sorry. This is what code review is for. This is the system working as intended.

I am really busy this week and might not get back to reviewing this until next week.

ComputerMaestro commented 5 years ago

Okay @oxinabox, till then I will be working on TextAnalysis, and if I get any idea for solving the Julia 0.7.0 problem I will fix that as well.

ComputerMaestro commented 5 years ago

For now I have only added the sentiment part of the Stanford Sentiment dataset.

ComputerMaestro commented 5 years ago

Some of the AppVeyor tests are failing because of a timeout while downloading the WikiCorpus dataset; otherwise all tests have passed.

oxinabox commented 5 years ago

I still have not done a full review, and am still really busy, but here is something to work on.

A general thing: I do not think strings are a suitable type for sentiment, ever. I think an Enum might be suitable, especially if we map the labels to meaningful numbers.

ComputerMaestro commented 5 years ago

For the labels, are you talking about neg and pos?

ComputerMaestro commented 5 years ago

For using an enum here, can I map neg to -1, pos to 1, and neutral to 0?

oxinabox commented 5 years ago

@ComputerMaestro that sounds good to me.
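
A minimal sketch of what such an enum could look like (the name Sentiment is just an assumption for illustration):

julia> @enum Sentiment neg=-1 neutral=0 pos=1

julia> Int(pos), Int(neg), Int(neutral)
(1, -1, 0)

julia> Sentiment(-1)
neg::Sentiment = -1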

ComputerMaestro commented 5 years ago

@oxinabox, I have added enums instead of strings for the sentiment datasets. Please check whether this is the right way to do it.

oxinabox commented 5 years ago

We should get rid of DelimitedFiles and just use CSV.jl. It's better and should be able to do all the same things.
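
As a rough sketch of the kind of replacement meant here (the file name and column names are hypothetical, not the actual dataset files in this PR):

using CSV

# Read a tab-separated file and iterate its rows lazily; CSV.File covers
# the same ground as DelimitedFiles.readdlm for this use case.
for row in CSV.File("sentiment_labels.txt"; delim='\t', header=["id", "label"])
    println(row.id, " => ", row.label)
end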

ComputerMaestro commented 5 years ago

@oxinabox, I have structured the Stanford Sentiment dataset in a different way and removed DelimitedFiles.

ComputerMaestro commented 5 years ago

@oxinabox, is this fine now, or are any changes needed?

oxinabox commented 5 years ago

I'm pretty busy and am hoping @aviks can take a look.

ComputerMaestro commented 5 years ago

I have three more datasets: WikiText-103, WikiText-2, and the Yelp dataset. Should we add them as well?

oxinabox commented 5 years ago

In a separate PR. Small PRs are much easier to review and can be merged faster.

ComputerMaestro commented 5 years ago

Is this fine? Or are any changes left?

oxinabox commented 5 years ago

Is this fine? Or are any changes left?

Sorry this is taking so long. The overall quality of the code, in terms of cleanliness and idiomatic programming, is still not great.

I think that after the comments I made are addressed, it will be acceptable.

ComputerMaestro commented 5 years ago

I checked, and all the above tests are failing due to some problem in the toktok tokenizer in WordTokenizers v0.5.2; they pass with WordTokenizers v0.4.0. I will ask Ayush.

Ayushk4 commented 5 years ago

For record-keeping purposes: this also addresses #13 in a way.

oxinabox commented 5 years ago

Thanks! Let's get this in.