bharatambati / sent-compl

Assessing Relative Sentence Complexity using Incremental CCG parsers

How can I download the pairs of sentences? #2

Open DanieleSchicchi opened 4 years ago

DanieleSchicchi commented 4 years ago

Hello, I would like to download the sentences used to train and test the system in a plain-text format. I looked into the "data" folder, but it contains only numbers, not text.

best

bharatambati commented 4 years ago

Hi,

As mentioned in the README, "Contact Sowmya (sowmya@iastate.edu) and checkout https://bitbucket.org/nishkalavallabhi/complexity-features for data"

-- Bharat, http://bharatambati.com/


DanieleSchicchi commented 4 years ago

I have already checked the Bitbucket repository, but the corpus I downloaded does not seem to be the one used in the paper: it does not contain 117k pairs of sentences.

sivareddyg commented 4 years ago

If you have permission from sowmya.vajjala@nrc-cnrc.gc.ca, we can share the data.

nishkalavallabhi commented 4 years ago

Hello everyone - I don't know why my permission is needed, but I give my permission for this data sharing :-) - Sowmya.

sivareddyg commented 4 years ago

@nishkalavallabhi I thought this data was licensed by you. If it is not, I am happy to update the links to the dataset in our repo.

@DanieleSchicchi I emailed you!

nishkalavallabhi commented 4 years ago

What is the 117K sentence pairs dataset? I don't think I remember. I only had a license for OneStopEnglish (which is much smaller), and that too is publicly released now. https://github.com/nishkalavallabhi/OneStopEnglishCorpus

nishkalavallabhi commented 4 years ago

Okay, I checked your paper. 117K is the Wiki-SimpleWiki dataset after you filtered it. "As evaluation data, we use WIKI and SIMPLEWIKI parallel sentence pairs collected by Hwang et al. (2015), a newer and larger version compared to Zhu et al. (2010)'s collection. We only use the pairs from the section GOOD consisting of 150K pairs. We further removed pairs containing identical sentences which resulted in 117K clean pairs."
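
For reference, a minimal sketch of the filtering step described in the quoted passage (dropping pairs whose Wiki and SimpleWiki sentences are identical). The file name `wiki_simplewiki_good.tsv` and the tab-separated `(complex, simple)` format are assumptions for illustration, not the actual files used in the paper.

```python
def load_pairs(path):
    """Yield (complex, simple) sentence pairs from a tab-separated file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                yield parts[0], parts[1]

def filter_identical(pairs):
    """Keep only pairs where the two sentences actually differ."""
    return [(c, s) for c, s in pairs if c.strip() != s.strip()]

if __name__ == "__main__":
    # Hypothetical input: the GOOD section of the Hwang et al. (2015) pairs.
    pairs = list(load_pairs("wiki_simplewiki_good.tsv"))
    clean = filter_identical(pairs)
    print(f"{len(pairs)} GOOD pairs -> {len(clean)} clean pairs")
```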

sivareddyg commented 4 years ago

I see! Thanks, this makes perfect sense.


DanieleSchicchi commented 4 years ago

Thanks to everyone.