DanieleSchicchi opened this issue 4 years ago
Hi,
As mentioned in the README, "Contact Sowmya (sowmya@iastate.edu) and check out https://bitbucket.org/nishkalavallabhi/complexity-features for data".
-- Bharat, http://bharatambati.com/
On Mon, 25 May 2020 at 20:57, DanieleSchicchi wrote:
Hello, I would like to download the sentences used to train and test the system in plain-text format. I looked into the "data" folder, but it contains only numbers, no text.
Best
I have already checked the Bitbucket repository, but the corpus I downloaded does not seem to be the one used in the paper: it does not contain 117k sentence pairs.
If you have permission from sowmya.vajjala@nrc-cnrc.gc.ca, we can share the data.
Hello everyone - I don't know why my permission is needed, but I give my permission for this data sharing :-) - Sowmya.
@nishkalavallabhi I thought this data was licensed by you. If it is not, I am happy to update the links to the dataset in our repo.
@DanieleSchicchi I emailed you!
What is the 117K sentence-pair dataset? I don't think I remember it. I only had a license for One Stop English (which is much smaller), and that too is publicly released now: https://github.com/nishkalavallabhi/OneStopEnglishCorpus
Okay, I checked your paper. 117K is the Wiki-SimpleWiki dataset after you filtered it. "As evaluation data, we use WIKI and SIMPLEWIKI parallel sentence pairs collected by Hwang et al. (2015), a newer and larger version compared to Zhu et al. (2010)'s collection. We only use the pairs from the section GOOD consisting of 150K pairs. We further removed pairs containing identical sentences which resulted in 117K clean pairs."
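For anyone trying to reproduce the 117K set, here is a minimal sketch of that filtering step. The file name (`good.aligned`) and the tab-separated, last-two-columns layout are assumptions for illustration, not the confirmed format of the Hwang et al. release:

```python
# Minimal sketch: reduce the Hwang et al. (2015) "GOOD" aligned pairs
# (~150K) to the "clean" set by dropping pairs whose normal and simple
# sentences are identical, as described in the paper (~117K remain).
# ASSUMPTIONS: tab-separated input with the normal-Wikipedia and
# Simple-Wikipedia sentences in the last two columns; the real release
# may use a different file name and layout.

def load_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 2:
                pairs.append((cols[-2], cols[-1]))  # (normal, simple)
    return pairs

pairs = load_pairs("good.aligned")  # hypothetical file name
clean = [(n, s) for n, s in pairs if n.strip() != s.strip()]
print(f"{len(pairs)} pairs -> {len(clean)} after removing identical sentences")
```

Whether the original filtering normalized whitespace or case before comparing sentences isn't stated, so exact counts may differ slightly.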
I see! Thanks, this makes perfect sense.
On Wed, 27 May 2020, Sowmya wrote:
- I think the Hwang et al. dataset is publicly available here: http://ssli.ee.washington.edu/tial/projects/simplification/. Your 117K is a filtered version of this. I don't have any connection to it.
Thanks to everyone.