disrpt / sharedtask2023

Repository for DISRPT2023 shared task

Missing eng.pdtb.gum data #1

Closed: zaemyung closed this issue 7 months ago

zaemyung commented 7 months ago

Hi,

Thanks for maintaining the repository. It's a valuable resource for the community. I've encountered an issue while attempting to process underscores within the GUM corpora. Specifically, process_underscores.py seems to miss the files located in eng.pdtb.gum, resulting in underscores being left unprocessed. Could you please look into this matter?
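
For reference, here is the quick check I used to see which corpora still contain underscored token forms (my own throwaway helper, not part of the repo; it assumes the standard data/ layout and tab-separated .tok files with the word form in the second column):

import glob
import os

# Throwaway check: report .tok files whose FORM column still contains "_"
# after running the underscore-restoration script.
data_dir = os.path.join("..", "data")  # adjust to your own checkout

for corpus in sorted(os.listdir(data_dir)):
    for path in glob.glob(os.path.join(data_dir, corpus, "*.tok")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if not line.startswith("#") and len(fields) > 1 and fields[1] == "_":
                    print(f"Underscores remain in {path}")
                    break

In my case, this flags the files under eng.pdtb.gum.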

Thanks!

amir-zeldes commented 7 months ago

Thanks for reporting, I see the issue... It's actually easy to include eng.pdtb.gum in the underscores script by adding this line at the bottom:

restore_docs(os.sep.join(["..", "data", "eng.pdtb.gum"]), docs2text)

But that only restores the Reddit data. The rest of the tokens, from the non-Reddit data, should be included in the release, but for some reason everything has underscores. @laura-riviere12 do you know where this issue comes from? I would upload the data, but I'm not sure which version is current on the repo. If you can add the non-underscored data directly to the repo for all GUM genres except Reddit, I can just add that line to the end of process_underscores.py and it should work (or you can add it yourself if you like).
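
For context, the tail of process_underscores.py would then look roughly like this (the existing restore_docs call shown below is only an illustration and may differ in the current checkout; the last line is the addition):

# existing restoration call(s) - example only, your copy may list different corpora
restore_docs(os.sep.join(["..", "data", "eng.rst.gum"]), docs2text)
# added: also restore the PDTB version of GUM
restore_docs(os.sep.join(["..", "data", "eng.pdtb.gum"]), docs2text)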

laura-riviere12 commented 7 months ago

Hi @zaemyung, can you tell me which release you are dealing with, please? You can use https://github.com/disrpt/sharedtask2023/releases/tag/v1.0 if you want the data from the 2023 shared task. We are currently working on an updated version of the data, which also adds new corpora. The script that replaces the underscores is not yet ready to handle both versions of GUM (RST and PDTB). A new, clean repository with all these updates will be available soon, along with the publication of our new paper (@ LREC 2024).

zaemyung commented 7 months ago

Hi @laura-riviere12, I was using the latest main branch, since I saw there had been some updates since v1.0. Thanks for the clarification! And congrats on your new paper; I'm looking forward to reading it.

amir-zeldes commented 7 months ago

I see, thanks @laura-riviere12 - let me know if you need anything from my end.

@zaemyung if you don't need the exact version from the shared task but prefer the most up-to-date and largest data, you can also use the official GUM 10 release, which has more tokens and data from four additional genres:

https://github.com/amir-zeldes/gum/tree/master/rst/disrpt

There is a script in the same repo for reconstructing the Reddit data for that release; see the instructions there. Similar data can also be built for 8 more 'extreme' genres, each with a small number of documents for out-of-domain testing, in a separate corpus called GENTLE. For more info on GUM/GENTLE, see the website.

zaemyung commented 7 months ago

@amir-zeldes Perfect, thanks!!