bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #854 #855

Open Miking98 opened 1 year ago

Miking98 commented 1 year ago

Add the Paragraph-Level Simplification of Medical Texts dataset. Closes #854

galtay commented 1 year ago

@Miking98 thanks for this contribution! we are in the middle of updating our contribution guidelines to support hub datasets. Can I ask that we hold off on merging this until the new guidelines are published and/or can you update your PR to include an implementation in the hub_repos directory?

this is the PR that has the new contribution guidelines https://github.com/bigscience-workshop/biomedical/pull/850

and this is an example of a PR contributing code to the hub_repos directory (but it won't be easily testable until the PR above is merged) https://github.com/bigscience-workshop/biomedical/pull/852

Miking98 commented 1 year ago

Thanks for the note @galtay, makes sense! Will hold off until the new guidelines are published in that case, then will revise and submit a new pull request once updated to abide by them. Thanks!

galtay commented 1 year ago

hello @Miking98 thanks for your patience! we have a new CONTRIBUTING.md file now (https://github.com/bigscience-workshop/biomedical/blob/main/CONTRIBUTING.md) and I was wondering if you'd help us try it out. Please ping me if there are any issues and I'll help get this dataset loader in.

Miking98 commented 1 year ago

Thanks for the note @galtay! I just went through the revised Contributing doc and updated my pull request accordingly. Please let me know your thoughts.