bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create dataset loader for Conflate Dataset #163

Open galtay opened 2 years ago

galtay commented 2 years ago
trishalaneeraj commented 2 years ago

self-assign

jason-fries commented 2 years ago

Hi @trishalaneeraj, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8. You can respond to this comment or ping us on Slack or Discord.

No worries if you are not finished but still intend to work on this!

trishalaneeraj commented 2 years ago

Hi @jason-fries, I'm still working on this

jason-fries commented 2 years ago

Hi @trishalaneeraj, just a ping on the status of this dataset. Please let us know if you are still working on it and when you plan to submit a PR. Thanks!!

trishalaneeraj commented 2 years ago

Hi @jason-fries, I had some trouble reading in the data during my first attempt and I haven't had a chance to revisit it. The files are gzipped XML files. Any guidance on the best way to work with them?
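
For reference, one common way to read gzipped XML in Python is to open the file with `gzip.open` and stream it through `xml.etree.ElementTree.iterparse`, so the whole document never has to sit in memory. The sketch below is only an illustration: the helper name `iter_gzipped_xml`, the file path, and the `record` tag are hypothetical, since the thread does not describe the actual schema of the dataset files.

```python
import gzip
import xml.etree.ElementTree as ET
from typing import Dict, Iterator


def iter_gzipped_xml(path: str, tag: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per `tag` element found in a gzip-compressed XML file.

    `tag` is an assumed element name; replace it with whatever element
    wraps a single example in the real files.
    """
    # gzip.open in binary mode returns a file-like object that iterparse
    # can consume directly, decompressing as it reads.
    with gzip.open(path, "rb") as handle:
        for _event, elem in ET.iterparse(handle, events=("end",)):
            if elem.tag == tag:
                # Flatten the element's direct children into {child_tag: text}.
                yield {child.tag: (child.text or "").strip() for child in elem}
                # Drop the element's children and text to keep memory bounded.
                elem.clear()


if __name__ == "__main__":
    # Hypothetical path and tag, for illustration only.
    for row in iter_gzipped_xml("data/sample.xml.gz", "record"):
        print(row)
```

If the real files use XML namespaces, the tag passed in would need the `{namespace}name` form that `ElementTree` reports, and nested fields would need more than the flat child lookup shown here.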