bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Closes #927 #928

kai-car closed 1 month ago

kai-car commented 1 month ago

The current version of drugprot.py only includes the train and validation splits. I therefore adjusted the drugprot.py data loading script to also load the test_background split, since the .tsv files are already present in the data folder. Note that the test_background split does not contain any relations.
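To illustrate the idea, here is a minimal, stdlib-only sketch of how a loader could build examples for a split that has abstracts and entities but no relations file. This is not the actual drugprot.py code: the function names (`read_tsv`, `build_examples`) and the column layouts are assumptions made for this example only.

```python
import csv
import io
from collections import defaultdict


def read_tsv(text):
    """Parse tab-separated text into a list of non-empty rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row for row in reader if row]


def build_examples(abstracts_tsv, entities_tsv, relations_tsv=None):
    """Yield one dict per abstract.

    Assumed (hypothetical) column layouts:
      abstracts: pmid, title, abstract
      entities:  pmid, entity_id, type, start, end, text
      relations: pmid, relation_type, Arg1:<id>, Arg2:<id>
    When relations_tsv is None (e.g. test_background), relations stay empty.
    """
    entities = defaultdict(list)
    for pmid, ent_id, ent_type, start, end, text in read_tsv(entities_tsv):
        entities[pmid].append(
            {"id": ent_id, "type": ent_type, "offsets": (int(start), int(end)), "text": text}
        )

    relations = defaultdict(list)
    if relations_tsv is not None:
        for pmid, rel_type, arg1, arg2 in read_tsv(relations_tsv):
            relations[pmid].append(
                {"type": rel_type, "arg1": arg1.split(":", 1)[1], "arg2": arg2.split(":", 1)[1]}
            )

    for pmid, title, abstract in read_tsv(abstracts_tsv):
        yield {
            "pmid": pmid,
            "title": title,
            "abstract": abstract,
            "entities": entities[pmid],
            "relations": relations[pmid],
        }


# Toy data, invented for this example.
abstracts = "123\tA title\tAspirin inhibits COX-1.\n"
entities = "123\tT1\tCHEMICAL\t0\t7\tAspirin\n123\tT2\tGENE-Y\t17\t22\tCOX-1\n"
relations = "123\tINHIBITOR\tArg1:T1\tArg2:T2\n"

train = list(build_examples(abstracts, entities, relations))
test_background = list(build_examples(abstracts, entities))  # no relations file
```

Making the relations input optional keeps a single code path for all splits, so a split without relations simply yields examples with an empty relations list.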

See also HuggingFace pull request: https://huggingface.co/datasets/bigbio/drugprot/discussions/1/files

phlobo commented 1 month ago

Thank you for the PR! The loader script that needs to be adapted is the one under hub_repos, though.

Please take a look at the contribution guide, where you can also find how to format the code and execute tests (currently, the test output doesn't reflect your changes).

kai-car commented 1 month ago

Hi, thanks for the feedback. I adjusted the PR accordingly. This time I followed the steps properly, so it should work now. 👍

phlobo commented 1 month ago

Thank you for your changes!

I'm getting the following error running the unit tests:

AssertionError: Dataloader attribute 'Creative Commons Attribution 4.0 International' not valid for _LICENSE must be one of {'GPL_2p0_WITH_BISON_EXCEPTION', 'PDDL_1p0', ...}

It's not related to your fix, but could you please add the correct license key in your PR? I guess it should be CC_BY_4p0.
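For context, the check behind this error can be sketched roughly as follows. This is a simplified illustration, not the repository's actual test code: `VALID_LICENSE_KEYS` and `check_license` are hypothetical names, and the set lists only the keys mentioned in this thread.

```python
# Hypothetical subset of accepted license keys; the real set is longer.
VALID_LICENSE_KEYS = {"CC_BY_4p0", "PDDL_1p0", "GPL_2p0_WITH_BISON_EXCEPTION"}


def check_license(value):
    """Return value if it is an accepted key, otherwise raise AssertionError."""
    if value not in VALID_LICENSE_KEYS:
        raise AssertionError(
            f"Dataloader attribute '{value}' not valid for _LICENSE "
            f"must be one of {sorted(VALID_LICENSE_KEYS)}"
        )
    return value


# A short key passes the check.
accepted = check_license("CC_BY_4p0")

# The full license name is rejected, as in the error above.
try:
    check_license("Creative Commons Attribution 4.0 International")
    rejected = False
except AssertionError:
    rejected = True
```

The point is that the check compares against short keys rather than the human-readable license name, so the full name fails even though it describes the same license.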

Also, I still see some differences when running black. Did you run the formatting step (https://github.com/bigscience-workshop/biomedical/blob/main/CONTRIBUTING.md#5-format-your-code)?

kai-car commented 1 month ago

Hope the code adjustments fix the problems. :)