Closed nachollorca closed 1 year ago
Hej @nachollorca!
Thanks for implementing the data set. Generally, the code looks fine to me. However, the source schema is fairly technical (roughly an exact representation of BioC); you could have opted for a more abstract schema here. Still, I think this is perfectly fine.
I just have one minor issue. Please run the code formatting once more:
make check_file=bigbio/hub/hub_repos/bronco/bronco.py
Best, Mario
@mariosaenger you are welcome!
Yes, I tried to keep the source schema identical to the provided .xml, which indeed follows the BioC format. If you have any specific ideas in mind, I can give them a shot.
I applied the formatting changes already.
All the best!
For many of the NER corpora we rely on a schema quite analogous to the kb schema and (may) extend it with data-set-specific information. For BRONCO this could be something like:
"entities": [
    {
        "id": datasets.Value("string"),
        "type": datasets.Value("string"),
        "text": datasets.Sequence(datasets.Value("string")),
        "offsets": datasets.Sequence([datasets.Value("int32")]),
        "levelOfTruth": datasets.Value("string"),
        "localisation": datasets.Value("string"),
        "normalized": [
            {
                "db_name": datasets.Value("string"),
                "db_id": datasets.Value("string"),
            }
        ],
    }
],
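To make the suggestion a bit more concrete, here is a minimal sketch of what one entity record under that schema might look like, together with a small consistency check. It uses plain Python only; the record values (id, entity type, attribute values, database name and code) are invented placeholders and are not taken from the corpus, and `check_entity` is a hypothetical helper, not part of bigbio.

```python
# Hypothetical entity record following the schema suggested above.
# Every value here is an invented placeholder, not real BRONCO data.
example_entity = {
    "id": "doc1_T1",
    "type": "DIAGNOSIS",
    "text": ["Mammakarzinom"],
    "offsets": [[10, 23]],
    "levelOfTruth": "speculative",
    "localisation": "L",
    "normalized": [{"db_name": "ICD10GM", "db_id": "C50.9"}],
}

EXPECTED_KEYS = {
    "id", "type", "text", "offsets",
    "levelOfTruth", "localisation", "normalized",
}


def check_entity(entity):
    """Return a list of problems found in one entity record (empty if none)."""
    missing = EXPECTED_KEYS - entity.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    problems = []
    # Each text span needs exactly one (start, end) offset pair.
    if len(entity["text"]) != len(entity["offsets"]):
        problems.append("text and offsets differ in length")
    for span, (start, end) in zip(entity["text"], entity["offsets"]):
        if end - start != len(span):
            problems.append(f"offset ({start}, {end}) does not match span {span!r}")
    # Normalizations carry only a database name and an identifier.
    for norm in entity["normalized"]:
        if set(norm) != {"db_name", "db_id"}:
            problems.append(f"unexpected normalization keys: {sorted(norm)}")
    return problems


print(check_entity(example_entity))
```

The two BRONCO-specific attributes sit directly on the entity, so a consumer that only knows the kb-style fields can ignore them.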
@mariosaenger is this ok to merge?
As far as I am concerned, it can be merged, unless @nachollorca wants to adapt the source schema?
Ey @mariosaenger, @hakunanatasha. The changes are not difficult and I would definitely make them. I am just on a tight schedule at the moment. You can either merge now and I'll submit another PR once I touch it up, or wait until I have the time.
Closes #865. Implementation of a loader for the BRONCO150 dataset. It touches on two of the matters currently discussed in the Discord group: unary relations and K-fold train splits.
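For context on the K-fold point: when a corpus ships as K fixed folds rather than one train/test split, a consumer typically holds out each fold in turn and trains on the rest. The sketch below illustrates that idea in plain Python. It is not how this loader exposes the folds; the fold names, document ids, and the `kfold_splits` helper are all hypothetical.

```python
def kfold_splits(folds):
    """Yield (held_out_name, train_docs, test_docs), holding out each fold once."""
    for held_out in folds:
        # Training data is everything outside the held-out fold.
        train = [
            doc
            for name, docs in folds.items()
            if name != held_out
            for doc in docs
        ]
        yield held_out, train, list(folds[held_out])


# Hypothetical fold names and document ids, for illustration only.
folds = {
    "fold_1": ["doc_a", "doc_b"],
    "fold_2": ["doc_c"],
    "fold_3": ["doc_d", "doc_e"],
}

for name, train, test in kfold_splits(folds):
    print(name, len(train), len(test))
```

Whether a loader should expose each fold as its own split or pre-combine them is exactly the kind of design question raised here; the sketch only shows the consumer side.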
Checkbox
- Create the dataloader script hub/hub_repos/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
- Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
- Implement _info(), _split_generators() and _generate_examples() in dataloader script.
- Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
- Confirm dataloader script works with the datasets.load_dataset function.
- Confirm that the dataloader script passes the test suite run with python -m tests.test_bigbio_hub <dataset_name> [--data_dir /path/to/local/data] --test_local.
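As a rough illustration of the first two checklist items, the sketch below checks a module-like object for the required variables and checks a dataset name against the naming rule. Both helpers are hypothetical and not part of the bigbio test suite; allowing digits in the name is an assumption on my side, since the checklist only mentions lowercase and underscore.

```python
import re
from types import SimpleNamespace

# Variable names as listed in the checklist.
REQUIRED_VARIABLES = [
    "_CITATION", "_DATASETNAME", "_DESCRIPTION", "_HOMEPAGE", "_LICENSE",
    "_URLs", "_SUPPORTED_TASKS", "_SOURCE_VERSION", "_BIGBIO_VERSION",
]


def missing_variables(module):
    """Return the checklist variables that the module-like object lacks."""
    return [name for name in REQUIRED_VARIABLES if not hasattr(module, name)]


def is_valid_dataset_name(name):
    """Lowercase letters and underscores; digits are allowed as an assumption."""
    return re.fullmatch(r"[a-z0-9_]+", name) is not None


# Stand-in for an imported dataloader module (hypothetical contents).
fake_module = SimpleNamespace(_CITATION="", _DATASETNAME="bronco")

print(missing_variables(fake_module))
print(is_valid_dataset_name("bronco"), is_valid_dataset_name("Bronco-150"))
```

The official check remains the test suite command in the last checklist item; this is only a quick pre-flight sketch.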