Closed alex-hh closed 13 hours ago
TODO: we need from_yaml_dict to load the bio features, since this is how the info gets loaded from the dataset card. There is possibly a second from_arrow_schema step; not sure exactly when this applies.
If we override Features, then from_dataset_card_data doesn't need overriding: this fixes all of the dataset card stuff. However, we may also need to override arrow schema saving and loading. We also definitely need to override Features.encode_example and Features.decode_example.
DatasetInfosDict.from_dataset_card_data gets invoked in the module factory, and it in turn invokes from_yaml_dict. DatasetInfo.from_yaml_dict needs to load the biodatasets thing as well as the huggingface thing. In the other direction, to_dataset_card_data calls DatasetInfo._to_yaml_dict. These DatasetInfo methods call Features._to_yaml_list and Features._from_yaml_list, so we probably do need to override datasets.Features.
Basically, just modifying from yaml list, to yaml list, from arrow schema and to arrow schema should handle everything.
It's tricky to get everything working with anything other than a direct Features override: Features is hardcoded in InfosDict, which in turn is hardcoded in the module factory.
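A minimal sketch of the yaml-list half of that override, using a simplified stand-in rather than the real datasets.Features (which is much richer). BaseFeatures, BioFeatures, and the bio_type field are hypothetical names for illustration only; just the method names _to_yaml_list and _from_yaml_list come from the notes above.

```python
class BaseFeatures(dict):
    """Hypothetical, heavily simplified stand-in for datasets.Features."""

    def _to_yaml_list(self):
        # One entry per column, in insertion order.
        return [{"name": name, "dtype": dtype} for name, dtype in self.items()]

    @classmethod
    def _from_yaml_list(cls, yaml_list):
        return cls({entry["name"]: entry["dtype"] for entry in yaml_list})


class BioFeatures(BaseFeatures):
    """Hypothetical override that round-trips an extra per-column 'bio_type'."""

    def __init__(self, *args, bio_types=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.bio_types = dict(bio_types or {})

    def _to_yaml_list(self):
        # Start from the base entries and attach the bio annotation where present.
        yaml_list = super()._to_yaml_list()
        for entry in yaml_list:
            if entry["name"] in self.bio_types:
                entry["bio_type"] = self.bio_types[entry["name"]]
        return yaml_list

    @classmethod
    def _from_yaml_list(cls, yaml_list):
        # Split the bio annotation off, then let the base class parse the rest.
        bio_types = {e["name"]: e["bio_type"] for e in yaml_list if "bio_type" in e}
        plain = [{k: v for k, v in e.items() if k != "bio_type"} for e in yaml_list]
        features = super()._from_yaml_list(plain)
        features.bio_types = bio_types
        return features
```

Because both directions go through the subclass, code that only calls _to_yaml_list and _from_yaml_list (as the DatasetInfo methods are described as doing) would not need to know about the extra field.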
The point is that there are two cases:
This also happens in Dataset init:
inferred_features = Features.from_arrow_schema(arrow_table.schema)
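A rough sketch of why the arrow schema path may matter: if the bio feature info is not stored in the schema metadata, anything that rebuilds features from the schema can only fall back to plain inferred types. This uses plain dicts as a stand-in for arrow schema metadata (bytes keys, bytes values). The b"huggingface" key and its info/features layout are my understanding of what datasets writes, and the b"biodatasets" key and both helper functions are hypothetical.

```python
import json

HF_KEY = b"huggingface"    # assumed: key datasets uses for its own schema metadata
BIO_KEY = b"biodatasets"   # hypothetical key for the extra bio feature info


def features_to_metadata(hf_features, bio_features):
    """Build schema-style metadata carrying both feature descriptions."""
    return {
        HF_KEY: json.dumps({"info": {"features": hf_features}}).encode(),
        BIO_KEY: json.dumps({"features": bio_features}).encode(),
    }


def features_from_metadata(metadata, inferred_features):
    """Prefer bio metadata, then huggingface metadata, then inferred features."""
    metadata = metadata or {}
    if BIO_KEY in metadata:
        return json.loads(metadata[BIO_KEY].decode())["features"], "bio"
    if HF_KEY in metadata:
        return json.loads(metadata[HF_KEY].decode())["info"]["features"], "hf"
    # No stored feature info at all: only the inferred types remain.
    return inferred_features, "inferred"
```

The second return value is only there to make the chosen branch visible when experimenting; a real override would return just the features object.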
load_dataset_builder
builder_instance.as_streaming_dataset
builder_instance.download_and_prepare
builder_instance.as_dataset