JohnSnowLabs / langtest

Deliver safe & effective language models
http://langtest.org/
Apache License 2.0
488 stars 36 forks

Feature/add support for other file formats #993

Closed chakravarthik27 closed 5 months ago

chakravarthik27 commented 5 months ago

Description

This pull request extends the data source handling in the BaseDataset and DataFactory classes so that additional file formats can be supported.

Key updates:

Dynamic Data Source Support in BaseDataset: The __init_subclass__ method of BaseDataset now registers data sources dynamically, based on the subclass name. When a class name matches a pandas read method (such as read_excel or read_json), the class is automatically registered as the data source for the corresponding file extensions. Supporting an additional data source therefore only requires defining a new subclass of BaseDataset.

Enhanced Data Source Mapping in DataFactory: The DataFactory class now uses BaseDataset's data_sources dictionary to instantiate the correct dataset class based on the file extension. As a result, each file extension is handled by the class registered for it.
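
The two updates above can be illustrated with a minimal sketch. This is not the code from this pull request: the naming rule (a class called JsonDataset maps to read_json), the EXTENSIONS table, the hard-coded PANDAS_READERS set, and the example subclasses are all assumptions made for illustration. Only BaseDataset, DataFactory, data_sources, and __init_subclass__ come from the description.

```python
from pathlib import Path

# A few real pandas reader names, hard-coded so the sketch runs without pandas.
# (Assumption: the actual implementation presumably inspects pandas itself.)
PANDAS_READERS = {"read_csv", "read_json", "read_excel", "read_parquet"}

# Hypothetical mapping for readers whose name differs from the file extension.
EXTENSIONS = {"excel": ("xlsx", "xls")}


class BaseDataset:
    # Shared registry: file extension -> dataset class.
    data_sources = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Assumed naming rule: "JsonDataset" -> "json" -> "read_json".
        suffix = cls.__name__.replace("Dataset", "").lower()
        if f"read_{suffix}" in PANDAS_READERS:
            # Register the subclass for every extension tied to that reader.
            for ext in EXTENSIONS.get(suffix, (suffix,)):
                BaseDataset.data_sources[ext] = cls

    def __init__(self, file_path):
        self.file_path = file_path


class JsonDataset(BaseDataset):
    """Hypothetical subclass; registered for '.json' on definition."""


class ExcelDataset(BaseDataset):
    """Hypothetical subclass; registered for '.xlsx' and '.xls'."""


class DataFactory:
    def __init__(self, file_path):
        # Pick the dataset class from the registry by file extension.
        ext = Path(file_path).suffix.lstrip(".").lower()
        try:
            dataset_cls = BaseDataset.data_sources[ext]
        except KeyError:
            raise ValueError(f"Unsupported file extension: {ext!r}") from None
        self.dataset = dataset_cls(file_path)


factory = DataFactory("data/sample.xlsx")
print(type(factory.dataset).__name__)
```

In this sketch, defining a further subclass such as a hypothetical ParquetDataset would make '.parquet' files loadable without touching DataFactory, which is the property the description emphasizes.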

These changes make the data processing code more flexible and easier to maintain. Supporting a new data source now requires only a new subclass of BaseDataset, with no manual changes to the DataFactory class.