This pull request changes langtest/datahandler/datasource.py and langtest/tasks/task.py to add support for Spark datasets and improve the handling of file extensions. The main changes are a new SparkDataset class and updates to the __init__ and load methods to accommodate the new dataset type, plus a small fix to label handling in create_sample.
Support for Spark datasets:
Added a new SparkDataset class to handle Spark datasets, including methods for loading raw data and preprocessed data, and initializing a Spark session. (langtest/datahandler/datasource.py)
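For reviewers unfamiliar with the pattern, here is a minimal sketch of what a Spark-backed dataset handler of this shape might look like. The class name `SparkDatasetSketch`, its constructor arguments, and its method names are illustrative assumptions and are not the code in this PR; only the PySpark calls (`SparkSession.builder.appName(...).getOrCreate()` and `spark.read.format(...).load(...)`) are real API.

```python
from typing import Any, Dict, List


class SparkDatasetSketch:
    """Hypothetical stand-in for the PR's SparkDataset; not the actual implementation."""

    def __init__(self, file_path: str, data_format: str = "parquet") -> None:
        self.file_path = file_path
        self.data_format = data_format  # assumed default; the PR may use another format
        self._spark = None  # the Spark session is created lazily on first use

    def _session(self):
        # Imported here so this module can be imported without pyspark installed.
        from pyspark.sql import SparkSession

        if self._spark is None:
            self._spark = SparkSession.builder.appName("langtest-sketch").getOrCreate()
        return self._spark

    def load_raw_data(self) -> List[Dict[str, Any]]:
        """Read the dataset with Spark and return its rows as plain dictionaries."""
        df = self._session().read.format(self.data_format).load(self.file_path)
        return [row.asDict() for row in df.collect()]
```

Creating the session lazily keeps construction cheap and avoids starting Spark when only metadata is needed; whether the PR does this is not stated in the description.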
Improvements to file extension handling:
Updated the __init__ method to set file_ext from the file_path dictionary if provided, and adjusted the logic to handle cases where the source key is present. (langtest/datahandler/datasource.py) [1][2]
Modified the load method to include "spark" as a valid file extension for initializing the data source class. (langtest/datahandler/datasource.py)
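The extension-resolution behavior described above could be sketched roughly as follows. This is one plausible reading of the description, not the PR's code: the helper name `resolve_file_ext`, the `data_source` key, and the rule that a `source` value is used as the loader key are all assumptions made for illustration.

```python
import os
from typing import Dict, Union


def resolve_file_ext(file_path: Union[str, Dict[str, str]]) -> str:
    """Hypothetical helper: pick the key used to select a data source class."""
    if isinstance(file_path, dict):
        # An explicit file_ext in the dictionary wins, e.g. "spark".
        if "file_ext" in file_path:
            return file_path["file_ext"]
        # Assumption: when a source key is present, it names the loader.
        if "source" in file_path:
            return file_path["source"]
        path = file_path.get("data_source", "")
    else:
        path = file_path
    # Otherwise fall back to the extension of the path itself.
    return os.path.splitext(path)[1].lstrip(".").lower()
```

With this sketch, `resolve_file_ext({"data_source": "tbl", "file_ext": "spark"})` selects the Spark loader even though the path itself has no extension.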
Enhancements to task handling:
Updated the create_sample method to ensure labels are converted to strings before being added to the list. (langtest/tasks/task.py)
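The label change can be illustrated with a small sketch. The function name `normalize_labels` is hypothetical and the real create_sample method does more than this; the point is only that non-string labels (such as the integers or booleans a Spark column might yield) are cast with `str` before being collected.

```python
from typing import Any, Iterable, List


def normalize_labels(raw_labels: Iterable[Any]) -> List[str]:
    """Hypothetical helper: cast every label to str before collecting it."""
    labels: List[str] = []
    for label in raw_labels:
        # str() leaves existing strings unchanged and converts ints, bools, etc.
        labels.append(str(label))
    return labels
```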