Closed johann-petrak closed 3 years ago
Sadly, pandas only complains about what it did not get and does not say what it got, but when I check the values of the variables `columns_needed` and `rename_columns` I see:

```
columns_needed: ['coarse', None, 'fine']
rename_columns: {'coarse': 'coarse_label', None: 'text', 'fine': 'fine_label'}
header: 0
```

So for some reason this passes `None` where probably `"text"` would be needed?
Note that when I add the parameters for the coarse task directly to the `TextClassificationProcessor` constructor and only add the fine task afterwards, `columns_needed` will contain `text` AND `None` in addition to `coarse` and `fine`.
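For illustration, here is a tiny, simplified sketch (my own reconstruction, NOT FARM's actual code) of how a column mapping derived from registered tasks ends up with a `None` key when no task and no constructor argument supplies a `text_column_name`:

```python
# Simplified illustration of how a processor could derive its column
# mapping from registered tasks. If a task is registered without
# text_column_name, the text column ends up as None.
def build_column_maps(tasks, default_text_column=None):
    rename_columns = {}
    for task in tasks:
        # Each task contributes its label column...
        rename_columns[task.get("label_column_name")] = task["name"] + "_label"
        # ...and its text column, which falls back to None when neither
        # the task nor the processor constructor supplied it.
        rename_columns[task.get("text_column_name", default_text_column)] = "text"
    columns_needed = list(rename_columns.keys())
    return columns_needed, rename_columns

tasks = [
    {"name": "coarse", "label_column_name": "coarse"},  # no text_column_name!
    {"name": "fine", "label_column_name": "fine"},
]
cols, renames = build_column_maps(tasks)
print(cols)     # ['coarse', None, 'fine']
print(renames)  # {'coarse': 'coarse_label', None: 'text', 'fine': 'fine_label'}
```

This reproduces exactly the values observed above, which is why I suspect the text column name is simply never propagated to the tasks.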
So, am I doing this wrong, how is this supposed to work?
I know we have a good issue on MTL here: https://github.com/deepset-ai/FARM/issues/724. Unfortunately I don't have access to the linked Colab notebook that should give a good example of MTL in FARM. Maybe you can ask the author to grant access, and maybe also write an MTL example in FARM? :smile:
If you cannot find the information there @tholor could you look into the tasks and MTL setup described here, please?
I did not add to #724 because it seems to be about the actual learning, while my problem concerns just loading the data for it. I was hoping that the author(s) who designed the data loading strategy for multiple tasks could have a look at this. Once the data can be loaded, I will be happy to share the actual MTL example if I get it to run (which I really need to happen).
I remember that when I still had access to the Colab linked in #724, the data loading part was also covered... and the author wanted to create an example script for MTL.
Our "tasks" setup is not explicitly tested by us for MTL; we only used `BertStyleLMProcessor` for MTL preprocessing in one processor. @tholor might be able to help on how to use "tasks".
Is there an issue or document that describes the design for the whole MTL process, including the data management and the actual training/inference part? It is a bit hard to start with just an example and the source code.
Unfortunately, there is no such document in FARM. We did not work much with MTL in FARM, to be honest...
OK, so when trying to add all possible parameters to the `add_task` invocation like this:

```python
mtl_processor.add_task(name="coarse",
                       task_type="classification",
                       label_list=LABEL_LIST_COARSE,
                       metric="acc",
                       text_column_name="text",
                       label_column_name="coarse")
mtl_processor.add_task(name="fine",
                       task_type="classification",
                       label_list=LABEL_LIST_FINE,
                       metric="acc",
                       text_column_name="text",
                       label_column_name="fine")
```

creating the data silo works without exception. This is one of the many cases where we need to update the documentation to 1) include information about defaults and 2) indicate which of the several kwargs are required.
BTW, if the `task_type` parameter is set to an incorrect value (e.g. `"text_classification"`), the following exception occurs instead of an error message informing about the allowed values:
```
Preprocessing Dataset ../data/germeval2019_ALL_cleaned.tsv: 0%| | 0/15459 [00:00<?, ? Dicts/s]
---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/johann/software/anaconda/envs/farm-dev/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py", line 132, in _dataset_from_chunk
    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts, indices=indices)
  File "/data/johann/work-git/FARM-forked/farm/data_handler/processor.py", line 654, in dataset_from_dicts
    label_dict = self.convert_labels(dictionary)
  File "/data/johann/work-git/FARM-forked/farm/data_handler/processor.py", line 699, in convert_labels
    ret[task["label_tensor_name"]] = label_ids
UnboundLocalError: local variable 'label_ids' referenced before assignment
"""

The above exception was the direct cause of the following exception:

UnboundLocalError                         Traceback (most recent call last)
<ipython-input-16-8af0dfbe6ff0> in <module>
      1 BATCH_SIZE = 32
      2
----> 3 data_silo = DataSilo(
      4     processor=mtl_processor,
      5     batch_size=BATCH_SIZE)

/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py in __init__(self, processor, batch_size, eval_batch_size, distributed, automatic_loading, max_multiprocessing_chunksize, max_processes, caching, cache_path)
    111         # In most cases we want to load all data automatically, but in some cases we rather want to do this
    112         # later or load from dicts instead of file (https://github.com/deepset-ai/FARM/issues/85)
--> 113         self._load_data()
    114
    115     @classmethod

/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py in _load_data(self, train_dicts, dev_dicts, test_dicts)
    220             train_file = self.processor.data_dir / self.processor.train_filename
    221             logger.info("Loading train set from: {} ".format(train_file))
--> 222             self.data["train"], self.tensor_names = self._get_dataset(train_file)
    223         else:
    224             logger.info("No train set is being loaded")

/data/johann/work-git/FARM-forked/farm/data_handler/data_silo.py in _get_dataset(self, filename, dicts)
    183             desc += f" (unknown)"
    184         with tqdm(total=len(dicts), unit=' Dicts', desc=desc) as pbar:
--> 185             for dataset, tensor_names, problematic_samples in results:
    186                 datasets.append(dataset)
    187                 # update progress bar (last step can have less dicts than actual chunk_size)

~/software/anaconda/envs/farm-dev/lib/python3.8/multiprocessing/pool.py in next(self, timeout)
    866         if success:
    867             return value
--> 868         raise value
    869
    870 __next__ = next  # XXX

UnboundLocalError: local variable 'label_ids' referenced before assignment
```
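The failure pattern behind this traceback can be reproduced in a few lines. This is a sketch of the general Python pitfall, NOT FARM's actual `convert_labels` implementation: a local variable is only assigned inside branches for recognized task types, so an unrecognized `task_type` falls through every branch and the final reference raises `UnboundLocalError`:

```python
# Minimal reproduction of the pattern: label_ids is only bound inside
# branches for known task types, so an unknown task_type assigns nothing
# and the return statement references an unbound local.
def convert_labels(task_type, label, label_list):
    if task_type == "classification":
        label_ids = [label_list.index(label)]
    elif task_type == "regression":
        label_ids = [float(label)]
    # No else branch: an unknown task_type (e.g. "text_classification")
    # reaches the next line without label_ids ever being assigned.
    return label_ids

print(convert_labels("classification", "OTHER", ["OTHER", "OFFENSE"]))  # [0]

try:
    convert_labels("text_classification", "OTHER", ["OTHER", "OFFENSE"])
except UnboundLocalError as e:
    print(type(e).__name__)  # UnboundLocalError
```

An explicit `else` raising a descriptive error would surface the typo immediately instead of deep inside the multiprocessing pool.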
OK, turns out this is not a bug.
Thanks for solving this yourself, @johann-petrak :smile: Which settings solved the issue then?
My first attempt was to specify the text column in the constructor and then add a task using

```python
mtl_processor.add_task(name="coarse", label_list=LABEL_LIST_COARSE, metric="acc", label_column_name="coarse")
```

(this was suggested somewhere in some release notes, I think).
The second attempt, which worked, was to instead include ALL possible parameters for the `add_task` method:

```python
mtl_processor.add_task(name="coarse",
                       task_type="classification",
                       label_list=LABEL_LIST_COARSE,
                       metric="acc",
                       text_column_name="text",
                       label_column_name="coarse")
mtl_processor.add_task(name="fine",
                       task_type="classification",
                       label_list=LABEL_LIST_FINE,
                       metric="acc",
                       text_column_name="text",
                       label_column_name="fine")
```
I did not systematically try to find out which of the added parameters was/were actually the crucial one(s), but I assume `task_type` should not be missing.
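As an aside, the cryptic `UnboundLocalError` above could be avoided with a simple up-front check along these lines (a hypothetical helper of my own, not part of FARM):

```python
# Hypothetical validation helper: checking task_type when a task is
# registered would replace the late UnboundLocalError with a clear
# message listing the allowed values. The set below is an assumption
# for illustration, not FARM's authoritative list.
ALLOWED_TASK_TYPES = {"classification", "regression", "ner"}

def check_task_type(task_type):
    if task_type not in ALLOWED_TASK_TYPES:
        raise ValueError(
            f"Unknown task_type '{task_type}'; "
            f"allowed values are {sorted(ALLOWED_TASK_TYPES)}"
        )

check_task_type("classification")         # OK, passes silently
# check_task_type("text_classification")  # would raise a descriptive ValueError
```

Calling such a check inside `add_task` would have turned my typo into a one-line fix.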
Nice, thanks! Hopefully other people will find this info as well.
Once I get the MTL example to run through, I will add it to the examples dir, which I hope will help.
I am trying to get started with using FARM for multi task learning (for now, just two simple classification heads, but eventually I want to implement my own head layers for one of the two).
I am using the latest git master branch at bec0a9a for this.
Sadly, I could not find any example or instructions for how to do this, so I tried the following code, which feels like the most logical approach: first define the things common to both heads (e.g. the text column) in the `TextClassificationProcessor`, then add the specific fields for each task:
This throws the following exception: