deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

Recommended way to evaluate a loaded model on a file? #775

Closed: johann-petrak closed this issue 3 years ago

johann-petrak commented 3 years ago

I am training a model in one Python file/process and saving both the processor and the model to the same directory:

# initialize processor for the tasks
processor.save("mymodel")
# setup and carry out training on the training file defined for the processor earlier (test file is None)
model.save("mymodel")

In a different program, I want to load that model and use it for inference and evaluation. For this, I restore the model into an Inferencer:

inferencer = Inferencer.load("mymodel", ....)
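For reference, a fuller load call might look like this (a sketch; task_type, batch_size, and gpu are parameters I believe Inferencer.load accepts, but the exact values here are assumptions about my setup):

from farm.infer import Inferencer

inferencer = Inferencer.load(
    "mymodel",                          # directory containing the saved processor and model
    task_type="text_classification",    # assumed task; adjust to the tasks defined in the processor
    batch_size=32,
    gpu=True,
)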

Now I would also like to use that Inferencer for evaluation on some data file.

I am doing:

processor = ...  # get a Processor from the Inferencer and modify it, or make a new one (see the sketch after this snippet)

# create the silo
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE)
evaluator = Evaluator(
    data_loader=data_silo.get_data_loader("test"),
    tasks=processor.tasks,
    device=device)
result = evaluator.eval(inferencer.model, return_preds_and_labels=True)
evaluator.log_results(result, "Test", steps=len(data_silo.get_data_loader("test")))
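Regarding the "get a Processor from the Inferencer" step, one option might be to reuse the processor that was loaded along with the Inferencer and point it at the evaluation file (a sketch; the processor attribute and the data_dir/test_filename fields are assumptions about the loaded processor):

from pathlib import Path

# reuse the processor that was saved alongside the model
processor = inferencer.processor
processor.data_dir = Path("data")       # directory holding the evaluation file (assumed layout)
processor.test_filename = "eval.tsv"    # file the "test" data loader should read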

Is this the recommended way to do it, or is there a better way?

When I run this, I get the following message very often:

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

What am I doing wrong to cause this?

Timoeller commented 3 years ago

I think it's fine to load the model by using the Inferencer. You could also just use the AdaptiveModel.load function directly.
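If you only need the model itself without the Inferencer wrapper, that might look like this (a minimal sketch, assuming AdaptiveModel.load takes the save directory and a device):

import torch
from farm.modeling.adaptive_model import AdaptiveModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AdaptiveModel.load("mymodel", device=device)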

Avoid using tokenizers before the fork if possible

This warning happens when fast tokenizers using Rust multithreading and FARM-based Python multiprocessing run at the same time. If the code runs, I would ignore this warning. Otherwise you could set max_processes=1 in the DataSilo constructor to disable FARM multiprocessing.
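Concretely, the two options could look like this (a sketch, reusing processor and BATCH_SIZE from your snippet above; max_processes comes from the DataSilo constructor, the environment variable from the tokenizers warning itself):

import os
from farm.data_handler.data_silo import DataSilo

# Option 1: silence Rust tokenizer parallelism before tokenizers is used
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Option 2: disable FARM's Python multiprocessing in the DataSilo
data_silo = DataSilo(
    processor=processor,
    batch_size=BATCH_SIZE,
    max_processes=1)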

johann-petrak commented 3 years ago

OK, so if in doubt, switch off FARM MP rather than Rust MP? (I am a bit worried about deadlocks when eventually using this in production.)

Timoeller commented 3 years ago

Exactly, when in doubt, switch off FARM MP.

Rust MT is an incredible speed boost on the tokenization side, especially for large texts. FARM MP is not really needed any more with the fast tokenizers. We haven't seen any deadlocks with the combination yet; that's why we kept both methods turned on. If you encounter problems, though, we can think about disabling FARM MP by default.

Timoeller commented 3 years ago

Seems resolved, closing now. Feel free to reopen.