etalab-ia / piaf-ml

PIAF v2.0 repo for ML development. The main purpose of this repo is to automatically find the best configuration for the QA pipeline of a partner organisation.
MIT License

PIAF ML

This project is conducted by the Lab IA at Etalab.
The aim of the Lab IA is to help the French administration modernize its services through the use of modern AI techniques.
Other Lab IA projects can be found on the main GitHub repo. In particular, the repository for the PIAF annotator can be found here.

-- Project Status: [Active]

PIAF

PIAF is an open-source project that aims to provide the community with an easily deployable French Question Answering solution. The first use case of the PIAF project will be

The objective of this repository is to provide tools for the following tasks:

Methods Used

The code for PIAF Agent and PIAF bot is hosted in the following repositories:

The code for the Haystack library can be found in the Deepset.ai repository.
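
For orientation, the kind of retriever/reader pipeline evaluated here looks roughly like the sketch below. It assumes a recent Haystack 1.x API, a local Elasticsearch instance, and an example French QA model; it is not the exact code used in this repo.

    # Rough sketch of a retriever/reader QA pipeline with Haystack
    # (assumes a recent Haystack 1.x API and a local Elasticsearch instance;
    # the model name is an example French QA model, adjust to your needs).
    from haystack.document_stores import ElasticsearchDocumentStore
    from haystack.nodes import BM25Retriever, FARMReader
    from haystack.pipelines import ExtractiveQAPipeline

    document_store = ElasticsearchDocumentStore(host="localhost", index="document")
    retriever = BM25Retriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="etalab-ia/camembert-base-squadFR-fquad-piaf")
    pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

    prediction = pipeline.run(
        query="Comment renouveler un passeport ?",
        params={"Retriever": {"top_k": 20}, "Reader": {"top_k": 3}},
    )
    print(prediction["answers"])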

Prepare data for the knowledge base

One of the goals of this repository is to generate the JSON files that compose the knowledge base.
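
For reference, here is a minimal, purely illustrative sketch of what such a SQuAD-formatted file looks like (titles, contexts, questions and the output path are made-up examples, not files shipped with the repo):

    # Purely illustrative sketch of a SQuAD-formatted knowledge base file;
    # titles, contexts, questions and the output path are made-up examples.
    import json

    knowledge_base = {
        "version": "v2.0",
        "data": [
            {
                "title": "Renouvellement du passeport",
                "paragraphs": [
                    {
                        "context": "Le renouvellement d'un passeport se fait en mairie...",
                        "qas": [
                            {
                                "id": "0001",
                                "question": "Où renouveler son passeport ?",
                                "answers": [{"text": "en mairie", "answer_start": 41}]
                            }
                        ]
                    }
                ]
            }
        ]
    }

    with open("./data/my_knowledge_base.json", "w", encoding="utf-8") as f:
        json.dump(knowledge_base, f, ensure_ascii=False, indent=2)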

Evaluate the performance of the PIAF stack

For now, the main use of this repo is evaluation. The goal of the evaluation is to assess the performance of a PIAF configuration on a test_dataset for which the fiches to be retrieved from the knowledge_base are known.
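
As a rough illustration of that idea (the function and variable names below are hypothetical, not this repo's API), the core retrieval metric amounts to checking whether the expected fiche appears among the top-k retrieved documents:

    # Hypothetical sketch of the core retrieval metric: for each test question
    # the expected fiche is known, so we check whether it appears in the top-k results.
    def retriever_accuracy(test_dataset, retrieve_top_k):
        hits = 0
        for question, expected_fiche in test_dataset:
            retrieved = retrieve_top_k(question)  # returns a list of fiche identifiers
            hits += expected_fiche in retrieved
        return hits / len(test_dataset)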

Needs of this project [TODO]

Performance evaluation

The procedure for running the evaluation script is the following:

  1. Prepare your knowledge base in the form of a JSON file formatted with the SQuAD format (a minimal example of this layout is sketched in the "Prepare data for the knowledge base" section above). More information regarding the format of the file can be found here
  2. Define a set of experiment parameters in src/evaluation/config/retriever_reader_eval_squad.py:
    parameters = {
        "k_retriever": [20, 30],
        "k_title_retriever": [10],  # must be present, but only used when retriever_type == title_bm25
        "k_reader_per_candidate": [5],
        "k_reader_total": [3],
        "retriever_type": ["title"],  # can be bm25, sbert, dpr, title or title_bm25
        "squad_dataset": ["./data/evaluation-datasets/tiny.json"],
        "filter_level": [None],
        "preprocessing": [True],
        "boosting": [1],  # defaults to 1
        "split_by": ["word"],  # can be "word", "sentence", or "passage"
        "split_length": [1000],
        "experiment_name": ["dev"]
    }
  3. Run:
    python -m src.evaluation.retriever_reader.retriever_reader_eval_squad
  4. If Elasticsearch throws errors, try re-running the script: the Docker image sometimes takes time to initialize.
  5. Note that the results will be saved in results/ as a CSV file. MLflow will also create a record in the mlruns/ folder (see the sketch below).
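
Once a run has finished, the CSV in results/ can be loaded like any other tabular file, e.g. with pandas (a sketch; the actual file name depends on the experiment), and the mlruns/ folder can be browsed with the MLflow UI:

    # Sketch: load the evaluation results for analysis.
    # The exact CSV file name depends on the experiment; adjust the path accordingly.
    import pandas as pd

    results = pd.read_csv("results/<your_results_file>.csv")
    print(results.head())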

Project folder structure

/piaf-ml/
├── clients # Client specific deployment code
├── logs # Here we will put our logs when we get to it :)
├── notebooks # Notebooks with reports on experimentations
├── results # Folder where all the results generated from evaluation scripts are stored
├── src
│   ├── data # Scripts related to data generation
│   ├── evaluation # Scripts related to pipeline performance evaluation
│   │   ├── config # Configuration files
│   │   ├── results_analysis
│   │   ├── retriever # Scripts for evaluating the retriever only
│   │   ├── retriever_reader # Scripts for evaluating the full pipeline
│   │   └── utils # Some utils dedicated to performance evaluation
│   └── models # Scripts related to training models
└── test # Unit tests

Set environment variables

Certain capabilities of this codebase (e.g., using a remote MLflow endpoint) need a set of environment variables to work properly. We use python-dotenv to read the contents of a .env file located at the root of the project. This file is not tracked by git for security reasons, so you need to create it yourself at the root of your local copy of the project, i.e. piaf-ml/.env.

A template which describes the different environment variables is provided in .env.template. Copy it to .env and edit it to your needs.
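
For reference, this is roughly how the variables end up being available to the code; the variable name below is only an example, the actual names are listed in .env.template:

    # Sketch: python-dotenv loads piaf-ml/.env so that the variables become
    # available through os.environ. The variable name below is only an example;
    # the actual names are listed in .env.template.
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads the .env file at the project root
    tracking_uri = os.getenv("MLFLOW_TRACKING_URI")  # example variable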

MLflow-Specific Configuration

To be able to upload artifacts to MLflow, you need to be able to SSH into the designated artifact server with an SSH key. You also need a local SSH config that specifies an identity file for the artifact-server domain, such as:

Host your.mlflow.remotehost.address
    User localhostusername
    IdentityFile ~/.ssh/your_private_key

This is required when using sftp as your artifact endpoint protocol.
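
For context, an sftp artifact endpoint in MLflow is simply an experiment whose artifact location uses the sftp:// scheme, which is why the host must be reachable through the SSH configuration above. A sketch (tracking URI, host, user and path are placeholders):

    # Sketch: registering an MLflow experiment whose artifacts live on an sftp endpoint.
    # Tracking URI, host, user and path are placeholders; reuse the host and user
    # from your SSH config above.
    import mlflow

    mlflow.set_tracking_uri("http://your.mlflow.remotehost.address:5000")
    mlflow.create_experiment(
        "dev",
        artifact_location="sftp://localhostusername@your.mlflow.remotehost.address/mlflow/artifacts",
    )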

How to deploy PIAF

If you already published the docker images to https://hub.docker.com/

How to publish the elasticsearch docker image

This step is the most difficult: starting from the latest version of the service-public.fr XML files, we publish a Docker image of an Elasticsearch container into which all the service-public texts have already been injected.

This can be done on your laptop (preferably not on the production server as it pollutes the )

How to publish the piafagent docker image

Follow the README.md in the PiafAgent repo.

Contributing Lab IA Members

Team Contacts:

Past Members:

How to contribute to this project

We love your input! We want to make contributing to this project as easy and transparent as possible: see our contribution rules.

Contact