This project is conducted by the Lab IA at Etalab.
The aim of the Lab IA is to help the French administration modernize its services through the use of modern AI techniques.
Other Lab IA projects can be found on the main GitHub repo. In particular, the repository for the PIAF annotator can be found here.
PIAF is an open-source project aiming to provide the community with an easily deployable French Question Answering solution. The first use case of the PIAF project is question answering over the service-public.fr corpus.
The objective of this repository is to provide tools for the tasks described below.
The PIAF solution uses the following architecture: [architecture diagram to be added here]
The code for the PIAF Agent and the PIAF bot is hosted in the following repositories:
The code for the Haystack library can be found on the Deepset.ai repository.
One of the goals of this repository is to generate the JSON files that compose the knowledge base.
For now, the main use of this repo is evaluation. The goal of the evaluation is to assess the performance of the PIAF configuration on a `test_dataset` for which the fiches to be retrieved from the `knowledge_base` are known.
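These evaluation datasets follow the SQuAD format (see the `squad_dataset` parameter below). As a rough sketch, a minimal file looks like this (standard SQuAD v1.1 fields; the values are purely illustrative):

```python
# Minimal SQuAD-style dataset, for illustration only.
tiny = {
    "version": "1.1",
    "data": [
        {
            "title": "Title of a fiche",
            "paragraphs": [
                {
                    "context": "Full text of the fiche paragraph...",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "A question answerable from the context?",
                            "answers": [
                                {"text": "an answer span", "answer_start": 13}
                            ],
                        }
                    ],
                }
            ],
        }
    ],
}
```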
- frontend developers
- data exploration/descriptive statistics
- data processing/cleaning
- statistical modeling
- writeup/reporting
Install the system dependencies:

```
sudo apt install gcc make python3-dev
```
Using pip:

```
pip install -r requirements.txt
```

On Windows:

```
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html
```

Using conda:

```
conda env create --name envname --file=environment.yml
```
The procedure to start the evaluation script is the following:
```python
parameters = {
    "k_retriever": [20, 30],
    "k_title_retriever": [10],  # must be present, but only used when retriever_type == title_bm25
    "k_reader_per_candidate": [5],
    "k_reader_total": [3],
    "retriever_type": ["title"],  # can be bm25, sbert, dpr, title or title_bm25
    "squad_dataset": ["./data/evaluation-datasets/tiny.json"],
    "filter_level": [None],
    "preprocessing": [True],
    "boosting": [1],  # defaults to 1
    "split_by": ["word"],  # can be "word", "sentence", or "passage"
    "split_length": [1000],
    "experiment_name": ["dev"],
}
```
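Note that every value is a list: this suggests the script evaluates one configuration per combination of the supplied values (a grid search). A minimal sketch of how such a grid would expand, assuming that behaviour:

```python
import itertools

# Illustrative only: expand the list-valued parameters into one
# configuration per combination (k_retriever = [20, 30] above would
# yield two runs, all else being equal).
keys = list(parameters)
configs = [dict(zip(keys, combo)) for combo in itertools.product(*parameters.values())]
print(len(configs), "experiment configurations")
```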
Then run:

```
python -m src.evaluation.retriever_reader.retriever_reader_eval_squad
```
The results will be stored in the `results/` folder in CSV form. MLflow will also create a record of the run in the `mlruns/` folder.
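To inspect the results afterwards, a quick sketch with pandas (the file name is hypothetical; use whichever CSV your run produced in `results/`):

```python
import pandas as pd

# "results.csv" is a placeholder: check the results/ folder for the
# actual file written by your evaluation run.
df = pd.read_csv("results/results.csv")
print(df.head())
```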
```
/piaf-ml/
├── clients                    # Client-specific deployment code
├── logs                       # Here we will put our logs when we get to it :)
├── notebooks                  # Notebooks with reports on experimentations
├── results                    # Folder where all the results generated by the evaluation scripts are stored
├── src
│   ├── data                   # Scripts related to data generation
│   ├── evaluation             # Scripts related to pipeline performance evaluation
│   │   ├── config             # Configuration files
│   │   ├── results_analysis   # Scripts for analysing evaluation results
│   │   ├── retriever          # Scripts for evaluating the retriever only
│   │   ├── retriever_reader   # Scripts for evaluating the full pipeline
│   │   └── utils              # Some utils dedicated to performance evaluation
│   └── models                 # Scripts related to training models
└── test                       # Unit tests
```
Certain capabilities of this codebase (e.g. using a remote MLflow endpoint) need a set of environment variables to work properly. We use python-dotenv to read the contents of a `.env` file that sits at the root of the project. This file is not tracked by git for security reasons. Still, in order for everything to work properly, you need to create such a file in your local copy of the code, again at the root of the project, i.e. `piaf-ml/.env`.

A template describing the different environment variables is provided in `.env.template`. Copy it to `.env` and edit it to your needs.
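For reference, this is how python-dotenv typically loads such a file at startup (a minimal sketch; the variable name is illustrative, not one this project necessarily defines):

```python
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file at the project root into the environment

# MLFLOW_TRACKING_URI is an illustrative name; see .env.template for the
# variables this project actually expects.
tracking_uri = os.getenv("MLFLOW_TRACKING_URI")
```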
To be able to upload artifacts to MLflow, you need to be able to `ssh` into the designated artifact server via an `ssh` key. You also need a local `ssh` config that specifies an identity file for the artifact-server domain, such as:

```
Host your.mlflow.remotehost.adress
    User localhostusername
    IdentityFile ~/.ssh/your_private_key
```

This requirement applies when using `sftp` as your artifact endpoint protocol.
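As an illustration, MLflow resolves an `sftp://` artifact location through your ssh configuration. A minimal sketch, with a hypothetical experiment name and path (the host must match a `Host` entry in your ssh config; MLflow's SFTP artifact store also requires the pysftp package):

```python
import mlflow

# Hypothetical values: adjust the host and remote path to your setup.
mlflow.create_experiment(
    "dev",
    artifact_location="sftp://your.mlflow.remotehost.adress/path/to/artifacts",
)
```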
```
docker-compose up
```
✅ This step is the most difficult: starting from the latest version of the service-public.fr XML files, we will publish a Docker image of an Elasticsearch container into which all the service-public texts have already been injected.
This can be done on your laptop (preferably not on the production server as it pollutes the )
```
CONTRIBUTING.md
Dockerfile
Dockerfile-GPU
LICENSE
MANIFEST.in
README.md
annotation_tool
data
└── v14            # here you should now see your JSONs
docker-compose.yml
docs
haystack
models
requirements.txt
rest_api
run_docker_gpu.sh
setup.py
test
tutorials
```
```
docker-compose up
```

Delete any previous index (if you forget to do this, you will add your documents to the existing ones, making a BIG database lol):

```
curl -XDELETE http://localhost:9200/document_elasticsearch
```

Follow the injection logs:

```
docker container logs -f haystack_haystack-api_1
```

(note that the container name can change; better verify it by typing `docker container ls`)

```
pip install ipython
ipython
```
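Once the injection is done, a minimal sketch to check how many documents the index now contains (the index name comes from the curl command above):

```python
import requests

# Count the documents injected into the local Elasticsearch index.
resp = requests.get("http://localhost:9200/document_elasticsearch/_count")
print(resp.json()["count"])
```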
```
docker commit 829ed24c0d1b guillim/spf_particulier:v15
```

but don't forget to replace `829ed24c0d1b` with the ID of the Elasticsearch container, which you can get by typing `docker container ls`. Then:

```
docker push guillim/spf_particulier:v15
```
✅ Follow the README.md on the PiafAgent repo.
Team Contacts:
Past Members:
We love your input! We want to make contributing to this project as easy and transparent as possible: see our contribution rules.