etalab-ia / piaf-ml

PIAF v2.0 repo for ML development. Main purpose of this repo is to automatically find the best configuration for a QA pipeline of a partner organisation.
MIT License

CRPA dataset enriched with Legifrance #44

Closed Rob192 closed 3 years ago

Rob192 commented 3 years ago

Work on the dataset provided by Cindy Kus to experiment on CRPA

This means :

guillim commented 3 years ago

Here is the file we need to work on

CRPA-Chatbot-base-QR.xlsx

guillim commented 3 years ago

Hey @Rob192 @psorianom, I am keen (and excited) to set up a new instance of Haystack with a custom Piaf_agent for CRPA this week. To that end, it would really help me if you could help create the SQuAD-format version of the Excel file mentioned above! What do you think?

Rob192 commented 3 years ago

Hey @guillim, maybe it's easier if we discuss our TODO in a dedicated meeting ;) On my side, I will be working on the report on Haystack's performance on the DILA dataset. That's all the time I can dedicate to PIAF this week.
For generating the SQuAD dataset, I was thinking that you would need to scrape Legifrance, but maybe @psorianom knows other ways to extract data from Legifrance? For the Etalab guides, you could use the data here: https://guides.etalab.gouv.fr/pdf.html

guillim commented 3 years ago

Yes. As a first step this week, I was thinking of integrating only the Excel sheet, without going further into all the scraping. We can definitely schedule a call this afternoon to discuss our tasks for the week!

guillim commented 3 years ago

Just re-opening this issue, since we haven't yet created the task mentioned by Robin in the description.

guillim commented 3 years ago

Question @Rob192 : about guide.etalab

Rob192 commented 3 years ago

As far as I recall from the discussion with Perica, most of the questions they receive can be answered with these two knowledge bases (CRPA + guide). It makes sense to have a single point of contact with all the information for the end user.

Normally our retriever should be able to sort the information based on the question. If not, we could use filters. I am not aware of a Haystack feature that allows working with two ES instances or two indexes. I think the easiest option is to use the filter functionality if needed.
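To make the filter idea concrete, here is a minimal, library-agnostic sketch in Python. It is not Haystack's API: it only assumes that every document in the single index carries a `source` metadata field (`"crpa"` or `"guide"`). That field name, the placeholder texts and the helper `filter_by_source` are all hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical documents: the "source" metadata field and the placeholder
# texts are assumptions made for this sketch, not data from the Excel file.
DOCS: List[Dict] = [
    {"text": "CRPA passage A", "meta": {"source": "crpa"}},
    {"text": "CRPA passage B", "meta": {"source": "crpa"}},
    {"text": "Guide passage C", "meta": {"source": "guide"}},
]


def filter_by_source(docs: Iterable[Dict], allowed: Iterable[str]) -> List[Dict]:
    """Keep only the documents whose meta["source"] is in `allowed`."""
    allowed_set = set(allowed)
    return [d for d in docs if d.get("meta", {}).get("source") in allowed_set]


guide_only = filter_by_source(DOCS, ["guide"])
both = filter_by_source(DOCS, ["crpa", "guide"])
```

With a tag like this, both corpora can live in one index and the filter is applied only when the retriever mixes them badly.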

psorianom commented 3 years ago

We are talking only about guides.etalab/juridique, right? I believe the guides summarize the relevant parts of the CRPA, so it may also be interesting to mix them and see how the retriever finds answers from the CRPA and from the guide, and how they complement each other (or not) :thinking:

guillim commented 3 years ago

> As far as I recall from the discussion with Perica, most of the questions they receive can be answered with these two knowledge bases (CRPA + guide). It makes sense to have a single point of contact with all the information for the end user.

@psorianom makes a good point: is it only the juridique page?

> Normally our retriever should be able to sort the information based on the question. If not, we could use filters. I am not aware of a Haystack feature that allows working with two ES instances or two indexes. I think the easiest option is to use the filter functionality if needed.

From my experience with Haystack, we simply need two ES nodes and then join the documents afterwards. It shouldn't be that difficult. But (following the previous remark) if it's only one guide, I believe everything should fit in a single index.
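As a rough illustration of "join the documents afterwards", here is a plain-Python sketch that merges two ranked result lists. It does not use Haystack's own join mechanism: the function `join_results` and the `id` / `score` keys are hypothetical. It also assumes the scores from the two indexes are comparable, which is not guaranteed (BM25 scores depend on each index's statistics).

```python
from typing import Dict, List


def join_results(*result_lists: List[Dict], top_k: int = 5) -> List[Dict]:
    """Merge ranked lists: keep the best score per document id, sort by score."""
    best: Dict[str, Dict] = {}
    for results in result_lists:
        for doc in results:
            current = best.get(doc["id"])
            if current is None or doc["score"] > current["score"]:
                best[doc["id"]] = doc
    ranked = sorted(best.values(), key=lambda d: d["score"], reverse=True)
    return ranked[:top_k]


# Hypothetical hits from the two indexes (ids and scores are made up).
crpa_hits = [{"id": "crpa-1", "score": 0.4}, {"id": "crpa-2", "score": 0.9}]
guide_hits = [{"id": "guide-1", "score": 0.7}, {"id": "crpa-1", "score": 0.6}]

merged = join_results(crpa_hits, guide_hits, top_k=3)
```

If the scores turn out not to be comparable, a rank-based merge would be a safer variant of the same idea.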

guillim commented 3 years ago

I have checked in detail what integrating "Legifrance" and "Guide Etalab" involves. From what I understand, the work that Cindy and Perica did was:

  1. Select all the questions they received over the last few months
  2. Find the answer in Legifrance, in the Etalab guide, or in a code article
  3. Copy-paste it. If it was not appropriate => write an answer.

For instance:


| Question | Réponse | Article Code | Lien Légifrance | Lien guide | remarques | public cible |
| -- | -- | -- | -- | -- | -- | -- |
| Qu'est-ce que l'open data ? | L'open data public consiste à assurer la large mise à disposition à tous de ces données, en accès libre et gratuit, sous un format numérique facilement réutilisable. |   |   | https://guides.etalab.gouv.fr/juridique/opendata/ |   | usager et adm |

If we look up this topic in the Etalab guide, here is what it says:

Dans le cadre de ses missions de service public, l’administration produit et reçoit des documents administratifs. Ces documents administratifs peuvent contenir des informations publiques, qui peuvent elles-mêmes être représentées sous forme de données publiques. L'open data public consiste à assurer la large mise à disposition à tous de ces données, en accès libre et gratuit, sous un format numérique facilement réutilisable.

=> the last sentence is the answer.
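Since the answer is a verbatim span of the guide text, the SQuAD `answer_start` offset can be computed with a plain string search. Here is a minimal sketch, assuming the SQuAD v2.0 JSON layout. The helper `to_squad_entry`, the title `"opendata"` and the id `"crpa-0001"` are hypothetical, and the context is only an excerpt of the paragraph above.

```python
import json

QUESTION = "Qu'est-ce que l'open data ?"
ANSWER = (
    "L'open data public consiste à assurer la large mise à disposition à tous "
    "de ces données, en accès libre et gratuit, sous un format numérique "
    "facilement réutilisable."
)
# Excerpt of the guide paragraph: the sentence before the answer, then the answer.
CONTEXT = (
    "Ces documents administratifs peuvent contenir des informations publiques, "
    "qui peuvent elles-mêmes être représentées sous forme de données publiques. "
    + ANSWER
)


def to_squad_entry(question: str, answer: str, context: str, title: str, qid: str) -> dict:
    """Build one SQuAD-style article; fail if the answer is not a verbatim span."""
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer not found verbatim in context")
    qa = {
        "id": qid,  # hypothetical id scheme
        "question": question,
        "answers": [{"text": answer, "answer_start": start}],
        "is_impossible": False,
    }
    return {"title": title, "paragraphs": [{"context": context, "qas": [qa]}]}


entry = to_squad_entry(QUESTION, ANSWER, CONTEXT, title="opendata", qid="crpa-0001")
dataset = {"version": "v2.0", "data": [entry]}
squad_json = json.dumps(dataset, ensure_ascii=False, indent=2)
```

Rows whose answer was rewritten by hand rather than copy-pasted would raise the `ValueError` and need a different treatment.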

So I believe we can close this issue, since integrating the Excel file is exactly what was expected, from what I understand. What do you think?

guillim commented 3 years ago

✉️ Email sent to @C to get a better understanding

guillim commented 3 years ago

Meeting scheduled with @C to clarify this topic.

guillim commented 3 years ago

Conclusion after the call: only the Excel file will be used as the knowledge base.

Reason: Legifrance needs interpretation and enrichment with other corpora (it is impossible to understand for people without legal training).