Closed by Rob192 3 years ago
Here is the file we need to work on
Hey @Rob192 @psorianom, I am very willing (and excited) to set up a new instance of Haystack with a custom Piaf_agent for CRPA this week. To that end, it would really help me if you could help create the SQuAD-format version of the Excel file mentioned above! What do you think?
Hey @guillim,
Maybe it's easier if we discuss our TODO in a dedicated meeting ;) On my side, I will be working on the report on the performance of Haystack on the DILA dataset. That's all the time I can dedicate to PIAF this week.
To generate the SQuAD dataset, I was thinking you would need to scrape Légifrance, but maybe @psorianom knows other ways to extract data from Légifrance?
For Guides Etalab, you could use the data here: https://guides.etalab.gouv.fr/pdf.html
Yes. As a first step this week, I was thinking of integrating only the Excel sheet, without going further with all the scraping. We can definitely schedule a call this afternoon to discuss our tasks for the week!
Just re-opening this issue, since we haven't yet created the task mentioned by Robin in the description.
Question @Rob192 : about guide.etalab
From what I recall of the discussion with Perica, most of the questions they are asked can be answered with these two knowledge bases (CRPA + guide). It makes sense to have a single point of contact with all the information for the end user.
Normally our retriever should be able to sort the information based on the question. If not, we could use filters. I am not aware of a Haystack functionality that allows working with two ES instances or two indexes. I think the easiest is to use the filter functionality if needed.
We are talking only about guides.etalab/juridique, right? I believe the guides summarize the relevant parts of the CRPA, so it may also be interesting to mix them and see how the retriever finds answers from the CRPA and from the guide, and how they complement each other (or not) :thinking:
> From what I recall of the discussion with Perica, most of the questions they are asked can be answered with these two knowledge bases (CRPA + guide). It makes sense to have a single point of contact with all the information for the end user.

@psorianom makes a good point: is it only the juridique page?

> Normally our retriever should be able to sort the information based on the question. If not, we could use filters. I am not aware of a Haystack functionality that allows working with two ES instances or two indexes. I think the easiest is to use the filter functionality if needed.

From my experience with Haystack, we simply need two ES nodes and to join the documents afterwards. It shouldn't be that difficult. But (following the previous remark) if it's only one guide, I believe everything should fit.
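To make the "retrieve from two indexes, then join" idea concrete, here is a rough sketch in plain Python. It is not Haystack's API: the document shape (`id`, `source`, `score`), the function name `join_documents`, and the sample hits are all made up for illustration, and it assumes the scores coming from both indexes are comparable, which may not hold in practice.

```python
def join_documents(*result_lists, top_k=5, allowed_sources=None):
    """Merge ranked result lists from several indexes into one ranking.

    Each document is a dict with hypothetical keys "id", "source" and "score".
    If a document appears in several lists, only its best-scoring copy is kept.
    """
    best = {}
    for results in result_lists:
        for doc in results:
            # Optional filter: keep only documents from the requested sources.
            if allowed_sources is not None and doc["source"] not in allowed_sources:
                continue
            current = best.get(doc["id"])
            if current is None or doc["score"] > current["score"]:
                best[doc["id"]] = doc
    # Highest score first, truncated to top_k.
    return sorted(best.values(), key=lambda d: d["score"], reverse=True)[:top_k]


# Made-up retriever outputs, one list per index (illustrative values only).
crpa_hits = [
    {"id": "crpa-1", "source": "crpa", "score": 0.9},
    {"id": "crpa-2", "source": "crpa", "score": 0.4},
]
guide_hits = [
    {"id": "guide-1", "source": "guide", "score": 0.7},
]

merged = join_documents(crpa_hits, guide_hits, top_k=2)
guide_only = join_documents(crpa_hits, guide_hits, allowed_sources={"guide"})
print([d["id"] for d in merged])
print([d["id"] for d in guide_only])
```

The `allowed_sources` argument plays the role of the filter discussed above: with everything in a single index, the same effect would come from filtering on a metadata field rather than joining.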
I have checked in detail what integrating "legifrance" and "Guide Etalab" is about. From what I understand, the work that Cindy and Perica did is
| Question | Réponse | Article Code | Lien Légifrance | Lien guide | remarques | public cible |
| -- | -- | -- | -- | -- | -- | -- |
| Qu'est-ce que l'open data ? | L'open data public consiste à assurer la large mise à disposition à tous de ces données, en accès libre et gratuit, sous un format numérique facilement réutilisable. | | | https://guides.etalab.gouv.fr/juridique/opendata/ | | usager et adm |
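As a starting point for the SQuAD-format conversion, here is a minimal sketch using only the standard library. It assumes each sheet row has already been loaded as a dict keyed by the column names above. Since the sheet has no separate context passage, the sketch uses the answer text itself as the context (so `answer_start` is always 0); that is an assumption, not necessarily what we want in the end. The function name `rows_to_squad` and the hashed ids are also just illustrative choices.

```python
import hashlib
import json


def rows_to_squad(rows, title="CRPA", version="v2.0"):
    """Convert question/answer rows into a SQuAD-style dictionary.

    Assumption: the answer doubles as the context, so answer_start is 0.
    """
    paragraphs = []
    for row in rows:
        question = (row.get("Question") or "").strip()
        answer = (row.get("Réponse") or "").strip()
        if not question or not answer:
            continue  # skip rows missing a question or an answer
        context = answer
        paragraphs.append({
            "context": context,
            "qas": [{
                # Deterministic id derived from the question text.
                "id": hashlib.md5(question.encode("utf-8")).hexdigest(),
                "question": question,
                "is_impossible": False,
                "answers": [{
                    "text": answer,
                    "answer_start": context.find(answer),
                }],
            }],
        })
    return {"version": version, "data": [{"title": title, "paragraphs": paragraphs}]}


# The single example row from the table above.
rows = [{
    "Question": "Qu'est-ce que l'open data ?",
    "Réponse": (
        "L'open data public consiste à assurer la large mise à disposition "
        "à tous de ces données, en accès libre et gratuit, sous un format "
        "numérique facilement réutilisable."
    ),
    "Lien guide": "https://guides.etalab.gouv.fr/juridique/opendata/",
}]

squad = rows_to_squad(rows)
print(json.dumps(squad, ensure_ascii=False, indent=2))
```

If we later attach the CRPA article or the guide page as the context, only the `context` line needs to change, and `answer_start` would then report where the answer occurs in that text (or -1 if it is not found verbatim).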
Work on the dataset provided by Cindy Kus to experiment on CRPA
This means: