bukosabino / justicio

Building an assistant for Boletin Oficial del Estado (BOE) using Retrieval Augmented Generation (RAG)
MIT License

Scraping BOE and legal validity of documents #42

Closed (llrs closed this issue 9 months ago)

llrs commented 9 months ago

The tool uses the XML version of the documents in the BOE gazette, which has been the only valid and legally binding format only since 01/01/2009. How do you deal with laws from earlier years?

In addition, the metadata that BOE provides about references to and from other laws is not complete. I haven't seen anything in the scraper file that checks for this. How does the tool deal with it?

adantart commented 9 months ago

Hi @llrs

First of all, thank you very much for your interest in the project. We are excited to build an open and active community around democratizing the use of legal information, giving different profiles (professionals and citizens) easy access to it.

I will answer your two questions.

1) XML files for the BOE go back many years. In the first decades (1960-1980) the BOE was not digitized, so not all of the content has been extracted, but there is XML with at least the metadata from the very beginning. The full content was added to the XML format (I believe) shortly before 2000. For example, this BOE document from February 2000 has all of its content inside the tag that holds the body text: https://boe.es/diario_boe/xml.php?id=BOE-A-2000-6880

Best of all, if an "old" BOE document (before 2000) contains relevant information (for example, the basis of the civil or penal code), it has been digitized and its complete content appears in that same tag.
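As a rough illustration of the point above, here is a minimal sketch that builds the XML URL for a document ID and flags documents published before the 01/01/2009 cutoff mentioned in the question. The URL pattern comes from the example link. The ID shape and the idea that the four-digit group is the publication year are assumptions inferred from that single example, and the function names are hypothetical, not part of the project.

```python
import re
from urllib.parse import urlencode

# Endpoint taken from the example link in this thread.
BOE_XML_ENDPOINT = "https://boe.es/diario_boe/xml.php"

# Assumed ID shape, inferred only from the example "BOE-A-2000-6880".
BOE_ID_PATTERN = re.compile(r"^BOE-[A-Z]-(\d{4})-(\d+)$")


def boe_xml_url(doc_id: str) -> str:
    """Return the XML URL for a BOE document ID (hypothetical helper)."""
    if not BOE_ID_PATTERN.match(doc_id):
        raise ValueError(f"Unexpected BOE ID format: {doc_id!r}")
    return f"{BOE_XML_ENDPOINT}?{urlencode({'id': doc_id})}"


def publication_year(doc_id: str) -> int:
    """Read the year from the ID, assuming the 4-digit group is the year."""
    match = BOE_ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"Unexpected BOE ID format: {doc_id!r}")
    return int(match.group(1))


def predates_xml_legal_validity(doc_id: str, cutoff_year: int = 2009) -> bool:
    """True if the document is older than the cutoff raised in the question."""
    return publication_year(doc_id) < cutoff_year
```

A scraper could use a check like `predates_xml_legal_validity` to tag older documents, so downstream code knows the XML is not the legally binding version for them.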

2) About the cross-references: in every BOE document that references others (which is practically all of them), they are listed in the references block of the XML. Inside it, references to earlier documents and references from later documents are marked separately. Each entry gives the details of the referenced BOE document and also the nature of the reference: whether the law repeals, modifies, cites, or dictates another document.
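To make the structure described above concrete, here is a small sketch that reads earlier and later references from an XML string. The sample document is invented for illustration. Apart from `analisis` and `referencias`, which are named elsewhere in this thread, every tag and attribute name (`anteriores`, `posteriores`, `referencia`, `palabra`) and both document IDs are assumptions, so check them against a real BOE file before relying on this.

```python
import xml.etree.ElementTree as ET

# Invented sample; tag names, attribute names, and IDs are assumptions.
SAMPLE_XML = """\
<documento>
  <analisis>
    <referencias>
      <anteriores>
        <anterior referencia="BOE-A-1999-0001">
          <palabra>DEROGA</palabra>
        </anterior>
      </anteriores>
      <posteriores>
        <posterior referencia="BOE-A-2005-0002">
          <palabra>SE MODIFICA</palabra>
        </posterior>
      </posteriores>
    </referencias>
  </analisis>
</documento>
"""


def extract_references(xml_text: str) -> list[dict]:
    """Collect each reference with its direction, target ID, and relation."""
    root = ET.fromstring(xml_text)
    refs = []
    for direction in ("anteriores", "posteriores"):
        # Every child of the direction block is treated as one reference.
        for node in root.findall(f"./analisis/referencias/{direction}/*"):
            refs.append(
                {
                    "direction": direction,
                    "target": node.get("referencia"),
                    "relation": (node.findtext("palabra") or "").strip(),
                }
            )
    return refs


references = extract_references(SAMPLE_XML)
```

Keeping the relation (repeals, modifies, and so on) next to the target ID lets later steps filter, for example, only the references that repeal another document.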

Do not hesitate to contact us with any other questions, we will be happy to help you!

llrs commented 9 months ago

I'm also happy that more effort is going into the laws themselves instead of only jurisprudence.

Thanks, this kind of answers the questions.

  1. The XML files are not verified against the PDFs or the printed copies (via OCR).
  2. The program doesn't check whether the text contains references to previous documents/laws that are missing from analisis:referencias.

Good luck!

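For what it's worth, the check described in point 2 could be sketched roughly like this: scan the body text for document IDs and report those that are absent from the listed references. This is a simplification under stated assumptions. The ID pattern is inferred from the single example ID in this thread, real texts usually cite laws by name or number rather than by BOE ID, and `unlisted_references` is a hypothetical helper, not something in the project.

```python
import re

# Assumed ID shape, inferred from the example "BOE-A-2000-6880".
BOE_ID_RE = re.compile(r"BOE-[A-Z]-\d{4}-\d+")


def unlisted_references(body_text: str, listed_ids: set[str]) -> set[str]:
    """Return IDs mentioned in the text but missing from the metadata."""
    mentioned = set(BOE_ID_RE.findall(body_text))
    return mentioned - set(listed_ids)


# Invented example text and IDs, for illustration only.
text = "This order modifies BOE-A-1999-0001 and cites BOE-A-1998-0007."
missing = unlisted_references(text, {"BOE-A-1999-0001"})
```

Catching citations written as law names would need a separate pattern or lookup table, which this sketch does not attempt.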
adantart commented 9 months ago

Hi again !

1) As I explained, this is not a problem in practice, because the most important texts are well "extracted" from the original PDFs or paper copies. A lot of the BOE is not needed for this exercise. Still, it is a good point to keep in mind ;-)

2) Right. What the program does is "spit" that information out through the API so the frontend can make use of it. The API endpoint sends all the RAG information and the LLM output to the frontend, where you can work with it as you wish ;-)