Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

feat/skip_strikethrough parameter #3569

Open arisjr opened 2 weeks ago

arisjr commented 2 weeks ago

Is your feature request related to a problem? Please describe. Yes. I'm building a RAG pipeline over a set of Brazilian laws, and I think the problem applies to the whole RAG/LLM community. (I'm new to RAG.)

Laws and other legal publications that need to keep a history of changes normally don't simply erase the old text; they strike it through, as in the examples below:

https://www.planalto.gov.br/ccivil_03/_ato2004-2006/2006/decreto/d5948.htm
https://www.justice.gov/oip/freedom-information-act-5-usc-552

I think that including struck-through text in the data may lead the model to false assumptions, and so to wrong results for the analyst.

Describe the solution you'd like Add a skip_strikethrough parameter to the partition_html function.

Describe alternatives you've considered None

Additional context I'm using LangChain's UnstructuredHTMLLoader.

arisjr commented 2 weeks ago

The partition_html function seems to be the easiest place to implement this, because HTML has explicit tags for strikethrough text.
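To illustrate the idea, here is a minimal sketch of a possible workaround that removes struck-through content from the HTML before it is partitioned. It is not how unstructured works today, and it is not a proposed implementation of skip_strikethrough. The helper name `strip_strikethrough` and the tag set are assumptions made for this example. It uses only the Python standard library and covers only the `<s>`, `<strike>`, and `<del>` tags. Text struck through via CSS (`text-decoration: line-through`) is not detected, comments and the doctype are dropped, and an unclosed strikethrough tag would swallow the rest of the document.

```python
from html.parser import HTMLParser

# Tags assumed to mark struck-through text: <s> and <del> (HTML5),
# plus <strike> (obsolete, but still common in older legal pages).
STRIKE_TAGS = {"s", "strike", "del"}


class _StrikethroughStripper(HTMLParser):
    """Re-emit HTML, dropping everything nested inside a strikethrough tag."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self._depth = 0  # > 0 while inside a strikethrough element
        self._out = []

    def handle_starttag(self, tag, attrs):
        if tag in STRIKE_TAGS:
            self._depth += 1
        elif self._depth == 0:
            self._out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        if self._depth == 0 and tag not in STRIKE_TAGS:
            self._out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIKE_TAGS:
            self._depth = max(0, self._depth - 1)
        elif self._depth == 0:
            self._out.append(f"</{tag}>")

    def handle_data(self, data):
        if self._depth == 0:
            self._out.append(data)

    def handle_entityref(self, name):
        if self._depth == 0:
            self._out.append(f"&{name};")

    def handle_charref(self, name):
        if self._depth == 0:
            self._out.append(f"&#{name};")

    def result(self) -> str:
        return "".join(self._out)


def strip_strikethrough(html: str) -> str:
    """Return `html` with <s>, <strike>, and <del> elements removed (hypothetical helper)."""
    parser = _StrikethroughStripper()
    parser.feed(html)
    parser.close()
    return parser.result()
```

The cleaned string could then be handed to the partitioner, for example `partition_html(text=strip_strikethrough(raw_html))`, assuming the page has already been downloaded into `raw_html`.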

It would be very nice, after implementing this for HTML, to extend it to the other document types supported by the unstructured project, such as PDF (which may be more difficult).