OpenTermsArchive / engine

Tracks contractual documents and exposes changes to the terms of online services.
https://opentermsarchive.org
European Union Public License 1.2

Missing whitespace when generating version from PDF #836

Closed: martinratinaud closed this 2 years ago

martinratinaud commented 2 years ago

While looking at changes received by email, I came across this particular PDF file, which is declared in france-declarations (see https://github.com/OpenTermsArchive/france-declarations/blob/main/declarations/Decathlon.json )

https://www.decathlon.fr/static/2019/LP/services/global-services/V21/assets/cgv.pdf

The version generated from it is missing whitespace in some places: https://github.com/OpenTermsArchive/france-versions/commit/91b6c1f

@MattiSG, I'm not sure what kind of label I should use though, maybe parsing?

MattiSG commented 2 years ago

This bears some similarities to https://github.com/ambanum/OpenTermsArchive/issues/752.

The first step should be to check for updates or pending issues in @accordproject, and to open an issue there if none match.

MattiSG commented 2 years ago

(As a side note, on the label question: I don't really see the benefit of adding a specific label for this at the moment. The number of issues is manageable as it is, and grouping them by technical component will unfortunately not make them easier to solve IMO 😅)

martinratinaud commented 2 years ago

Here is what I've done so far.

TL;DR: the problem occurs when converting from PDF to HTML (and not from HTML to Markdown).
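To narrow down which stage drops the whitespace, one rough check is to look for unusually long "words" in the text at each stage: if the run-together words already show up in the text extracted from the intermediate HTML, the Markdown conversion is not the culprit. The sketch below is only a heuristic under stated assumptions: the helper name `suspicious_tokens`, the 25-character threshold, and the sample strings are all hypothetical and are not taken from the engine or from the actual Decathlon document.

```python
def suspicious_tokens(text, max_len=25):
    """Return whitespace-delimited tokens longer than max_len.

    URLs are skipped because they are legitimately long.
    The threshold is an arbitrary assumption and may need tuning per language.
    """
    return [
        token
        for token in text.split()
        if len(token) > max_len and not token.startswith(("http://", "https://"))
    ]


# Hypothetical excerpts standing in for the output of each stage
html_text = "Conditionsgénéralesdeventeapplicables au 1er janvier"
markdown_text = "Conditionsgénéralesdeventeapplicables au 1er janvier"

# If the HTML stage is already affected, the problem is upstream of HTML to Markdown
print("HTML stage:", suspicious_tokens(html_text))
print("Markdown stage:", suspicious_tokens(markdown_text))
```

Such a check will produce false positives on long identifiers or compound words, so it is better suited to pointing at places to inspect by hand than to failing a build.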

We also have to consider the following

My suggestion is that we give ourselves 2 days to see whether the issues I created get an answer. If not, we try another PDF-to-HTML converter and see if it works.

@MattiSG @Ndpnt if you have any other ideas, please shoot

MattiSG commented 2 years ago

Thanks @martinratinaud for your investigation, for the recap, and for opening issues on the dependencies! Let's hope @accordproject has a quick fix for this 🙂 🤞

martinratinaud commented 2 years ago

A fix has been made and it is working correctly 🎉

I still need to update the tests though, as the output will now be slightly different for all PDF files