Missing whitespace when generating version from PDF

martinratinaud commented 2 years ago

While reviewing changes received by email, I came across this particular PDF file tracked in the france collection (see https://github.com/OpenTermsArchive/france-declarations/blob/main/declarations/Decathlon.json):

https://www.decathlon.fr/static/2019/LP/services/global-services/V21/assets/cgv.pdf

The version generated from it is missing whitespace in some places: https://github.com/OpenTermsArchive/france-versions/commit/91b6c1f

@MattiSG I'm not sure what kind of label I should use though, maybe "parsing"?

MattiSG commented 2 years ago

This bears some similarities to https://github.com/ambanum/OpenTermsArchive/issues/752.

The first step should be to check for updates or pending issues in @accordproject, and to open an issue there if none match.

MattiSG commented 2 years ago

(As a side note on the label question: I don't really see the benefit of adding a specific label for this at the moment. The number of issues is manageable as it is, and grouping them by technical component will unfortunately not make them easier to solve IMO 😅)

martinratinaud commented 2 years ago

Here is what I've done so far.

TL;DR: the problem occurs when converting from PDF to HTML (and not from HTML to Markdown).

- Updated from 0.14.1 to 0.15.1.
- Got an error locally and created an issue: https://github.com/accordproject/markdown-transform/issues/501
- Worked around this error by installing the missing package manually. The decoding was slightly different but still not acceptable, so I created another issue: https://github.com/accordproject/markdown-transform/issues/502
- Checked forks with this nice tool, https://useful-forks.github.io, but none of them provides a fix for this.
- Tested with https://pdf2md.morethan.io/ and got the same whitespace problem (they use https://mozilla.github.io/pdf.js/).
- Tested with https://pdf.online/convert-pdf-to-html and got properly formatted HTML.
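
To compare the output of these converters more objectively than by eye, a rough heuristic could flag tokens that look like several words fused together. This is only a minimal sketch in Python: the `find_glued_tokens` helper, the `looks_glued` helper and the 25-character threshold are assumptions made for illustration, not part of the engine or of any converter mentioned above. It will produce false positives on legitimate camel-case names such as `pdf2htmlEX`.

```python
def looks_glued(token: str, max_length: int) -> bool:
    """Guess whether a whitespace-delimited token is several words fused together."""
    # URLs and email addresses are legitimately long, so they are skipped.
    if "://" in token or "@" in token:
        return False
    # Very long tokens are suspicious (threshold is an arbitrary assumption).
    if len(token) > max_length:
        return True
    # A lowercase letter directly followed by an uppercase one often marks
    # a lost space, as in "lesProduits".
    return any(a.islower() and b.isupper() for a, b in zip(token, token[1:]))


def find_glued_tokens(text: str, max_length: int = 25) -> list[str]:
    """Return the tokens of `text` that look like they are missing whitespace."""
    return [token for token in text.split() if looks_glued(token, max_length)]


if __name__ == "__main__":
    sample = "Conditionsgénéralesdevente applicables pour lesProduits vendus"
    print(find_glued_tokens(sample))
```

Running the same check on the text produced by each converter for the same PDF would give a crude count to compare, which is enough to tell a broken conversion from an acceptable one.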

We also have to consider the following:

- I don't think it's easy for us to propose a PR on this particular matter.
- There are many other libraries out there, but none that clearly stands out, and the biggest one is no longer heavily maintained: https://github.com/coolwanglu/pdf2htmlEX
- Here are some other libraries: https://www.npmtrends.com/@accordproject/markdown-pdf-vs-pdf-html-extract-vs-phantom-html2pdf
- All online PDF-to-HTML websites seem to format the output well, but I can't find a way to know which project they use (if you have an idea, let me know).

My suggestion is that we give ourselves 2 days to see whether the issues I created get an answer. If not, we try another PDF-to-HTML converter and see if it works.

@MattiSG @Ndpnt if you have any other ideas, please share them.

MattiSG commented 2 years ago

Thanks @martinratinaud for your investigation, the recap, and for opening issues upstream! Let's hope @accordproject has a quick fix for this 🙂 🤞

martinratinaud commented 2 years ago

The fix has been made and it is working correctly 🎉

I still need to update the tests, though, as the output will now be slightly different for all PDF files.

OpenTermsArchive / engine

Missing whitespace when generating version from PDF #836