freedomofpress / dangerzone

Take potentially dangerous PDFs, office documents, or images and convert them to safe PDFs
https://dangerzone.rocks/
GNU Affero General Public License v3.0
3.69k stars 172 forks

HWP conversion fails on MacOS M1 #498

Closed apyrgio closed 11 months ago

apyrgio commented 1 year ago

HWP conversion was recently added in https://github.com/freedomofpress/dangerzone/pull/460 and was tested across various platforms (Windows, MacOS, Linux) through our CI. However, when running our tests on a MacOS M1 platform (which is not available through any of our CI runners), we get the following error:

[DEBUG] Marking doc UWYs15 as 'converting'
[INFO ] > /usr/local/bin/docker run --network none -u dangerzone --security-opt=no-new-privileges:true --cap-drop all --rm -v /var/folders/7c/49675rcx2nb0lwjms_6wmdpr0000gs/T/tmp6rzrqzt1/unsafe/input_file:/tmp/input_file:Z -v /var/folders/7c/49675rcx2nb0lwjms_6wmdpr0000gs/T/tmp6rzrqzt1/pixels:/tmp/dangerzone:Z -e ENABLE_TIMEOUTS=1 dangerzone.rocks/dangerzone /usr/bin/python3 -m dangerzone.conversion.doc_to_pixels
[INFO ] [doc UWYs15] 0% UNTRUSTED> Installing LibreOffice extension 'h2orestart.oxt'
[INFO ] [doc UWYs15] 0% UNTRUSTED> Converting to PDF using LibreOffice                                 
[INFO ] [doc UWYs15] 3% UNTRUSTED> Calculating number of pages
[ERROR] [doc UWYs15] 3% UNTRUSTED> PDF file is corrupted

(taken from the dangerzone-cli)

If one tries to convert the file to PDF via Libreoffice:

libreoffice --headless --infilter="Hwp2002_Reader" --convert-to pdf:writer_pdf_Export <file>

they will get the following generic error:

Error: source file could not be loaded

Note that the exact same operation in an Intel MacOS device works.
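For anyone reproducing this outside Dangerzone, the command above can be wrapped in a small Python script. This is a minimal sketch, not Dangerzone's code: the helper names `build_convert_command` and `convert_hwp` are hypothetical, and it assumes a `libreoffice` binary on the PATH. It checks for the output file explicitly, because LibreOffice may print an error without producing a PDF.

```python
import subprocess
from pathlib import Path


def build_convert_command(input_file: str, outdir: str) -> list[str]:
    """Build the LibreOffice command line quoted in this issue (hypothetical helper)."""
    return [
        "libreoffice",
        "--headless",
        "--infilter=Hwp2002_Reader",
        "--convert-to",
        "pdf:writer_pdf_Export",
        "--outdir",
        outdir,
        input_file,
    ]


def convert_hwp(input_file: str, outdir: str) -> Path:
    """Run the conversion and return the PDF path, or raise if no PDF was produced."""
    cmd = build_convert_command(input_file, outdir)
    result = subprocess.run(cmd, capture_output=True, text=True)
    # LibreOffice names the output after the input file, with a .pdf suffix.
    pdf = Path(outdir) / (Path(input_file).stem + ".pdf")
    if not pdf.exists():
        details = result.stderr.strip() or result.stdout.strip()
        raise RuntimeError(f"LibreOffice produced no PDF: {details}")
    return pdf
```

Calling `convert_hwp("sample.hwp", "/tmp/out")` on an affected machine would then be expected to raise instead of returning a path.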

deeplow commented 1 year ago

Maybe we should add a note to this release's changelog saying that this is not yet supported on ARM Macs.

And this is yet another case where the error message is confusing because it looks like pdfinfo is the one failing (saying that the PDF was corrupted), where in reality it wasn't even produced since LibreOffice failed.
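One way to make the two failure modes distinguishable is to check for the output file before inspecting it. The following is only a sketch under assumptions: the exception classes and `check_converted_pdf` are hypothetical names and not part of Dangerzone, and the magic-bytes test is a simplified stand-in for what pdfinfo does.

```python
import os


class ConversionError(Exception):
    """Base class for the hypothetical errors below."""


class LibreOfficeFailed(ConversionError):
    """LibreOffice did not produce an output file at all."""


class PDFCorrupted(ConversionError):
    """A file was produced, but it does not look like a PDF."""


def check_converted_pdf(pdf_path: str) -> None:
    # A missing or empty output points at the LibreOffice step, not at pdfinfo.
    if not os.path.isfile(pdf_path) or os.path.getsize(pdf_path) == 0:
        raise LibreOfficeFailed(f"LibreOffice did not produce {pdf_path}")
    # A well-formed PDF starts with the magic bytes "%PDF-".
    with open(pdf_path, "rb") as f:
        if f.read(5) != b"%PDF-":
            raise PDFCorrupted(f"{pdf_path} is not a valid PDF")
```

With a check like this, the log in this issue would report a LibreOffice failure rather than "PDF file is corrupted".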

OctopusET commented 1 year ago

I will take a look at this issue. I can borrow the m1 macbook for testing soon.

apyrgio commented 1 year ago

One assumption is that H2Orestart, the plugin we rely on for HWP conversions, somehow fails on ARM platforms. This is further reinforced by the fact that of the 3 HWP-related test files that we have under tests/test_docs_external:

sample-hwp.hwp.b64
sample-hwp97.hwp.b64
sample-hwpx.hwpx.b64

only two of those fail (sample-hwp.hwp.b64 and sample-hwpx.hwpx.b64). The sample-hwp97.hwp.b64 file passes, probably because LibreOffice already had support for it.

There's one issue with this assumption though: H2Orestart is a Java plugin, meaning that it should work across architectures, bar some exceptions. These exceptions usually boil down to usage of JNI (see this SO answer), but that's something we can detect. If we do jar tf h2orestart.jar, we get back a list of .class files, which should be platform agnostic.
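The `jar tf` check can also be scripted. The sketch below lists entries in a JAR that look like bundled native libraries; the function name and the suffix list are assumptions made for illustration. An empty result only shows that no native library is bundled in the archive. It does not rule out JNI code that loads a library installed elsewhere on the system.

```python
import zipfile

# File suffixes commonly used for native libraries (assumed list, not exhaustive).
NATIVE_SUFFIXES = (".so", ".dll", ".dylib", ".jnilib")


def find_native_libraries(jar_path):
    """Return the entries of a JAR that look like bundled native libraries."""
    # A JAR is a ZIP archive, so the standard zipfile module can read it.
    with zipfile.ZipFile(jar_path) as jar:
        return [
            name for name in jar.namelist()
            if name.lower().endswith(NATIVE_SUFFIXES)
        ]
```

For example, `find_native_libraries("h2orestart.jar")` should return an empty list if the archive really contains only .class files and resources.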

@OctopusET thanks a lot for taking a look at it. Tagging @ebandal as well, in case they are interested in this issue as well.

deeplow commented 1 year ago

Following some feedback, we have three options here: a) not ship the HWP feature until it's fully working across the board (I guess we can ignore Qubes for now); b) inform users who choose an HWP or HWPX file on ARM Macs that it's not supported (via an alert, for example); c) fix the bug, of course.

I'd rather go for option b) if c) is not possible. It should be relatively trivial to implement.
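A rough sketch of what the check behind option b) could look like, assuming a hypothetical helper `is_unsupported_hwp` (this is not Dangerzone's actual implementation). The `system` and `machine` parameters exist only so the check can be exercised on any machine. Note that `platform.machine()` reports `x86_64` when Python itself runs under Rosetta, so this check would miss that case.

```python
import platform
from pathlib import Path

HWP_SUFFIXES = {".hwp", ".hwpx"}


def is_unsupported_hwp(filename, system=None, machine=None):
    """Return True for HWP/HWPX files on Apple Silicon Macs (hypothetical check)."""
    # Fall back to the values of the running interpreter when not overridden.
    system = system or platform.system()
    machine = machine or platform.machine()
    on_arm_mac = system == "Darwin" and machine == "arm64"
    return on_arm_mac and Path(filename).suffix.lower() in HWP_SUFFIXES
```

The GUI could call this when a file is selected and show an alert, or skip the file, whenever it returns True.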

OctopusET commented 1 year ago

I can use the M1 macbook after August 8 for dangerzone development. Before that I can only try some conversion in M1 macbook with h2orestart.

OctopusET commented 1 year ago

I just started work on M1 macbook yesterday. Seems like just conversion with H2Orestart is working.

Question: I just dropped the commit titled "HWP/HWPX disable on MacOS Apple Silicon". Is there any way to re-enable the HWP/HWPX conversion feature on MacOS Apple Silicon?

OctopusET commented 1 year ago

And with Dangerzone GUI, there's an issue with selecting the HWP/HWPX files. You can't select it. So, I tried with only CLI.

deeplow commented 1 year ago

Question: I just dropped the commit titled "HWP/HWPX disable on MacOS Apple Silicon". Is there any way to re-enable the HWP/HWPX conversion feature on MacOS Apple Silicon?

Thanks for looking into this. We can do this, but only for the next release, since we just shipped this one. However, @apyrgio did experience this issue, so it may still happen with particular files.

And with Dangerzone GUI, there's an issue with selecting the HWP/HWPX files. You can't select it. So, I tried with only CLI.

We decided to remove the ability to add these files in platforms where they were considered not compatible. Not a perfect solution, but it leads to avoiding getting errors later.

OctopusET commented 1 year ago

@deeplow, Ah, I mean, is there any way to re-enable that feature other than dropping commits.

deeplow commented 1 year ago

I think it's no big deal. We can always do another commit that does the reverse.

OctopusET commented 1 year ago

From my brief analysis, it fails when checking whether the converted file exists. I think it's almost the same issue as the previous one: https://github.com/freedomofpress/dangerzone/pull/460#issuecomment-1611757116. LibreOffice might work differently on MacOS.

I don't think it's an H2Orestart issue. I tested with several files, including our test files.

OctopusET commented 1 year ago

I can't use my borrowed M1 MacBook every day. It might take a while to do actual testing.

OctopusET commented 1 year ago

It's strange because the conversion worked for me, but the issue said it didn't even work.

So I restarted my MacBook and found some very strange behavior.

I tried the HWP->PDF and HWPX->PDF conversions on a cold-booted MacBook. They failed, as you mentioned at first. But after you open the LibreOffice GUI, they work.

And if you kill LibreOffice, for example with killall soffice, they fail again.

OctopusET commented 1 year ago

If you try with --safe-mode option, there are some additional error logs.

$ soffice --safe-mode --headless --infilter="Hwp2002_Reader" --convert-to pdf:writer_pdf_Export some.hwp
libc++abi: terminating due to uncaught exception of type com::sun::star::deployment::DeploymentException
Unspecified Application Error

OctopusET commented 1 year ago

Plus, I just started testing on the Raspberry Pi 4 with Raspberry Pi OS because it also uses one of the ARM chipsets like M1. Indeed, there are some differences, but it's worth trying.

I have only tested on the CLI. Like MacOS, HWP to PDF conversion also fails. I haven't tested with GUI. I will share the results later.

deeplow commented 1 year ago

Thanks for looking into it. Although, quoting from this other issue:

We don't plan on supporting linux 64 arm systems at the moment.

And while doing the extra work to support ARM macOS can be justified by the fact that a significant number of journalists use macOS, the same cannot be said for Linux arm64. However, if you can get it working there, that's great! There could even be a non-officially supported port. But it's best to continue this conversation in https://github.com/freedomofpress/dangerzone/issues/50.

OctopusET commented 1 year ago

Oh yes, so true. I just wanted to check it's also failing on the other ARM system. Still, I'm focusing on the MacOS (with Apple Silicon).

deeplow commented 1 year ago

Libreoffice might work differently on MacOS.

I don't think this is the case, because LibreOffice runs fully in a Linux environment (inside the container). I would go more for the other hypothesis, that this is something that affects LibreOffice on ARM systems. Maybe playing around with H2Orestart on LibreOffice on the Raspberry Pi will reveal similar symptoms.

I tried the HWP->PDF, HWPX->PDF conversion on cold booted macbook.

Just to clarify: when you tested this, you did so through Dangerzone? Or directly on macOS with LibreOffice? I may try to debug this further once I clear some of my other pending tasks.

OctopusET commented 1 year ago

Just to clarify: when you tested this, you did so through Dangerzone? Or directly on macOS with LibreOffice?

@deeplow I only tested directly on MacOS with LibreOffice, without Dangerzone.

deeplow commented 1 year ago

OK. I'm curious about the libreoffice raspberry pi results. I bet they will fail and my assumption in that case is that something is wrong with either LibreOffice for ARM or H2Orestart.

OctopusET commented 1 year ago

Here's the result of HWP to PDF conversion on Raspberry Pi 4.

I have only tested on the CLI. Like MacOS, HWP to PDF conversion also fails. I haven't tested with GUI. I will share the results later.

As I mentioned above, it fails. (Tested on Desktop mode only with the terminal)

$ soffice --headless --infilter="Hwp2002_File" --convert-to pdf:writer_pdf hello.hwp 
[08-15 02:46] (riterContext.dete) INFO: file detected not HWPX
[08-15 02:46] (riterContext.dete) INFO: file detected as HWP
[08-15 02:46] (OrestartImpl.dete) INFO: File is Hancomm document.
convert /home/pi/hello.hwp -> /home/pi/hello.pdf using filter : writer_pdf
Error: Please verify input parameters... (SfxBaseModel::impl_store <file:///home/pi/hello.pdf> failed: 0x81a(Error Area:Io Class:Parameter Code:26) ./sfx2/source/doc/sfxbasemodel.cxx:3153 ./sfx2/source/doc/sfxbasemodel.cxx:1735)

However, there is also a strange behavior. It succeeds when you open LibreOffice GUI version. I opened another terminal and just ran soffice. Then I ran the same script:

soffice --headless --infilter="Hwp2002_File" --convert-to pdf:writer_pdf hello.hwp 

It succeeded. I guess something is not opening before you open the libreoffice gui.

Update

I found some related issues: https://bugs.documentfoundation.org/show_bug.cgi?id=131323 https://github.com/shelfio/libreoffice-lambda-layer/issues/33

My libreoffice version on Raspberry pi 4 is LibreOffice 7.0.4.2 00(Build:2)

Update 2

Oh, never mind, there was an error in the script: pdf:writer_pdf should be pdf:writer_pdf_Export. But still, I don't understand why it worked when I opened the GUI.

apyrgio commented 1 year ago

Just to make sure I get this right: you're saying that on your Raspberry Pi, if you open the LibreOffice GUI beforehand and then perform the conversion via the terminal, it consistently works? I'm asking because I'm not sure what the outcome is after the switch from pdf:writer_pdf to pdf:writer_pdf_Export.

Also, just a heads up, I plan to open an issue to the H2ORestart repo, to notify the maintainer and other users about this problem, since it doesn't seem to be Dangerzone-specific.

apyrgio commented 1 year ago

@ebandal did a nice dig and probed the Alpine Linux devs: https://gitlab.alpinelinux.org/alpine/aports/-/issues/15212. Turns out that the LibreOffice package for aarch64 does not support Java. Hopefully this will be resolved upstream. Once it does, we should remove our GUI restrictions for this platform.

OctopusET commented 1 year ago

https://gitlab.alpinelinux.org/alpine/aports/-/commit/74d443f479df15fc57e6fde6ac02a36b24afdded

They enabled Java support for aarch64.

OctopusET commented 1 year ago

I think rebuilding image would fix it. I will test it soon.

OctopusET commented 1 year ago

I had to change the docker image to alpine:edge (20230901). But it works!

apyrgio commented 1 year ago

Nice, thanks for staying on top of this. I'll take a look at your PR and associated issue and we'll fix this.

apyrgio commented 1 year ago

@OctopusET: Quick update now that we are about to cut the 0.5.0 release. We wanted to switch back to alpine:latest (see #542) before the release, provided that the required LibreOffice version is in the repos. This does not seem to be the case, so we had two options at hand: either release on alpine:edge, or stay with alpine:latest and not include the latest fixes.

Ultimately, we decided that we don't want to risk any last minute breaking changes from the edge repo. This means that we will revert your fixes for now, and wait until November, for the new release of Alpine Linux, to properly include them.

It's a bit of a tough decision, but we believe it's best to prioritize the stability of the software over new features. Thank you for your persistence on this feature, and rest assured that the next release will include it :slightly_smiling_face:.

OctopusET commented 1 year ago

I agree with you about stability concerns. And eventually the package will be updated, so no need to rush for this.

Thank you for letting me know about this

OctopusET commented 11 months ago

@apyrgio https://www.alpinelinux.org/posts/Alpine-3.19.0-released.html

deeplow commented 11 months ago

Thank you very much for pinging us about this and submitting a PR. We are just about to push out a security release, which should indeed include Alpine 3.19. However, since it's a security-focused release and we had already hit feature freeze, we will have to include your PR in the next one.