ether / etherpad-lite

Etherpad: A modern really-real-time collaborative document editor.
http://docs.etherpad.org/
Apache License 2.0

[QUESTION] PDF and Docx import results unformatted text. #4385

Open afifa-glowlogix opened 4 years ago

afifa-glowlogix commented 4 years ago

I followed https://github.com/ether/etherpad-lite/wiki/How-to-enable-importing-and-exporting-different-file-formats-with-AbiWord to install LibreOffice, and installed the ep_document_import_hook plugin (https://github.com/mrbabbs/ep_document_import_hook) to import documents into my pad, but the resulting pad text is unformatted. It has a lot of extra spaces and the lists are changed to only 1.

JohnMcLear commented 4 years ago

Our tests pass here.

https://github.com/mrbabbs/ep_document_import_hook -- are you sure you want to use this? Perhaps it's causing problems?

This looks to be a plugin issue unless you can replicate on say https://video.etherpad.com -- if you can, please provide the document.

afifa-glowlogix commented 4 years ago

I think it has something to do with high fidelity of the document.

afifa-glowlogix commented 4 years ago

Here is one of my documents: https://drive.google.com/file/d/1mAfZxHkR2ny5SMHVsAKF-fuY2A1RaT1u/view

JohnMcLear commented 4 years ago

Did you try without plugins? Does it work on video.etherpad.com?

afifa-glowlogix commented 4 years ago

Yeah. I've tried without the plugin as well, but got the same results, and video.etherpad.com also generates the same result.

JohnMcLear commented 4 years ago

https://video.etherpad.com/p/3HvCofvIJq1TsySXHaEv works...

JohnMcLear commented 4 years ago

I think this is more "I want Etherpad to behave the same as Word/Docs", not "there is an actual problem". Etherpad formats content differently and behaves differently because it's entirely different software. Do you have a specific problem?

JohnMcLear commented 4 years ago

I'm seeing the document you gave us imported correctly, with correct list numbering. I'm also seeing Etherpad handle list numbers completely fine, using 1.1 etc., not 1.a.

Please try to be coherent. Provide one specific example in one document and frame your question around that. See the new issue guidelines for some advice on how to create bug reports.

afifa-glowlogix commented 4 years ago

[Screenshots: "video etherpad 2", "video etherpad"]

@JohnMcLear that's how it's showing up here. All the indentation is gone and the page number is displayed at the top. Also, there is a sub-list under the Position heading in the original document.

Is there anything I'm missing? I could have missed something while setting up etherpad-lite on my system, but it's strange that I'm getting these results on video.etherpad.com.

afifa-glowlogix commented 4 years ago

@JohnMcLear okay. I'm sorry I brought up a different document's formatting issue here. The link https://video.etherpad.com/p/3HvCofvIJq1TsySXHaEv is not displaying the document correctly for me.

webzwo0i commented 4 years ago

I'll take care of this as part of fixing https://github.com/ether/etherpad-lite/pull/4240. Hopefully it will be ready this week.

webzwo0i commented 3 years ago

The spaces issue should be fixed in the current develop branch. It would be great if you could test with the latest changes.

The indent issue is gone when using the XHTML converter of soffice, but we are not ready to switch converters yet. I'm not sure if this is a bug in LibreOffice or their intended behavior, so that needs further investigation. (I don't see any hint of the indentation with the standard HTML converter.)
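For anyone who wants to compare the two converters locally, here is a minimal sketch in Python. It assumes `soffice` is on the PATH. The helper names `build_convert_cmd` and `convert` are hypothetical, and the filter string `html:XHTML Writer File:UTF8` is an assumption about the LibreOffice export filter name, so please check it against your LibreOffice version. This is not Etherpad's own import code.

```python
import subprocess
from pathlib import Path

# Assumed filter specs for `soffice --convert-to`; verify for your LibreOffice version.
CONVERTERS = {
    "html": "html",                          # standard HTML export
    "xhtml": "html:XHTML Writer File:UTF8",  # XHTML export filter (assumed name)
}


def build_convert_cmd(src, out_dir, converter="html", soffice="soffice"):
    """Return the argv list for a headless conversion (hypothetical helper)."""
    return [
        soffice,
        "--headless",
        "--convert-to", CONVERTERS[converter],
        "--outdir", str(out_dir),
        str(src),
    ]


def convert(src, out_dir, converter="html"):
    """Run the conversion; raises CalledProcessError if soffice exits non-zero."""
    subprocess.run(build_convert_cmd(src, out_dir, converter), check=True)
    # soffice names the output after the input file, with the target extension.
    return Path(out_dir) / (Path(src).stem + ".html")
```

Running `convert("doc.docx", "out_html", "html")` and `convert("doc.docx", "out_xhtml", "xhtml")` into separate directories and diffing the two files should show whether the indentation survives in one output but not the other.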

The improper handling of nested lists in your document is a bug in LibreOffice's HTML converter. It creates a new OL for the level-2 nesting, but outside the OL of the first level. This means the a/b/c sub-list is at the same level as the outermost list; it just uses a/b/c instead of numbers. I'll look into LibreOffice's bug tracker and recent releases to find out whether it's a known bug. If not, I don't think we can do anything. (This also needs more investigation to make sure I didn't make a mistake. My first impression is that we can't distinguish whether that list is nested or not.)

So I'm sorry that two of the issues can't be solved easily, but we're getting more test coverage atm, and hopefully this will ease the transition to the XHTML converter.

Regarding the printed page number, I'm going to fix this.
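To illustrate why the sub-list can't be recognized as nested, here is a small Python sketch that records the nesting depth of every list element. The two HTML strings are hypothetical markup written for this example (not the actual converter output for the attached document), and `ListDepthParser` and `list_depths` are illustrative names.

```python
from html.parser import HTMLParser


class ListDepthParser(HTMLParser):
    """Record the nesting depth of every <ol>/<ul> start tag."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lists = []  # (tag, type attribute, depth) in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("ol", "ul"):
            self.depth += 1
            self.lists.append((tag, dict(attrs).get("type"), self.depth))

    def handle_endtag(self, tag):
        if tag in ("ol", "ul") and self.depth > 0:
            self.depth -= 1


def list_depths(html):
    """Return (tag, type, depth) for each list element in the markup."""
    parser = ListDepthParser()
    parser.feed(html)
    parser.close()
    return parser.lists


# Hypothetical markup: the sub-list sits inside the first <li>, as expected.
NESTED = '<ol><li>Position<ol type="a"><li>x</li></ol></li><li>Next</li></ol>'

# Hypothetical markup shaped like the behavior described above:
# the a/b/c list is a sibling of the outer list, not a child of it.
FLAT = '<ol><li>Position</li></ol><ol type="a"><li>x</li></ol>'

print(list_depths(NESTED))  # the type="a" list is at depth 2
print(list_depths(FLAT))    # the type="a" list is at depth 1
```

In the second case the only remaining hint is the `type="a"` attribute, which on its own doesn't prove that the list was meant to be a child of the preceding item.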

JohnMcLear commented 3 years ago

@afifa-glowlogix any feedback?

afifa-glowlogix commented 3 years ago

@JohnMcLear Most of our users are non-technical and it was hard to make them understand this issue, so we ended up using Google Docs. Still, thanks for resolving the issue @webzwo0i. I'll take some time out to check it with our documents.

JohnMcLear commented 3 years ago

I'm pushing this back a version as the majority of the support is in.

JohnMcLear commented 3 years ago

Bump @webzwo0i