Vitaliy-1 / docxConverter

Plugin for OJS 3 that parses DOCX and converts it to JATS XML format
GNU General Public License v2.0

No images displayed (using DocxConverter & JatsParser) #10

Closed LoicE5 closed 2 years ago

LoicE5 commented 3 years ago

Hi @Vitaliy-1,

I am currently implementing the following workflow in OJS with two of your plugins: DocxConverter and JatsParser.

  1. The author submits the article in DOCX
  2. The DOCX article is converted to JATS with DocxConverter
  3. (optional) The editor may edit the article with the Texture editor plugin
  4. At the publication stage, JatsParser generates the full-text HTML and a PDF galley

This entire process works fine, with one exception: figures.

When I upload the document, no images are displayed in the HTML or in the PDF.

Here are the approaches I tried:

  1. Uploading the DOCX file and converting it to JATS without adding any supplementary file.
  2. Uploading the DOCX file and converting it to JATS, then attaching the images to the XML as JPEG files.
  3. Uploading the DOCX file and the images together at the submission stage (at the same level), then converting the DOCX to JATS and publishing.
  4. Uploading the DOCX file and converting it to JATS, then copying the images into every directory of the article via FTP (for testing purposes).

In every case, the DOCX document comes from MS Word or Google Docs.

I don't get any fatal errors in the Apache log.

Do you have any advice regarding image display in DocxConverter and JatsParser?

Thanks a lot in advance for your help!

Loïc

Vitaliy-1 commented 3 years ago

Hi @LoicE5,

I see the error related to JATS XML parsing in the log. Have you noticed fig tags in the JATS XML produced by the DOCX Converter plugin (like https://jats.nlm.nih.gov/archiving/tag-library/1.1d1/n-ib40.html)?

LoicE5 commented 3 years ago

There are <fig> tags in the converted JATS XML. Here's a screenshot:

image

I made a GDrive folder with all the files for this example: https://drive.google.com/drive/folders/1HnFWpQysnU_V9mi-KhUWwU2Yk0qHoTfR?usp=sharing

(The DOCX document in this folder comes from GDocs; I also tried with MS Word.)

The issue might also come from the JatsParser plugin. Maybe a routing issue?

Here's the test journal with the element inspector:

image (a logical URL for the image would be http://192.168.1.10/ojs2_n4/index.php/monjournal/article/view/129/image1.jpg, but it returns 404).

In this process, we submit the images as supplementary files attached to the XML document during the production stage. They are stored in /var/www/ojs_n4-files/journals/1/articles/129/submission/proof.

I also noticed that files in the directory tree are renamed, including images. I tried placing image1.jpg, image2.jpg and image3.jpg directly in the directory tree, without any success.

image

To sum up:

Thanks a lot for your help! I will be happy to give you more details if you need them :)

Loïc

Vitaliy-1 commented 3 years ago

Thanks! Yes, it looks like the JATS Parser plugin is unable to pick up the correct path to the images in the system.

How is DocxConverter supposed to manage the images included in the document? Should we upload the images independently of the conversion?

The plugin copies the images from the DOCX archive and attaches them to the resulting file automatically during conversion.

When we add an image through Texture and upload the images again in the dependency grid (as mentioned in the Texture readme), what else do we have to do so that the images are displayed in the HTML and PDF with JatsParser?

You shouldn't have to do anything else.

Can you also share a DOCX file with images that aren't parsed correctly, so that I can run the conversion chain and reproduce the error?

LoicE5 commented 3 years ago

Here are some files, containing images, that I created specifically for testing:

DOCX created with Google Docs, image1.jpg, image2.jpg, image3.jpg

DOCX created with MS Word, image1.jpeg, image2.jpeg, image3.jpeg

LoicE5 commented 3 years ago

Hi @Vitaliy-1,

After some research, I may have found a workaround, based on Texture, that would allow the images to be displayed in the DOM.

When you attach images to the XML file and then edit it with Texture, you can inspect the images using the browser's developer tools.

image

We then get a URL with four GET parameters:

http://<domain>/index.php/<journal_name>/texture/media?submissionId=139&fileId=392&stageId=5&fileName=image1.jpeg

It would then be possible, using the custom header plugin, to add JavaScript that dynamically replaces the src of each existing img with the path above, filled in with the relevant GET parameters (using a forEach loop).

However, I cannot see any way to echo $fileId in the document (that would be my only way to get this value in JS).
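A minimal sketch of that idea, assuming the URL shape quoted above. The function names, `baseUrl`, `journal` and the `ids` object are hypothetical, and the sketch does not solve the open question of where `fileId` comes from: it still has to be supplied by the page.

```javascript
// Sketch only: builds a Texture media URL of the shape quoted above.
// `baseUrl`, `journal` and `ids` are hypothetical inputs, not plugin API.
function textureMediaUrl(baseUrl, journal, ids, fileName) {
  const query = new URLSearchParams({
    submissionId: String(ids.submissionId),
    fileId: String(ids.fileId),
    stageId: String(ids.stageId),
    fileName: fileName,
  });
  return `${baseUrl}/index.php/${journal}/texture/media?${query.toString()}`;
}

// Rewrites every image whose src is a bare file name (no slash in it).
// `images` is anything with forEach whose items expose
// getAttribute/setAttribute, e.g. document.querySelectorAll('img').
function rewriteImageSources(images, baseUrl, journal, ids) {
  images.forEach((img) => {
    const src = img.getAttribute('src');
    if (src && !src.includes('/')) {
      img.setAttribute('src', textureMediaUrl(baseUrl, journal, ids, src));
    }
  });
}
```

In a browser this could be called as `rewriteImageSources(document.querySelectorAll('img'), baseUrl, journal, ids)`. Images that already have an absolute or nested path are left untouched.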

image

I've located the place where $fullText is produced, in the displayFullText function. However, it seems difficult to obtain $fileId from this function since it isn't declared there, and the args input array doesn't seem to include the value that I'm looking for...

image

Do you have any idea how I could echo the $fileId variable into the HTML DOM (in a display:none tag)?

Thanks a lot for your help (and your involvement in the PKP community)!

Loïc

Vitaliy-1 commented 3 years ago

Hi @LoicE5,

I hope to take a look at the issue and examples tomorrow.

JATS Parser's controller (handler) extends OJS's ArticleHandler and adds a method for operations with dependent files: https://github.com/Vitaliy-1/JATSParserPlugin/blob/90f9eb4813de35e275de78cb616fde7516c48554/FullTextArticleHandler.inc.php#L22. I'd start debugging from this method to see if and where it fails. If I recall correctly, the downloadFullTextAssoc operation accepts the submission id, the JATS XML file id and the image file id as its first 3 arguments. Let's say the original XML file has ID 1000 and is linked to the submission with ID 100, and the dependent image attached to that file has ID 1001; then the image should be available at:

.../journal/article/downloadFullTextAssoc/100/1000/1001
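As a rough illustration of that URL scheme, the sketch below joins the three ids in the order recalled above (submission id, JATS XML file id, image file id). The function name and the `articleBaseUrl` parameter are hypothetical, and the argument order is taken from the description in this comment rather than verified against the handler.

```javascript
// Sketch: assembles a downloadFullTextAssoc URL from the three ids.
// `articleBaseUrl` is a hypothetical prefix ending in ".../journal/article".
function downloadFullTextAssocUrl(articleBaseUrl, submissionId, xmlFileId, imageFileId) {
  const ids = [submissionId, xmlFileId, imageFileId];
  if (!ids.every((id) => Number.isInteger(id) && id > 0)) {
    throw new TypeError('all ids must be positive integers');
  }
  // Drop trailing slashes so the result never contains "//" after the prefix.
  const prefix = articleBaseUrl.replace(/\/+$/, '');
  return `${prefix}/downloadFullTextAssoc/${ids.join('/')}`;
}
```

Opening the resulting URL directly in the browser is a quick way to check whether the handler serves the dependent file at all, independently of how the path is written into the HTML.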

Another crucial place is where the image path is built for the actual HTML and PDF: https://github.com/Vitaliy-1/JATSParserPlugin/blob/90f9eb4813de35e275de78cb616fde7516c48554/JatsParserPlugin.inc.php#L727. This method is called just before fullText is assigned to the template. If the real path to the image is missing in the constructed HTML, probably the problem is somewhere here.

Do you have any idea how I could echo the $fileId variable into the HTML DOM (in a display:none tag)?

The path is constructed here: https://github.com/Vitaliy-1/JATSParserPlugin/blob/90f9eb4813de35e275de78cb616fde7516c48554/JatsParserPlugin.inc.php#L739, and the file id is the last argument there ($dependentFile->getFileId()). I don't assign fileId to the template, so it cannot be accessed from there directly.

Texture plugin has its own handler and the logic may be slightly different.

Let me know if you need more details.

Vitaliy-1 commented 3 years ago

Regarding handlers: https://docs.pkp.sfu.ca/dev/documentation/en/architecture-handlers

Vitaliy-1 commented 3 years ago

@LoicE5, I was able to reproduce the problem only for JATS Parser Plugin v. 2.1.9-3, which is intended to work with OJS 3.3, and I added a fix to the stable-3.3.0 branch of the plugin. See issue: https://github.com/Vitaliy-1/JATSParserPlugin/issues/59 and the referenced commit.

However, testing the stable-3_2_1 branch, which corresponds to JATS Parser v. 2.1.9-2, doesn't show any problems. I ran the test with the files attached above: uploaded them to the production stage, converted them with the DOCX Converter plugin, added the result as full text and published with the JATS Parser plugin activated. Are you sure about the versions? If yes, can you do some debugging following the hints I posted above?

LoicE5 commented 3 years ago

Hi @Vitaliy-1,

Thanks a lot for your reply and your valuable help.

After re-checking, the versions are:

I’ve tested the path you mentioned above (.../journal/article/downloadFullTextAssoc/100/1000/1001) and it works at first glance.
I've been looking at the code and I plan to dig deeper into it with @letailli in the coming days, following your hints.
I'll probably come back to you with more observations and, I hope, some possible solutions, but if you find one on your side in the meantime, do not hesitate to share it :)

Thanks a lot for your work!

Loïc

Vitaliy-1 commented 3 years ago

I’ve tested the path you mentioned above (.../journal/article/downloadFullTextAssoc/100/1000/1001) and it works at first glance.

This narrows the problem down to the part of the code that replaces the path to the image: https://github.com/Vitaliy-1/JATSParserPlugin/blob/stable-3_2_1/JatsParserPlugin.inc.php#L755-L763. E.g., I'm escaping the filename with https://www.php.net/manual/en/function.rawurlencode.php, which may cause trouble if the filename contains non-alphanumeric characters.
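To make the encoding concern concrete, here is a hedged sketch of how such a mismatch could arise. `rawUrlEncode` approximates PHP's rawurlencode in JavaScript (everything except letters, digits and `-_.~` is percent-encoded); `replaceImagePath` is a hypothetical stand-in for a search-and-replace step, not the plugin's actual code.

```javascript
// Approximation of PHP rawurlencode: encodeURIComponent leaves ! ' ( ) *
// unescaped, so those are percent-encoded in a second pass.
function rawUrlEncode(value) {
  return encodeURIComponent(value).replace(
    /[!'()*]/g,
    (c) => '%' + c.charCodeAt(0).toString(16).toUpperCase()
  );
}

// Hypothetical replacement step: looks for src="<name>" and swaps in `url`.
// With `encodeFirst` the file name is encoded before searching, which only
// matches if the HTML already contains the encoded form.
function replaceImagePath(html, fileName, url, encodeFirst) {
  const needle = encodeFirst ? rawUrlEncode(fileName) : fileName;
  return html.split(`src="${needle}"`).join(`src="${url}"`);
}
```

For a name such as `image1.jpg` the encoded and raw forms are identical, so the replacement succeeds either way. For a name such as `figure 1 (final).jpg` the encoded needle no longer matches the raw name in the HTML and the src is silently left unchanged.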

Vitaliy-1 commented 3 years ago

Hi @LoicE5,

Did you manage to figure out where the problem is?

LoicE5 commented 3 years ago

Hi @Vitaliy-1,

I inspected your code but have not been able to solve the issue. The problem, I believe, is that the images are written into the document with a relative URL ("image1.jpg", "image2.jpeg"...). A fix could be to replace these relative paths with full paths, as you mentioned above (.../journal/article/downloadFullTextAssoc/100/1000/1001).
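A small sketch of that proposed fix, operating on the generated HTML as a string. The function name and the `urlByFileName` lookup table are hypothetical. A regular expression is used only to keep the example short, so it assumes double-quoted src attributes; a real implementation would more likely walk the DOM.

```javascript
// Sketch: rewrites <img src="bare-file-name"> to a full URL.
// `urlByFileName` is a hypothetical Map from file name to full URL.
function absolutizeImageSources(html, urlByFileName) {
  // Only src values without a slash are treated as relative file names.
  const pattern = /(<img\b[^>]*?\bsrc=")([^"\/]+)(")/g;
  return html.replace(pattern, (match, before, name, after) => {
    const url = urlByFileName.get(name);
    // Leave the tag untouched when the file name is unknown.
    return url ? before + url + after : match;
  });
}
```

Unknown file names and src values that already contain a path are left as they are, so the function can be applied to the same HTML more than once without changing the result.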

Thanks for your help anyway, and sorry for my late reply :)

Have a nice day!

Vitaliy-1 commented 2 years ago

Hi @LoicE5,

Images are handled the same way as in the HTML galley: the relative path is replaced with the absolute one before the content is shown on the front end. I was recently able to reproduce the error in OJS 3.3 and will look into it soon.

Vitaliy-1 commented 2 years ago

I finally had some time today to explore and found that the problem lies in how the localized file name of the image is handled. It does not seem to arise from the JATSParser or DOCXConverter plugin. I hope to have more information tomorrow.

Vitaliy-1 commented 2 years ago

The problem was in the Texture plugin: https://github.com/pkp/texture/issues/103

letailli commented 2 years ago

Thanks a lot @Vitaliy-1. We made a (very quick) fix with JS for our urgent problem. We'll come back to it for a future release.