NRGI / rgi-assessment-tool

MEAN build of RGI 2015 assessment tool
MIT License

Validate imported documents based on extension #570

Closed iprunache closed 7 years ago

iprunache commented 7 years ago

Why

Being able to import documents from users is a critical function of the RGI tool, but with the current upload process some files end up corrupted in the RGI tool storage.

One way to prevent corrupted documents from being imported is to make a best effort to validate them based on their extension before storing them as valid documents. This will not be possible for all document types, but since the tool accepts only a limited range of file types, we should be able to devise validation checks for a good part of them.

What

Notes

Some pointers on how to check for a valid PDF:
http://stackoverflow.com/questions/28156467/fastest-way-to-check-that-a-pdf-is-corrupted-or-just-missing-eof-in-ruby
http://superuser.com/questions/580887/check-if-pdf-files-are-corrupted-using-command-line-on-linux

alexander-elgin commented 7 years ago

The source file may have no extension at all, e.g. if it is uploaded by URL. The extensions of the files transferred to the S3 server are generated automatically from the file MIME types, which is a more robust solution. To check PDF files there is a tool based on https://github.com/flexpaper/pdf2json, but it sometimes marks valid PDF files as invalid. Hence all suggested tools and approaches should be tested carefully before being applied to the production environment.

alexander-elgin commented 7 years ago

It seems that we cannot use the pdfinfo command referred to in http://superuser.com/questions/580887/check-if-pdf-files-are-corrupted-using-command-line-on-linux. Here is the output of the command:

Error (81): Illegal character <3f> in hex string
Error (82): Illegal character <78> in hex string
Error (83): Illegal character <70> in hex string
Error (86): Illegal character <6b> in hex string
Error (88): Illegal character <74> in hex string
Error (92): Illegal character <67> in hex string
Error (93): Illegal character <69> in hex string
Error (94): Illegal character <6e> in hex string
Error (95): Illegal character <3d> in hex string
Error (96): Illegal character <27> in hex string
Error (97): Illegal character <ef> in hex string
Error (98): Illegal character <bb> in hex string
Error (99): Illegal character <bf> in hex string
Error (100): Illegal character <27> in hex string
Error (102): Illegal character <69> in hex string
Error (104): Illegal character <3d> in hex string
Error (105): Illegal character <27> in hex string
Error (106): Illegal character <57> in hex string
Error (108): Illegal character <4d> in hex string
Error (110): Illegal character <4d> in hex string
Error (111): Illegal character <70> in hex string
Error (114): Illegal character <68> in hex string
Error (115): Illegal character <69> in hex string
Error (116): Illegal character <48> in hex string
Error (117): Illegal character <7a> in hex string
Error (118): Illegal character <72> in hex string
Error (120): Illegal character <53> in hex string
Error (121): Illegal character <7a> in hex string
Error (122): Illegal character <4e> in hex string
Error (123): Illegal character <54> in hex string
Error (125): Illegal character <7a> in hex string
Error (126): Illegal character <6b> in hex string
Error (130): Illegal character <27> in hex string
Error (131): Illegal character <3f> in hex string
Title:          République Islamique de Mauritanie                         Honneur Fraternité Justice
Author:         metou
Creator:        Nitro Pro 8  (8. 5. 0. 26)
Producer:       Nitro Pro 8  (8. 5. 0. 26)
CreationDate:   Fri Oct 10 15:42:44 2014
ModDate:        Fri Oct 10 15:42:52 2014
Tagged:         no
Pages:          22
Encrypted:      no
Page size:      595.32 x 841.92 pts (A4)
File size:      112116 bytes
Optimized:      no
PDF version:    1.4

This is the output for the file 00c1a008d50e128ff83c5cd5fc2e1bbd3bf6524b.pdf. As you can see, the file can be opened properly in a PDF viewer.

alexander-elgin commented 7 years ago

The example from http://stackoverflow.com/questions/28156467/fastest-way-to-check-that-a-pdf-is-corrupted-or-just-missing-eof-in-ruby is for Ruby. For Node.js I do not know a way to do EOF processing, and I cannot find a reference for it either. Node.js uses events and streams; such low-level access is not available.

aismail commented 7 years ago

@alexander-elgin I am personally not satisfied with closing this down. This is actually a problem that the users have. Do you agree?

alexander-elgin commented 7 years ago

@aismail The solutions proposed by @iprunache either do not work or cannot be applied. Hence I closed the issue. File damage during upload from a user's PC is not the only potential cause of broken files; it can also happen e.g. during upload from a remote host. Once I find a suitable solution I'll open a GitHub issue and implement it. Any solutions proposed by you or other team members are welcome for review.

iprunache commented 7 years ago

@alexander-elgin the document import code seems to read the documents before uploading them to S3, so that step can also be used to check whether PDFs imported from a URL have the EOF terminator. fs.read allows specifying a position to read from, in case it's impractical to read the entire file.
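
A minimal sketch of the tail check, assuming the document is already on local disk before the S3 upload. The function name `hasPdfEofMarker` and the 1024-byte window are illustrative assumptions, not part of the RGI codebase. It uses `fs.readSync`, which takes the same position argument as `fs.read`:

```javascript
const fs = require('fs');

// Assumed tolerance: look for the marker in the last 1024 bytes only,
// since some producers append trailing bytes after %%EOF.
const TAIL_BYTES = 1024;

// Hypothetical helper: returns true if "%%EOF" appears near the end of the file.
function hasPdfEofMarker(filePath) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const size = fs.fstatSync(fd).size;
    const length = Math.min(TAIL_BYTES, size);
    if (length === 0) {
      return false; // an empty file cannot be a valid PDF
    }
    const buffer = Buffer.alloc(length);
    // The last argument is the file position to start reading from,
    // so only the tail is read, never the whole document.
    const bytesRead = fs.readSync(fd, buffer, 0, length, size - length);
    return buffer.subarray(0, bytesRead).toString('latin1').includes('%%EOF');
  } finally {
    fs.closeSync(fd);
  }
}
```

A missing marker only catches truncated uploads; a file can end with `%%EOF` and still be damaged elsewhere, so this would be one check among several rather than a complete validation.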

With regard to pdfinfo, you need to check the exit code, rather than the command output, to see whether the PDF is valid.
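
A rough sketch of the exit-code check, assuming `pdfinfo` is installed and on the PATH of the server. The helper names `exitsCleanly` and `isPdfAcceptedByPdfinfo` are hypothetical, and whether exit status 0 is a strict enough criterion would need testing against real uploads:

```javascript
const { spawnSync } = require('child_process');

// Hypothetical helper: runs a command and reports whether it exited with status 0.
// Output is discarded on purpose; only the exit code is considered.
function exitsCleanly(command, args) {
  const result = spawnSync(command, args, { stdio: 'ignore' });
  // result.error is set when the command could not be started (e.g. not installed);
  // result.status is null when the process was killed by a signal.
  return !result.error && result.status === 0;
}

// Hypothetical wrapper: treats a PDF as acceptable when pdfinfo exits with 0,
// regardless of any warnings it prints.
function isPdfAcceptedByPdfinfo(filePath) {
  return exitsCleanly('pdfinfo', [filePath]);
}
```

With this approach, warnings like the "Illegal character" lines in the output above would not by themselves reject a file; only a non-zero exit status would.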