PhilGale92 / docx

PHP Based Docx Parser
MIT License
38 stars, 19 forks

Cannot parse large docx files #53

Closed sujaypatil96 closed 5 years ago

sujaypatil96 commented 5 years ago

The library works perfectly when I upload small files, but it fails to parse larger ones.

For example, assume that the docx file contains only text and tables (black and white, no colour). Parsing works if the file has around 40 such pages (approx. 50 KB), but fails for files with more pages and a greater size than that.

PhilGale92 commented 5 years ago

Hmmm, I've been able to use it on Word files that were a few MB in size. It's probably that your larger files have some other issue inside.

Is there any error message in particular you're receiving? If you could send a sample file that reproduces the problem, I will look into it when I have a moment.

Thanks,

sujaypatil96 commented 5 years ago

Hm, thank you for clarifying, but unfortunately I won't be able to share the reports with anybody as per company policy. However, I can describe it for you: a larger file would include text (no colour) and approximately 20 tables, each with up to 1500 rows and 10 columns (at least), with around 50 words in each cell.

Maybe you could create a dummy docx matching the spec above and try it out?
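For anyone wanting to reproduce this without the original reports, here is a minimal sketch of a generator for such a dummy file. It is written in Python purely as a fixture tool and is not part of this library. The function names (`build_table`, `write_dummy_docx`) and the filler text are hypothetical, and the package contains only the three parts a bare WordprocessingML document needs. Whether the parser treats this stripped-down markup the same way as a Word-authored file is an assumption.

```python
import zipfile
from xml.sax.saxutils import escape

# WordprocessingML main namespace (from the OOXML standard).
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType='
    '"application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns='
    '"http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)


def build_table(rows, cols, words_per_cell):
    """Return the XML for one plain table filled with placeholder words."""
    text = escape(" ".join(["word"] * words_per_cell))
    cell = f"<w:tc><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:tc>"
    grid = "<w:tblGrid>" + "<w:gridCol/>" * cols + "</w:tblGrid>"
    row = "<w:tr>" + cell * cols + "</w:tr>"
    return "<w:tbl><w:tblPr/>" + grid + row * rows + "</w:tbl>"


def write_dummy_docx(path, tables=20, rows=1500, cols=10, words_per_cell=50):
    """Write a docx containing `tables` identical text-only tables."""
    table = build_table(rows, cols, words_per_cell).encode("utf-8")
    header = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>'
    ).encode("utf-8")
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("[Content_Types].xml", CONTENT_TYPES)
        zf.writestr("_rels/.rels", RELS)
        # Stream the body so the whole document is never held in memory.
        with zf.open("word/document.xml", "w") as doc:
            doc.write(header)
            for _ in range(tables):
                doc.write(table)
                doc.write(b"<w:p/>")  # empty paragraph keeps tables separate
            doc.write(b"</w:body></w:document>")
```

Calling `write_dummy_docx("dummy.docx")` uses the sizes described above (20 tables, 1500 rows, 10 columns, 50 words per cell). Starting with something smaller, such as `rows=100`, and scaling up should show roughly where the parse time starts to climb. Because the cell text is identical everywhere, the file compresses to far less than a real report would, so the size on disk will not match the original files.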

No error or warning pops up as such; it just times out. I've also tried raising the 'max_execution_time' config value so that it doesn't time out, and now it just runs forever (for at least 15 mins; I didn't check beyond that).

sujaypatil96 commented 5 years ago

Oh, I just let the script run for half an hour and it does parse the file. My PC is quite underpowered, so that could be the issue.

Thanks for checking it out. :+1:

I'm sure you've optimized the functions as much as you could, but maybe we could explore ways to optimize this further?

PhilGale92 commented 5 years ago

Cool, great to hear.

To be honest, I wrote this a pretty long time ago, so I'm pretty sure it could be optimised a fair bit (perhaps by knocking a few stages off the table handling, for example).

If I work on this again for anything other than small bug fixes, it will be to rewrite everything using some of what I have learnt in the past 5 years. But I just don't have much free time to spend on coding anymore!