brobertson / Lace2

In-browser OCR editing program that transforms OCR results into structured, citable TEI. No XML experience required!
http://trylace.org
GNU General Public License v3.0

Performance issues? (CPU pegged at 100% during editing) #151

Closed mrgreekgeek closed 1 year ago

mrgreekgeek commented 1 year ago

@brobertson thank you so much for making this available; it's an amazing piece of work, and I'm really excited to be using it!

The performance has been a little problematic, though: I'm seeing 100% CPU usage on my server whenever just one person is making corrections. Many of the getCroppedImage.xq requests never get served because the CPU falls so far behind in trying to keep up with all the requests. I've found that disabling the cropped image requests (with a custom rule in an adblock browser extension) relieves the server's CPU load significantly, but that's a hack, not a solution. Is this the best Lace/eXist-db can do, and do I just need a much higher-powered server? Or are there optimizations or solutions I can try to get better performance? I'd like several people to be able to work on this OCR project at once, but currently that isn't possible. Thank you to anyone who might be able to give me some advice!

brobertson commented 1 year ago

No, this isn't usual. We have multiple folks working on a machine at once with no such problems. Can you tell me your OS, Java version and eXist-db version? I haven't kept up with all the compatibilities, and I suspect there's a problem with a more recent version of eXist-db.

brobertson commented 1 year ago

Oh, I suppose this could also be a problem caused by the size of your image set. We usually convert the images to lower-resolution binary images to avoid these problems. How did you generate your image and OCR xar files?

BTW, I'm most gratified that you're using this yourself. Most work is done on one of my servers or those of the Center for Hellenic Studies. If we run into problems, I can always set you up on one of those as a quick solution. But of course, it's great to get this sort of feedback.

mrgreekgeek commented 1 year ago

> No, this isn't usual. We have multiple folks working on a machine at once with no such problems. Can you tell me your OS, Java version and eXist-db version? I haven't kept up with all the compatibilities, and I suspect there's a problem with a more recent version of eXist-db.

Thanks for the prompt reply, @brobertson! Glad to hear that my case is unusual, which gives me hope that we can solve the issue. :)

I'm running Ubuntu 22.04 with 1 GB of RAM and 1 vCPU. (I recently upgraded to 2 vCPUs, and while that certainly helped, it didn't solve the problem: the cropped image generation is still much too slow to be usable, and it blocks other requests.)

openjdk version "11.0.19" 2023-04-18
exist-db version 6.2.0

mrgreekgeek commented 1 year ago

Yes, I was just realizing that image resolution might have been part of the issue... I used tesseract 5, and then lacebuilder to prep the files (from archive.org) and then uploaded the results to my Lace2 instance. I noticed that the files do seem to be much higher resolution than they would need to be. Should I resize them all to a standard size and then rerun OCR and lacebuilder? I don't suppose it's possible to resize them after the OCR has already been run (since it would mess with the hOCR boxes)?
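For reference, the hOCR boxes are plain pixel coordinates stored in each element's `title` attribute as `bbox x0 y0 x1 y1`, so in principle they can be rescaled by the same factor as the images. The following is a minimal Python sketch of that idea, not part of Lace or lacebuilder. The function name `scale_hocr_bboxes` and the sample line are hypothetical, and it assumes every image is resized by one uniform factor. It only touches `bbox` values; other pixel-based hOCR properties that Tesseract may emit (for example `x_size` or `baseline`) are left unchanged, and whether Lace accepts the result is untested.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr: str, factor: float) -> str:
    """Return the hOCR text with every bbox coordinate multiplied by factor.

    Hypothetical helper: the images must be resized by the same factor.
    """
    def _scale(match: re.Match) -> str:
        # Scale each of the four coordinates and round to whole pixels.
        coords = [round(int(value) * factor) for value in match.groups()]
        return "bbox " + " ".join(str(c) for c in coords)

    return BBOX_RE.sub(_scale, hocr)


# Made-up sample word element, for illustration only.
sample = '<span class="ocrx_word" title="bbox 100 200 300 260; x_wconf 93">word</span>'
print(scale_hocr_bboxes(sample, 0.5))
```

Rerunning OCR on the resized images avoids the question entirely, at the cost of more processing time.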

brobertson commented 1 year ago

The other issue is the 1 GB of RAM. I'm not sure any image resolution would succeed with that limitation. Would you like me to mount your files on my instance at heml.mta.ca/lace?

mrgreekgeek commented 1 year ago

> Would you like me to mount your files on my instance at heml.mta.ca/lace?

Maybe that would be simplest. :) Thanks for your kind offer! How can I contact you to send you the files?

brobertson commented 1 year ago

You can email me at bruce.g.robertson@gmail.com and we'll work it out.

brobertson commented 1 year ago

So, in conclusion, the slow performance here is due to insufficient RAM. My guess is 4 GB is necessary.
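To compare a server against that estimate, here is a minimal Python sketch that reads the total memory a Linux host reports. It is not part of Lace. The function name `mem_total_gib` and the sample text are hypothetical, and it assumes the standard Linux `/proc/meminfo` format with a `MemTotal:` line in kB.

```python
import re


def mem_total_gib(meminfo_text: str) -> float:
    """Parse /proc/meminfo-style text and return MemTotal in GiB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    # /proc/meminfo reports kB (1024 bytes); 1024 * 1024 kB make one GiB.
    return int(match.group(1)) / (1024 * 1024)


# Made-up sample resembling a small 1 GB virtual server.
sample = "MemTotal:        1004912 kB\nMemFree:          102400 kB\n"
print(f"{mem_total_gib(sample):.2f} GiB total")
```

On the server itself, the input would come from `open("/proc/meminfo").read()`. The reported total is usually slightly below the nominal size, because the kernel reserves some memory, so a nominal 4 GB machine will show a little under 4 GiB.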