LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Make Frog options available in PiCCL just as in Frog web app #25

Closed by martinreynaert 6 years ago

martinreynaert commented 6 years ago

Hi proycon, right now, if one chooses Frog in the PICCL workflow, it runs everything it has. Could you replicate the Frog option selection/deselection in the PICCL workflow, as you have it in the Frog web application? Thanks! Martin

proycon commented 6 years ago

Yeah, that would be possible.

zeusttu commented 6 years ago

+1! It would be much appreciated if at least the dependency parser could be disabled. Running Frog with all options enabled on a book with several hundreds of pages requires an immense amount of memory to the point of being inadvisable in production environments.

Currently a PICCL job that includes Ticcl and Frog runs out of memory and crashes when given a pre-OCR'ed PDF of the book Max Havelaar (downloaded from Google Books, in case you want to reproduce this yourself). Monitoring the memory usage of the machine during this job reveals that Ticcl does not use much memory at all, but that Frog ends up eating all 12 gigabytes of memory the machine has, plus the 800 megabytes of available swap space, shortly before it crashes.

I have attached a memory usage graph. The axis labels are in Dutch (sorry for that). The blue line represents the machine's total memory usage, minus the idle-state offset (determined as the minimum memory usage over the measured interval). The orange line represents the swap usage. The vertical red lines mark the start of the job, the transition from Ticcl to Frog and (less visible) the moment where the job crashes according to the log file, respectively.

[Figure: memory and swap usage while running PICCL with Ticcl and Frog on Max Havelaar]

proycon commented 6 years ago

Hmm, I thought I had already disabled the dependency parser by default in the current version, but indeed that seems not to be the case. I'll implement this right away then.

Thanks for the graph, that's quite insightful. Even without the dependency parser, Frog remains a memory-based system, so memory usage will be on the higher end. For our purposes I consider 12GB quite a low amount of memory, and I wouldn't really suggest a system with less than 32GB (our production server has 512GB, although that is shared with all other services).

I think the speed could be improved here as well (on a proper multicore system); we might have a recurrence of #13 here. I'll see if I can improve the pipeline a bit, as it wasn't optimized yet.

zeusttu commented 6 years ago

Wow that's a fast response, thanks! From Frog's help text I think it should be as simple as passing --skip=p to frog.

proycon commented 6 years ago

Yes, indeed, it's a matter of passing a simple option, the frog.nf workflow in PICCL takes the same arguments.

I've now implemented this in the latest development version (git master branch), but haven't tested it yet. I implemented it the other way round: users can select which modules they want to run, with a few preselected.
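
A minimal sketch of how a "select what to run" choice could be turned into Frog's skip option. This is not the code from the PICCL commit. The `SKIP_LETTERS` table and the `skip_argument` helper are hypothetical names, and only the letter `p` (dependency parser) is confirmed in this thread; the other letters are assumptions based on Frog's help text and should be checked against the installed version.

```python
# Hypothetical mapping from module name to Frog --skip letter.
# Only "p" (dependency parser) is confirmed in this thread;
# the remaining letters are assumptions to verify against `frog --help`.
SKIP_LETTERS = {
    "tokenizer": "t",
    "lemmatizer": "l",
    "morph": "a",
    "chunker": "c",
    "mwu": "m",
    "ner": "n",
    "parser": "p",
}


def skip_argument(selected):
    """Build a --skip argument covering every module that was NOT selected.

    Returns an empty string when all modules are selected.
    """
    selected = set(selected)
    unknown = selected - set(SKIP_LETTERS)
    if unknown:
        raise ValueError(f"unknown module(s): {sorted(unknown)}")
    # Keep the letters in the table's order so the result is deterministic.
    letters = "".join(
        letter for name, letter in SKIP_LETTERS.items() if name not in selected
    )
    return f"--skip={letters}" if letters else ""


# Everything except the dependency parser is selected.
preselected = ["tokenizer", "lemmatizer", "morph", "chunker", "mwu", "ner"]
print(skip_argument(preselected))
```

With the parser left out of the selection, the helper produces the same `--skip=p` argument suggested earlier in the thread.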

zeusttu commented 6 years ago

I have not tested it either but looking at the diff I think it should work :slightly_smiling_face:

I've got one small tip about Python though: dictionaries have a get method, which is similar to indexing but returns a default value if the key does not exist. So some_dict.get("x", 5) is equivalent to some_dict["x"] if "x" in some_dict else 5. The default has to be passed positionally (dict.get accepts no keyword arguments, so some_dict.get("x", default=5) raises a TypeError); if you leave it out, it defaults to None. So "x" in some_dict and some_dict["x"] could be rewritten simply as some_dict.get("x"), which behaves the same in a boolean context (it yields None rather than False for a missing key). I hope you appreciate my unsolicited code review, if not then I apologise :sweat_smile:
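
A short illustration of the `dict.get` behaviour described here. The `params` dictionary and its keys are made up for the example and are not taken from the PICCL code.

```python
# Hypothetical parameter dictionary, for illustration only.
params = {"frog": True}

# Indexing with a fallback, written out by hand.
explicit = params["parser"] if "parser" in params else False

# The same lookup with dict.get; the default is the second positional argument.
with_get = params.get("parser", False)

# Without a default, a missing key yields None.
missing = params.get("parser")

# The membership-then-index idiom yields False for a missing key,
# whereas get() yields None; both are falsy in a condition.
old_habit = "parser" in params and params["parser"]

# dict.get takes its arguments positionally only, so a keyword default fails.
try:
    params.get("parser", default=False)
    keyword_accepted = True
except TypeError:
    keyword_accepted = False
```

The one difference to keep in mind is the return value for a missing key: `False` from the membership idiom and `None` from `get`, which only matters if the result is compared with `is False` or serialised.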

proycon commented 6 years ago

> I have not tested it either but looking at the diff I think it should work :)

Yes, that's what I thought too, but that is usually when things go wrong in my experience ;)

I know about the dict.get() method yes, I just forget to use it and get stuck in old habits sometimes ;) So perhaps this helped ;)