HazyResearch / pdftotree

:evergreen_tree: A tool for converting PDF into hOCR with text, tables, and figures being recognized and preserved.
MIT License

use np.sum to compute sum #122

Closed. maldil closed this 2 years ago.

maldil commented 2 years ago

Description of the problems or issues

Is your pull request related to a problem? Please describe.

Thank you very much for your excellent work in analysiscenter/batchflow. I am a graduate student at the University of Colorado-Boulder, studying best practices for evolving ML code. According to our research, one of the most common best practices in the evolution of ML code is migrating loop-based computations to vectorized ones, since this usually improves performance. We made the following changes in batchflow, which remove the for loop and use NumPy APIs instead. I carefully checked the modification to ensure that it does not break the code. I would gladly contribute. Please help me merge this.

Does your pull request fix any issue? A possible performance issue.

Description of the proposed changes

Use np.sum to compute the sum of elements rather than using inefficient Python for loops.
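For illustration only, the kind of change described above might look like the following sketch. The actual diff is not shown in this thread, so the box layout `(x0, y0, x1, y1)` and the function names `total_width_loop` and `total_width_np` are hypothetical and not taken from pdftotree.

```python
import numpy as np

# Hypothetical data: each box is (x0, y0, x1, y1). The real structure of
# `boxes` in the project is not shown in this thread.
boxes = [(0.0, 0.0, 2.0, 1.0), (1.0, 1.0, 4.0, 3.0), (2.0, 0.0, 3.0, 5.0)]


def total_width_loop(boxes):
    """Accumulate box widths with an explicit Python for loop."""
    total = 0.0
    for x0, _, x1, _ in boxes:
        total += x1 - x0
    return total


def total_width_np(boxes):
    """Same result, using a list comprehension passed to np.sum."""
    return float(np.sum([x1 - x0 for x0, _, x1, _ in boxes]))
```

Note that the second version still iterates over `boxes` in Python to build the list, and np.sum then has to convert that list into an array before summing, so a speedup is not guaranteed, particularly for short lists.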

Test plan

I ran make test as described in the contribution guidelines.

maldil commented 2 years ago

Hi @lukehsiao

Yes, you got me. I am an author of the project R-CPATMiner. However, this pull request was not auto-generated by the tool. If it had been, this error would have been avoided. It is a human error: I did not pay attention to mentioning the project name correctly. I'm sorry for the mistake.

Yes, you are correct. However, a number of studies comparing list comprehensions with Python for loops make the case for list comprehensions in terms of efficiency. Using both a list comprehension and np.sum may yield a greater performance benefit even if it increases the number of iterations. I am happy to conduct a performance test for you if you could give me an idea of what the variable boxes typically holds. I also think that this update is more Pythonic and cuts down on the number of lines of code.
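A performance test of the kind offered here could be sketched with the standard-library timeit module, as below. The input is a plain list of floats standing in for values derived from boxes, which is an assumption because the real contents of boxes are not given in this thread. The helper names (`sum_loop`, `sum_listcomp_np`, `sum_array_np`, `benchmark`) are hypothetical. No timings are quoted because they depend on the machine and the input size.

```python
import timeit

import numpy as np


def sum_loop(values):
    """Plain Python for loop."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_listcomp_np(values):
    """List comprehension followed by np.sum (list is converted to an array)."""
    return float(np.sum([v for v in values]))


def sum_array_np(arr):
    """np.sum on data that is already a NumPy array."""
    return float(np.sum(arr))


def benchmark(n, repeats=100):
    """Time each variant on n synthetic values; returns seconds per variant."""
    values = [float(i) for i in range(n)]  # stand-in for real box data
    arr = np.asarray(values)
    return {
        "for loop": timeit.timeit(lambda: sum_loop(values), number=repeats),
        "list comp + np.sum": timeit.timeit(
            lambda: sum_listcomp_np(values), number=repeats
        ),
        "np.sum on array": timeit.timeit(lambda: sum_array_np(arr), number=repeats),
    }


if __name__ == "__main__":
    for n in (10, 1_000, 10_000):
        print(n, benchmark(n))
```

Running the three variants at several sizes helps separate the cost of the Python-level iteration from the cost of the summation itself, which is the point under discussion.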

Thanks again!