LIST-LUXEMBOURG / iguess

iGuess 1.0 - The iGuess implementation in Rails
GNU General Public License v3.0

Harvester is slow #267

Open ldesousa opened 9 years ago

ldesousa commented 9 years ago

The harvester.py script takes 11 minutes to run, which might become an issue in the long term. Can it be optimised somehow?

uleopold commented 9 years ago

It would be important to know what is slowing down the script.

Looking at the code, the loops are the likely cause of the slowdown. Vectorised programming would probably improve this. Avoiding loops and if/else statements where possible might improve speed substantially. You probably do not need to loop over each row in the database, as this is not an inherently sequential operation; the rows could be processed at once.

There are constructs such as map() and list comprehensions. See here: https://wiki.python.org/moin/PythonSpeed/PerformanceTips https://www.python.org/doc/essays/list2str/
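As a rough illustration of the constructs mentioned above, the sketch below applies the same per-row step with an explicit loop, a list comprehension, and `map()`. The `rows` list and the `clean_title` function are hypothetical stand-ins, not taken from the harvester script. All three forms still visit every row once, so they differ only in per-iteration interpreter overhead.

```python
# Hypothetical stand-in for the rows of the datasets table.
rows = [{"id": i, "title": f" dataset {i} "} for i in range(200)]


def clean_title(row):
    """Hypothetical per-row step: trim whitespace and upper-case the title."""
    return row["title"].strip().upper()


# 1. Explicit for loop with append.
titles_loop = []
for row in rows:
    titles_loop.append(clean_title(row))

# 2. List comprehension: same iteration, less bookkeeping per item.
titles_comprehension = [clean_title(row) for row in rows]

# 3. map(): returns a lazy iterator in Python 3, so wrap it in list().
titles_map = list(map(clean_title, rows))
```

With only a couple of hundred rows, the difference between these forms is likely to be negligible next to any network or database latency.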

ldesousa commented 9 years ago

It would be wiser to understand the cause of the slowdown before proposing random solutions. There are fewer than 200 rows in the datasets table, so looping through them is certainly not the cause.

A programme without loops and ifs will not do much; the map function also produces a loop, just of a different kind. Also, keep in mind that Python is an interpreted language.

uleopold commented 9 years ago

This is a discussion, not a solution. Profiling would certainly help to identify the causes. It is precisely because Python is interpreted that you need vectorised programming, which is why I suggested looking into it. The same holds in other languages, e.g. R. Anyway, profiling will probably reveal the bottleneck.
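A minimal sketch of what such profiling could look like with the standard-library `cProfile` and `pstats` modules follows. The functions `fetch_capabilities`, `update_rows` and `harvest` are hypothetical placeholders for the script's real work, with `time.sleep` standing in for a slow external request.

```python
import cProfile
import io
import pstats
import time


def fetch_capabilities(n):
    """Hypothetical slow step, standing in for a network request."""
    time.sleep(0.01)
    return list(range(n))


def update_rows(records):
    """Hypothetical fast step, standing in for a loop over table rows."""
    return [r * 2 for r in records]


def harvest():
    """Hypothetical entry point; not the real harvester code."""
    total = 0
    for _ in range(5):
        total += len(update_rows(fetch_capabilities(200)))
    return total


# Collect profiling data only around the call of interest.
profiler = cProfile.Profile()
profiler.enable()
result = harvest()
profiler.disable()

# Render the ten most expensive entries, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

An existing script can also be profiled without modification by running it as `python -m cProfile -s cumulative <script>`. In either case, the functions at the top of the cumulative-time column are the first candidates to inspect.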

ldesousa commented 9 years ago

I have never seen the term "vectorised programming" before. Perhaps you are referring to array programming, but that is completely unrelated to this issue. It is also unrelated to the fact that Python is an interpreted language.

Profiling could help, but there are certainly easier ways to study this issue.
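One lightweight option, offered only as an assumption about what an easier approach might look like, is to time each stage of the script with `time.perf_counter`. The stage labels and the `timed` helper below are hypothetical, and `time.sleep` stands in for a slow external call.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage label.
timings = {}


@contextmanager
def timed(label):
    """Hypothetical helper: add the elapsed time of the block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


# Hypothetical stages; in a real script these would wrap the actual steps.
with timed("fetch"):
    time.sleep(0.05)  # stands in for a remote request

with timed("update"):
    sum(range(1000))  # stands in for in-memory processing of rows

# Print the stages from slowest to fastest.
for label, seconds in sorted(timings.items(), key=lambda item: item[1], reverse=True):
    print(f"{label}: {seconds:.3f} s")
```

If one stage accounts for most of the run time, that is where any optimisation effort should go.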

uleopold commented 9 years ago

Vectorised is a well-established term. It applies in particular to interpreted languages such as Matlab, R and Python, where operations are applied to whole lists of strings or arrays.

As interpreted languages are very slow at looping, it is advisable to avoid loops where possible, in particular inner loops, by using vectorised programming. The only problem is that you need to rethink the implementation when avoiding loops. All of the above is quite relevant, and it would be more productive to study the literature first before flagging terms as unrelated and inappropriate.

ldesousa commented 9 years ago

I clearly lack the knowledge for this task.