intake pipeline : the segmentation is (very) slow

PatGendre commented 4 years ago

I've reset the pipeline and run it again on all the traceur fabmob data, which is not that much: my user, for example, has 30109 background locations, which is among the highest counts. The computation time can be very long, especially for trip segmentation and section segmentation. For example, for my user data, trip segmentation takes 8 hours! (Section segmentation took a more reasonable 10 minutes.)

Could it be due to a lack of memory when the data size gets too large? (Only 3 users have a very long processing time.)
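One rough way to test the memory hypothesis is to estimate how much memory a table of that size occupies. The sketch below is only an illustration: it assumes the locations sit in a pandas DataFrame, and the column names `ts`, `latitude` and `longitude` are hypothetical, not taken from the e-mission data model. Real entries carry more fields, so treat the result as a lower bound.

```python
import numpy as np
import pandas as pd

# Number of background locations reported above for one user.
n_points = 30109

# Synthetic stand-in data; the column names are hypothetical.
rng = np.random.default_rng(0)
locations = pd.DataFrame({
    "ts": np.arange(n_points, dtype="float64"),
    "latitude": rng.uniform(-90, 90, n_points),
    "longitude": rng.uniform(-180, 180, n_points),
})

# memory_usage(deep=True) reports bytes per column, plus the index.
total_bytes = int(locations.memory_usage(deep=True).sum())
print(f"approx. {total_bytes / 1e6:.2f} MB for {n_points} points")
```

Three float64 columns cost 24 bytes per row, so a table of this size stays under one megabyte. If the real data is of the same order of magnitude, memory pressure alone seems an unlikely explanation for a run time of several hours.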

Note also that the console log shows a warning for the segmentation stage:

/root/anaconda3/envs/emission/lib/python3.6/site-packages/pandas/core/indexing.py:179: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)
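For context, pandas raises this warning when a value is assigned into a DataFrame that was itself obtained by slicing or filtering another one, so it is unclear whether the original is modified. The sketch below shows two common ways to make the intent explicit. The DataFrame and its columns (`ts`, `speed`, `is_stop`) are made up for illustration; the thread does not say which line of the segmentation code triggers the warning.

```python
import pandas as pd

# Hypothetical sample data, not the e-mission schema.
points = pd.DataFrame({
    "ts": [0, 30, 60, 900],
    "speed": [1.2, 1.1, 0.0, 0.0],
})

# Pattern that typically triggers the warning (left commented out):
#   stopped = points[points["speed"] == 0.0]
#   stopped["is_stop"] = True

# Option 1: take an explicit copy, then modify the copy freely.
stopped = points[points["speed"] == 0.0].copy()
stopped["is_stop"] = True

# Option 2: write to the original DataFrame in a single .loc call.
points["is_stop"] = False
points.loc[points["speed"] == 0.0, "is_stop"] = True
```

As far as I know, the warning concerns correctness (the assignment may not reach the DataFrame you expect) rather than speed, so it is probably unrelated to the long run time.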

PatGendre commented 4 years ago

@shankari FYI, today I ran the pipeline for the user with the most background locations (ca. 66000 points in the timeseries db); the segmentation stage took 15 hours.

shankari commented 4 years ago

This is a known issue for trip segmentation. I think it is caused by the implementation of the algorithm, in which we iterate over the points one by one in python. Replacing the loop with vectorized operations (e.g. pandas) should make this much faster, but I don't have time to work on it right now. This may be a good fix to sneak in as part of the backend merger.
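To illustrate the difference between the two styles, here is a minimal sketch that computes the distance between consecutive points, first with a Python loop and then with array operations. It is not the e-mission segmentation algorithm: the function names and the `latitude` / `longitude` columns are assumptions made for the example, and the haversine formula stands in for whatever distance calculation the pipeline actually uses.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; accepts scalars or NumPy arrays."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))

def step_distances_loop(df):
    """Point-by-point version: one Python-level iteration per location."""
    if df.empty:
        return pd.Series(dtype="float64")
    dists = [np.nan]  # the first point has no predecessor
    for i in range(1, len(df)):
        prev, curr = df.iloc[i - 1], df.iloc[i]
        dists.append(haversine_m(prev["latitude"], prev["longitude"],
                                 curr["latitude"], curr["longitude"]))
    return pd.Series(dists, index=df.index)

def step_distances_vectorized(df):
    """Same result, computed on whole arrays in a single call."""
    lat = df["latitude"].to_numpy()
    lon = df["longitude"].to_numpy()
    dists = np.full(len(df), np.nan)
    # Pair every point with its predecessor by slicing the arrays.
    dists[1:] = haversine_m(lat[:-1], lon[:-1], lat[1:], lon[1:])
    return pd.Series(dists, index=df.index)

# Small made-up track: two one-degree steps starting at (0, 0).
track = pd.DataFrame({
    "latitude": [0.0, 0.0, 1.0],
    "longitude": [0.0, 1.0, 1.0],
})
loop_result = step_distances_loop(track)
vectorized_result = step_distances_vectorized(track)
```

Both functions return the same values. The loop version pays the interpreter and `iloc` overhead once per point, while the vectorized version does the arithmetic in compiled NumPy code. Whether the real segmentation logic can be rewritten this way depends on how much each step depends on the outcome of the previous one.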

PatGendre commented 4 years ago

@shankari Ok, thanks. I did not find an existing issue for this exact problem, so I opened this one. It is definitely not urgent, and I am pretty sure there will be a way to speed up this stage.

e-mission / e-mission-docs

intake pipeline : the segmentation is (very) slow #470