datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data-processing flows in Python.
https://dataflows.org
MIT License

Join takes too long time (or hangs) to process the data #66

Closed zelima closed 5 years ago

zelima commented 5 years ago

I'm trying to solve this exercise https://github.com/ViderumGlobal/programming-exercise but join takes so long to process the data that I thought it had simply hung and could not finish the task. I don't see any while loops in join.py, so I doubt I'm stuck in an infinite loop, which makes me think it's just slow.

I simplified the code

from dataflows import Flow, load, join, printer, filter_rows

def filter_over_10(rows):
    # drop rows whose 'order' value is greater than 10
    for row in rows:
        if row.get('order') is not None and row.get('order') > 10:
            continue
        yield row

res = Flow(
        load('data/movies/datapackage.json'),
        load('data/credits/datapackage.json'),
        filter_over_10,
        filter_rows(not_equals=[{'revenue': 0}], resources=['tmdb_5000_movies']),
        filter_rows(not_equals=[{'gender': 0}], resources=['tmdb_5000_credits']),
        join('tmdb_5000_movies', ['id'], 'tmdb_5000_credits', ['id'], fields={'revenue':{}}, full=False),
        printer(),
).results()
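(For reference, the `filter_over_10` step on its own is just a row generator that drops rows whose `order` exceeds 10; it can be sanity-checked in isolation, without dataflows, like this:)

```python
# Standalone check of the filter logic: rows with order > 10 are dropped,
# rows with order <= 10 or with no 'order' field pass through unchanged.
def filter_over_10(rows):
    for row in rows:
        if row.get('order') is not None and row.get('order') > 10:
            continue
        yield row

rows = [{'order': 5}, {'order': 11}, {'name': 'no-order'}]
print(list(filter_over_10(rows)))
# [{'order': 5}, {'name': 'no-order'}]
```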
akariv commented 5 years ago

Are you using the speedup version (the one using leveldb)?
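(For context: a join like this is conceptually a hash join on `id` — one resource is indexed by the join key once, then each row of the other resource is matched by lookup. The sketch below is plain illustrative Python, not the dataflows internals; the real implementation buffers rows in a key-value store, which is where the leveldb speedup comes in.)

```python
# Minimal hash-join sketch (illustrative only). Indexing the source once
# and probing per target row makes the join O(n + m) instead of the
# O(n * m) of a naive nested-loop comparison.
def hash_join(source_rows, target_rows, key, fields):
    index = {}
    for row in source_rows:
        # keep only the fields we want to carry over (cf. fields={'revenue': {}})
        index[row[key]] = {f: row[f] for f in fields}
    for row in target_rows:
        match = index.get(row[key])
        if match is not None:  # full=False: unmatched rows are dropped
            yield {**row, **match}

movies = [{'id': 1, 'revenue': 100}, {'id': 2, 'revenue': 0}]
credits = [{'id': 1, 'name': 'a'}, {'id': 3, 'name': 'b'}]
print(list(hash_join(movies, credits, 'id', ['revenue'])))
# [{'id': 1, 'name': 'a', 'revenue': 100}]
```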

On Tue, Jan 29, 2019, 07:35 Irakli Mchedlishvili notifications@github.com wrote:

  • movies is ~4000 rows
  • credits ~40000 after the filter

zelima commented 5 years ago

speedup version?

akariv commented 5 years ago

See here: https://github.com/datahq/dataflows/blob/b818aaac1b70abee8abc48fdf1b7933acfee335b/TUTORIAL.md#installation
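(For anyone landing here later: per the linked tutorial, the speedup variant is installed via an optional extra. The extras name below is taken from the dataflows docs at the time and may change:)

```shell
# Install dataflows with the leveldb-backed speedup extra (pulls in plyvel).
# The quotes keep the brackets from being interpreted by shell globbing.
pip install 'dataflows[speedup]'
```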


zelima commented 5 years ago

That's a lot faster

akariv commented 5 years ago

:D
