bwiley1 / pandleau

A quick and easy way to convert a Pandas DataFrame to a Tableau .hyper or .tde extract.
MIT License
61 stars 19 forks source link

parsing 13m x 89 table is quite slow. Any suggestion for improvements? #16

Closed mmistroni closed 4 years ago

mmistroni commented 5 years ago

Hi, not really an issue, but I am using pandleau to create a .hyper file out of a 13m x 89 table. Columns are half strings and half numbers. The process takes quite a while (7 hours on a 16GB desktop). Was wondering if you could suggest potential improvements? I saw notes regarding Unicode slowing down Python; any tricks to get around the issue? Kind regards
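For context, the conversion being timed is essentially a one-liner around pandleau. A minimal sketch (the call names follow the project README and should be treated as assumptions if your version differs; the imports are guarded only so the example degrades gracefully when the dependencies are missing):

```python
# Minimal pandleau conversion (API names per the project README;
# hedged -- verify against your installed version).
try:
    import pandas as pd
    from pandleau import pandleau
    HAVE_DEPS = True
except ImportError:
    HAVE_DEPS = False

if HAVE_DEPS:
    # Stand-in frame; the real one is ~13m rows x 89 columns,
    # half strings and half numbers.
    df = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
    pandleau(df).to_tableau('example.hyper', add_index=False)
```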

bwiley1 commented 5 years ago

Hi mmistroni,

Thanks for making this point. Essentially, the Tableau SDK adds observations to the final Tableau file one row at a time. So while multithreading options in pandas can speed up the Python side (e.g. using lambda functions, etc.), there seems to be a ceiling imposed by the Tableau SDK itself. I've thought it might be a good idea to add observations concurrently, or to assign a row number to input values during SDK execution, but that is feedback for the Tableau development team (there's also a certain level of encryption in the tableausdk package; it would be nice to go from .tde back to a pandas DataFrame, but this is restricted). It would be great if you brought these up with the Tableau team!

Best, Ben
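To put the row-at-a-time ceiling in perspective, the per-row overhead can be measured on a small sample and projected to the full table before committing to a multi-hour run. This is an illustrative sketch only: a Python list append stands in for the SDK's per-row insert call, whose real cost is far higher.

```python
import time

def insert_one_at_a_time(rows, insert):
    # Mirrors the SDK pattern: one insert() call per row, no batching.
    for row in rows:
        insert(row)

# Time a 100k-row sample, then project to the full 13m rows.
sample = [(i, str(i)) for i in range(100_000)]
sink = []
t0 = time.perf_counter()
insert_one_at_a_time(sample, sink.append)
per_row = (time.perf_counter() - t0) / len(sample)
projected_hours = per_row * 13_000_000 / 3600
print('projected hours for 13m rows: %.4f' % projected_hours)
```

Swapping the stand-in `sink.append` for the real SDK insert on the same sample gives a realistic estimate for a specific machine.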

mmistroni commented 5 years ago

Thank you for getting back to me. To give you some figures: it takes approx. 7 hrs to create a .hyper file for a 12m x 89-column table in Python/pandas. I have attempted to use PySpark to speed up the process, and I have reduced it to 1.5 hrs, but it is not as reliable as pandas: sometimes Spark fails or raises an OOM, restarts the task, and that results in duplicates. Thanks for pointing out the issue with Tableau; I am going to ask Tableau support for advice, as tableausdk is a black box to me. I will surely keep you posted on the outcome, but as you said, there's not much that can be done on our side. Kind regards, Marco


mmistroni commented 5 years ago

Hey, so this is part of the same workflow I am running. Running on RHEL 7, using Python 2.7 + pandleau, I am getting a massive exception in pandleau.py:

OSError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.20' not found

I was using pandas 0.24.something. Downgrading to 0.20.3 didn't fix it. What did fix it was doing a local import of pandas rather than an import pandas at the top of the file.

I don't know what triggers it. It could be the fact that pandleau tries to be smart and detect whether you are using the old tableausdk API or the new one; I don't know. It seems that pandas plus the "from tableausdk import *" somehow causes this problem. Have you ever seen it? Would you know how to address it? Right now I had to copy-paste the code and remove the global import.

I am pretty sure this has to do with pandas, as when I run the Extract API Python samples I don't get any error, while when I edit a sample and add an import pandas, the Extract API sample blows up too. Any chance you can reproduce and help? Thanks
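The workaround boils down to deferring the import, so pandas' native extensions are loaded only after the Tableau SDK's shared libraries have already been resolved. A generic, stdlib-only sketch of the pattern (the stdlib decimal module stands in for pandas here, since the point is the import ordering, not the library itself):

```python
import importlib

def lazy_import(name):
    # Defer a module import to first use, so its native shared
    # libraries are resolved after any earlier imports -- the
    # load-order change that sidestepped the GLIBCXX clash.
    return importlib.import_module(name)

def to_values(rows):
    # A top-of-module "import pandas" would load its C extensions
    # immediately; importing inside the function changes the order.
    mod = lazy_import('decimal')  # stand-in for 'pandas'
    return [mod.Decimal(str(x)) for x in rows]

print(to_values([1.5, 2.5]))
```

This is a workaround for the symbol clash, not a root-cause fix; the underlying issue is which libstdc++ each extension gets linked against at load time.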

bwiley1 commented 4 years ago

Hey @mmistroni! It's been a long time :) Several other users have contributed to pandleau since, and in later versions performance has improved considerably. Let me know if the module runs faster now on this example, thanks!

mmistroni commented 4 years ago

Hello, I cannot test it on RHEL 7; I've got issues with pandas dependencies. However, the Extract API is still slow when extracting 14m x 370 columns, and that is due to the way the Extract API works: after reading other posts and Tableau forums, it does not support concurrency. As long as your Tableau data is under 2m rows, you probably get decent results. Anything bigger, and it does not really scale, at least in my experience. Will post if I have any further updates. Kind regards


bwiley1 commented 4 years ago

Hi @mmistroni - that's a good point... You're right, the tableausdk doesn't currently support concurrency, so there seems to be a limit on speedup through workarounds in python alone. If tableau makes an update to their sdk in the future, I'll be sure to incorporate this into pandleau. Thanks!

ghost commented 4 years ago

@mmistroni @bwiley1 , interestingly I've run into this now as well -- my data is extremely large (around 500m rows, 300 cols) and I'm estimating it to take around 10 hours to run.

However, there have to be some workarounds. A commercial product, Alteryx, can create TDEs; it calls the same DLLs as the SDK and processes the same file in ~30 minutes, which is obscenely fast, all things considered. I'll harass Tableau support to see if they have any guidance for working with larger datasets, or any insight as to how this one vendor was able to implement it so efficiently.

@mmistroni , any chance you'd be able to share some snippets of how you utilized pyspark for this?

mmistroni commented 4 years ago

Hello, sure... I found this project on the net (it's in Scala) and rewrote it in Python, as the two APIs are similar. My use case is that I have a massive DataFrame in Spark and I need to create a .hyper file out of it:

https://github.com/werneckpaiva/spark-to-tableau/blob/master/src/main/scala/tableau/TableauDataFrame.scala

Running Spark locally and extracting a 14m x 389 DataFrame takes 6 hrs. If you go below 1m rows you get decent times, in a matter of minutes.
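The linked Scala code essentially walks the DataFrame's rows and appends each one to the extract; since the Extract API has no concurrent writer, every partition ends up feeding one sequential sink, which is why partitioning alone doesn't break the ceiling. A language-neutral sketch of that loop (plain Python lists stand in for Spark partitions; the real version iterates the Spark DataFrame's partitions):

```python
def write_partitions(partitions, insert_row):
    # Rows are appended one at a time into a single extract; with
    # no concurrent writer available, all partitions are merged
    # into the same sequential stream.
    written = 0
    for partition in partitions:
        for row in partition:
            insert_row(row)
            written += 1
    return written

sink = []
n = write_partitions([[(1, 'a'), (2, 'b')], [(3, 'c')]], sink.append)
print(n)  # → 3
```

So Spark can parallelize everything upstream of the write, but the final insert loop stays single-threaded, matching the timings reported above.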

Alteryx will not do for me, as I need to generate .hyper files...

kind regards Marco

