KeshiaRose / Basic-CSV-WDC

A Tableau Web Data Connector for connecting to CSVs hosted on the web.
https://basic-csv-wdc.herokuapp.com/
MIT License

Didn't get the data #5

Closed · PurshotamSingh closed this issue 4 years ago

PurshotamSingh commented 4 years ago

Why is this connector unable to load the raw data from the given link into Tableau? It takes a very long time but never manages to open it: "https://raw.githubusercontent.com/microsoft/Bing-COVID-19-Data/master/data/Bing-COVID19-Data.csv"

KeshiaRose commented 4 years ago

Hi @PurshotamSingh, it looks like the CSV file is too big for the app server to handle. However, you could deploy your own version of the application and allocate more resources to your own personal app. -Keshia

rferraton commented 3 years ago

Hello @KeshiaRose, first, thanks for the CSV WDC, which works well for small files. I tried it both directly on your Heroku app and on the Heroku app I built from your repo.

I also tested it on my laptop using a local Node.js server (no resource constraints there). Unfortunately, when trying a file bigger than 2 MB, the qtwebengine.exe process increases its CPU consumption (up to 20%) and never manages to load the file.

Note: if you cancel the import and retry it, another qtwebengine process spawns and starts eating 20% more CPU; you can easily saturate a machine this way...

I think (but I may be wrong) that the getSchema part takes too long (row by row, field by field), and maybe it could be done using only a small part of the file (100 rows, for example). I also think the multi-table handling is more than needed (my opinion).
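
For illustration, a rough sketch (not the WDC's actual code) of what type detection limited to the first 100 rows could look like in plain JavaScript; the type names and helper functions here are made up:

```js
// Hypothetical sketch: guess a column's type from a sample of its values.
function guessType(values) {
  const nonEmpty = values.filter(v => v !== "" && v != null);
  if (nonEmpty.length === 0) return "string";
  if (nonEmpty.every(v => /^(true|false)$/i.test(v))) return "bool";
  if (nonEmpty.every(v => /^-?\d+$/.test(v))) return "int";
  if (nonEmpty.every(v => !isNaN(Number(v)))) return "float";
  if (nonEmpty.every(v => !isNaN(Date.parse(v)))) return "datetime";
  return "string";
}

// Build a schema from only the first `sampleSize` data rows.
function inferSchema(headers, rows, sampleSize = 100) {
  const sample = rows.slice(0, sampleSize);
  return headers.map((name, i) => ({
    id: name,
    dataType: guessType(sample.map(row => row[i]))
  }));
}
```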

There is also an npm package named fast-csv that may help: fast-csv examples

I tried it against the following URL, the evolution of French COVID incidence at department and age-group level, which is a semicolon-separated CSV of 11 MB.

For now I have gone back to another solution using the CData CSV JDBC driver, which allows a URL as a source, but I definitely prefer your approach using a WDC (nothing to install on the client or server side).

I am using: Tableau 2021.1, Node.js 14, Windows 10.

regards

KeshiaRose commented 3 years ago

Thanks for the feedback @rferraton! I especially appreciate that you actually looked at the code and offered not only suggestions but a sample! This makes it really easy to test things out. I've gone back through and refactored the WDC to add a few things, including a new Fast Mode that skips the row-by-row typing and cleaning, and schema detection in Data Typing Mode that only samples the first 100 rows.

Let me know if the new Fast Mode works for you!

@PurshotamSingh You can also now use Fast Mode for the bigger CSV you mentioned!!

-Keshia

rferraton commented 3 years ago

Hello @KeshiaRose, thanks a lot for these improvements!

Fast Mode works like a charm! Data retrieval is fast even for "big" files (at least the 11 MB one given as the source).

Data Typing Mode is better now because the schema is retrieved fast (using the first 100 rows), but data retrieval is still too slow for files over 5 MB, and with a file of that size this mode heats my CPUs a lot :-)

I am not experienced enough with Node.js and JS to dare push something now; I am more of an SQL guy. If you have some good links for learning Node.js + JS + Tableau WDC + debugging, I would greatly appreciate them.

I forked your repo and will ask some of my colleagues to have a look at what we could do to improve the Data Typing Mode.

Thx a lot

/Romain

KeshiaRose commented 3 years ago

Wonderful, glad to hear it!! Yes, the data cleaning function is what takes the most time. I'm doing some things to clean up the data for Tableau, like turning the string "true" into an actual boolean true. I'll have to try out the data typing without data cleaning and see if everything still comes in OK; once you tell Tableau the data type, mismatched values can cause problems. I'm always glad to get new ideas though, so let me know if you guys come up with some helpful changes! As for learning more JS, I really like https://javascript.info/, and for an all-up Node.js + WDC tutorial check out this live stream I did a while back: https://youtu.be/JyteK-EXbLs
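
For readers following along, here is a rough sketch of the kind of per-cell cleaning described above (an illustration under my own assumptions, not the WDC's actual function):

```js
// Hypothetical sketch: coerce one raw CSV cell to the type the schema declared.
function cleanValue(value, dataType) {
  if (value === "" || value == null) return null;
  switch (dataType) {
    case "bool":
      return /^true$/i.test(value); // the string "true" becomes an actual boolean true
    case "int":
      return parseInt(value, 10);
    case "float":
      return parseFloat(value);
    case "datetime":
      return new Date(value);
    default:
      return String(value);
  }
}
```

Running a check like this on every cell of a large file is what makes the cleaning pass dominate the load time.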

rferraton commented 3 years ago

Thanks for the links, and for all your efforts!

I think that if you do the data typing on the first 100 rows during schema retrieval, then you don't need to double-check all the data. If errors happen, it means the file is a bad file. Even though row rejection is a cool feature, it can cause a hidden loss of data, don't you think?

I have many ideas and challenges for you :-) Do you want to use a GitHub project for that?

rferraton commented 3 years ago

Sorry, I misspoke when I wrote:

There is also an npm package named fast-csv that may help: fast-csv examples

I tried it against the following URL, the evolution of French COVID incidence at department and age-group level, which is a semicolon-separated CSV of 11 MB.

I pasted the npm fast-csv reference at the end of my message and forgot that the following sentence would read as if it referred to it. I didn't try fast-csv; it was just a suggestion, and given the performance of Fast Mode it may not be a good idea anyway.
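
For reference, an untried sketch of what parsing with fast-csv might look like; the file name, delimiter, and sample size here are purely illustrative:

```js
const fs = require("fs");
const csv = require("fast-csv");

const rows = [];
fs.createReadStream("data.csv")
  .pipe(csv.parse({ headers: true, delimiter: ";", maxRows: 100 })) // sample only
  .on("error", err => console.error(err))
  .on("data", row => rows.push(row)) // each row is an object keyed by header
  .on("end", rowCount => console.log(`Parsed ${rowCount} rows`));
```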

KeshiaRose commented 3 years ago

All suggestions are welcome! 🙂 I actually write the code in Glitch and then export to GitHub, but I think suggestions are easier to do in GitHub. Yeah, the data cleaning is really just guardrails for bad CSVs 🤣. I could probably do a third option: data typing on 100 rows + no cleaning, or something.

rferraton commented 3 years ago

I tried Fast Mode with _determineTypes (with the URL I gave in my previous question): it works very well, proper data types for getSchema and a fast import for getData. Maybe a third option as you said, but I think you could use a single mode for getSchema (using _determineTypes) and leave the Data Typing / Fast Mode choice to getData only.

I will wait for you to launch a GitHub project in this repo whenever you want.

KeshiaRose commented 3 years ago

I think you're right that I could just get rid of the fast part of the getSchema section, but I like the idea of having a mode that just brings in the data without any parsing or fiddling, so instead I added a third mode with the best of both worlds. "Loose Typed Mode" will use the _determineTypes method in the schema but will run the parse the way you suggested, with dynamicTyping set to true. Check it out! I was able to make an extract from the Bing COVID file this issue was opened about (~180MB - https://media.githubusercontent.com/media/microsoft/Bing-COVID-19-Data/master/data/Bing-COVID19-Data.csv) in about a minute.
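
To make the loose typing concrete, a minimal sketch assuming the parsing is done with Papa Parse (the library the dynamicTyping option comes from); the URL is a placeholder and this is not the WDC's exact code:

```js
// Assumes Papa Parse is loaded on the page (e.g. via a <script> tag, as in a WDC).
const csvUrl = "https://example.com/data.csv"; // placeholder URL

Papa.parse(csvUrl, {
  download: true,       // fetch the CSV from the URL
  header: true,         // first row becomes the field names
  dynamicTyping: true,  // numbers and booleans are converted automatically
  skipEmptyLines: true,
  complete: results => {
    // results.data is an array of row objects, already loosely typed,
    // ready to hand to Tableau without a separate cleaning pass.
    console.log(results.data.length, "rows parsed");
  }
});
```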

rferraton commented 3 years ago

It works great! It is an excellent idea: Fast Mode for bad files and Loose Typed Mode for good big files :-)

I think the next step could be to save the world... I mean, use less storage and allow compressed CSVs as a source, to save storage (and network and energy).

What do you think about that ?

Regards

KeshiaRose commented 3 years ago

Oooh interesting, haven't thought about that one, do you have an example? Not sure if there is enough need for that use case.

rferraton commented 3 years ago

Yes, it is used! And to my mind, compression should be used more intensively (to save space, network, and energy).

For example data, you can find 2020 France real estate sales here: https://files.data.gouv.fr/geo-dvf/latest/csv/2020/

A small file (for one department in 2020): https://files.data.gouv.fr/geo-dvf/latest/csv/2020/departements/01.csv.gz
A bigger file for all of France: https://files.data.gouv.fr/geo-dvf/latest/csv/2020/full.csv.gz
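
To make the idea concrete, a minimal sketch (an assumption about how it could be added, not an existing feature of the WDC) of fetching and decompressing a .csv.gz on the Node.js side with the built-in zlib module:

```js
const https = require("https");
const zlib = require("zlib");

// Download a gzipped CSV and resolve with the decompressed text.
function fetchCsvGz(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      const gunzip = zlib.createGunzip();
      let csv = "";
      res.pipe(gunzip); // decompress the response stream on the fly
      gunzip.on("data", chunk => (csv += chunk));
      gunzip.on("end", () => resolve(csv));
      gunzip.on("error", reject);
    }).on("error", reject);
  });
}

// Example with the small single-department file linked above.
fetchCsvGz("https://files.data.gouv.fr/geo-dvf/latest/csv/2020/departements/01.csv.gz")
  .then(csv => console.log(csv.slice(0, 200)));
```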

Regards

KeshiaRose commented 3 years ago

Interesting! I don't have much time right now to work on this but I could see this getting added in the future!