fmichonneau / francoismichonneau.net

Personal website
https://francoismichonneau.net
MIT License

2022/08/arrow-dataset-creation/ #38

Open utterances-bot opened 1 year ago

utterances-bot commented 1 year ago

Creating an Arrow dataset | François Michonneau

An exploration of the file formats that Arrow can read and write.

https://francoismichonneau.net/2022/08/arrow-dataset-creation/

GuiAlDuS commented 1 year ago

Thanks for this helpful post, François. I converted a huge CSV file (56 GB) to Parquet, but the open_dataset() function of the R Arrow library gave me some weird issues with the imported CSV. I submitted a bug report, but because the CSV is so large (the full eBird dataset), it's difficult to share the file and make the "bug" fully reproducible. I double-checked the CSV with awk and used the Python Arrow library to import the CSV into Parquet, and both worked well; only the R library gave me the weird rows. Have you guys in Voltron heard of similar issues? Here is my bug report: https://issues.apache.org/jira/projects/ARROW/issues/ARROW-17432?filter=allopenissues Thanks!

fmichonneau commented 1 year ago

Hi @GuiAlDuS, I added a comment on your Jira issue. See if there is something you can do about keeping your identifiers as integers instead of doubles.

zakmn2022 commented 1 year ago

Hi François, I'm getting this error message in the R console when I run the "## download the data" step. Can you tell me whether it works or not?

walk(dates_to_get, download_daily_package_logs_csv)

Downloading data for 2022-06-01 ...
Error in map():
i In index: 1.
Caused by error in download.file():
! cannot open URL 'https://cran-logs.rstudio.com/2022/2022-06-01.csv.gz'
Run rlang::last_error() to see where the error occurred.
Warning message:
In download.file(url = url, destfile = file, method = "libcurl", :
URL 'https://cran-logs.rstudio.com/2022/2022-06-01.csv.gz': status was 'Couldn't connect to server'
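
The warning above reports a connection failure ('Couldn't connect to server'), not a problem in the R code itself, so it may be a temporary network or server issue. As a minimal sketch, assuming the failure is transient, the per-date download can be wrapped so that one failed date is reported and skipped instead of stopping the whole loop. The helper try_download() and the stand-in fake_downloader() are hypothetical names used only for illustration; they are not part of the original post.

```r
# Hypothetical helper (not from the original post): run a per-date downloader
# and return TRUE on success, FALSE on error, instead of aborting the loop.
try_download <- function(date, downloader) {
  tryCatch(
    {
      downloader(date)
      TRUE
    },
    error = function(e) {
      message("Skipping ", date, ": ", conditionMessage(e))
      FALSE
    }
  )
}

# Stand-in downloader for illustration only: it fails for one date,
# mimicking the "cannot open URL" error without touching the network.
fake_downloader <- function(date) {
  if (date == "2022-06-01") stop("cannot open URL")
  invisible(date)
}

dates <- c("2022-06-01", "2022-06-02")

# One logical per date: did the download succeed?
ok <- vapply(dates, try_download, logical(1), downloader = fake_downloader)

# Dates that failed and could be retried later.
failed <- dates[!ok]
```

Assuming download_daily_package_logs_csv() takes a date as its first argument, as the walk() call suggests, the real loop could be written as walk(dates_to_get, try_download, downloader = download_daily_package_logs_csv), and the failed dates retried once the connection works again.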