vkhodygo opened 1 year ago
This looks great! I'll aim to test it out ASAP, and it can likely be pulled. In previous attempts at converting to parquet, the three problems faced were:
As far as I can see your code seems to address all the above.
Suggestions:
- [ ] `file.path(scenario_id, "output", path_string, "csv")` as a safer, system-agnostic path specification? (See the sketch after this list.)
- [ ] `"^20\\d+_\\d+$"` will still catch all directories of runs (until we run it in the year 2100) without any danger of missing runs with either higher seeds or seeds not ending in `"00"`.
- [ ] If the `-b` passed to `do_full_run.sh` changes from 20 at all, this will break the hard-coded numbering system. Might be too niche a problem to worry about for now.
- [ ] `furrr::future_map` might be useful here to parallelise? Haven't tried running this yet; my own implementations used the annoying `read_csv_arrow`, which loaded the whole dataset (but didn't need schemas), whereas `open_csv_arrow` should, I presume, be faster/smoother, so parallelisation might not be possible/needed.
- [ ] `source("R/schemas/schemas.R")` if running from the project root?

I've noted some of the above suggestions as tick-boxes. Check off any you don't think need actioning and we can discuss/patch the others as needed.
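To make the first couple of suggestions concrete, a minimal sketch (the values, and the assumption that `path_string` names a run directory, are illustrative rather than taken from the actual code):

```r
# Illustrative values; the real ones come from the conversion script
scenario_id <- "baseline"
path_string <- "2029_100"  # assumed to be a run-directory name

# System-agnostic path construction instead of pasting "/" by hand
csv_dir <- file.path(scenario_id, "output", path_string, "csv")

# "^20\\d+_\\d+$" matches run directories such as "2029_1" or "2043_12345",
# so higher seeds, or seeds not ending in "00", are still picked up
run_dirs <- list.dirs(file.path(scenario_id, "output"),
                      recursive = FALSE, full.names = FALSE)
run_dirs <- run_dirs[grepl("^20\\d+_\\d+$", run_dirs)]
```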
I've fixed some of the issues and also replaced a hardcoded postfix with a parameter; however, the other methods don't incorporate this change yet.
We could employ `furrr::future_map`, but I have a few concerns. `arrow` is inherently parallel itself, and I don't want to create a cluster of threads that spawn threads of their own: that approach invites bottlenecks and crashes, so better safe than sorry. Besides, this code is reasonably performant as it is; at the moment it's about an hour per scenario.
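If we ever did combine the two, capping `arrow`'s own pool inside each worker should keep the thread counts sane. A sketch only; the worker count, scenario list, and the call signature are assumed:

```r
library(furrr)

scenario_ids <- c("baseline", "reform")  # hypothetical
n_workers <- 2
plan(multisession, workers = n_workers)

converted <- future_map(scenario_ids, function(id) {
  # Give each worker an equal share of the cores for arrow's internal pool,
  # so furrr workers and arrow threads don't oversubscribe the machine
  arrow::set_cpu_count(max(1L, parallel::detectCores() %/% n_workers))
  convert.single_simulation(id)  # signature assumed
})
```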
Regarding your last point, I'm not sure how it works in R. Any suggestions?
P.S. I'd also like you to take a look at `convert.single_simulation`. It's in need of code consolidation (`switch`?); could you please give it a go?
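Something along these lines, perhaps? A rough sketch only: the output names, CSV file names, and schema objects below are invented, and the real `convert.single_simulation` will differ.

```r
library(arrow)
source("R/schemas/schemas.R")  # assumed to define person_schema, household_schema, ...

convert_one_output <- function(run_dir, output) {
  # One switch collapses the per-output branches into a single lookup
  spec <- switch(output,
    persons    = list(csv = "Person.csv",    schema = person_schema),
    households = list(csv = "Household.csv", schema = household_schema),
    stop("unknown output type: ", output)  # default branch
  )
  ds <- open_dataset(file.path(run_dir, spec$csv),
                     format = "csv",
                     schema = spec$schema,
                     skip = 1)  # header row is already described by the schema
  write_dataset(ds, file.path(run_dir, "parquet", output), format = "parquet")
}
```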
Hi Vlad, thanks for the update. I was working on this the other week and hit some barriers to progress, at the same time as my simpaths parallel runs stopped working all over again while I attempted to generate a test dataset to work on. Spent hours watching it repeatedly crash after a tiny number of runs, but then noticed that the home drive is 99% full?? Ran out of time that week to get to the bottom of it. Will pick this up again in the coming weeks.
> but then noticed that the home drive is 99% full
That's my fault, I apologize. For some reason the quotas don't work, and I also need a lot of space to store ML models and the synthetic results produced. Meanwhile, try using `/storage` for that.
Finally got round to testing this after a lot of problems. I think that with the tweaks in #12 it works perfectly, giving usable parquet files for baseline/reform runs.
This should close #5.
I'd like to know what you think about this.