MRC-CSO-SPHSU / simpaths_parallel

Code for running SimPaths in parallel
0 stars 0 forks source link

Parquet support #10

Open vkhodygo opened 1 year ago

vkhodygo commented 1 year ago

This should close #5.

I'd like to know what you think about this.

andrewbaxter439 commented 1 year ago

This looks great! I'll aim to try testing it out ASAP and can likely be pulled. In previous tries of converting to parquet the three problems faced were:

As far as I can see your code seems to address all the above.

andrewbaxter439 commented 1 year ago

Suggestions:

https://github.com/MRC-CSO-SPHSU/simpaths_parallel/blob/efb4cfa21d1aeb13c506996766a063004a5189e4/R/convert_parquet.R#L72-L75

https://github.com/MRC-CSO-SPHSU/simpaths_parallel/blob/efb4cfa21d1aeb13c506996766a063004a5189e4/R/convert_parquet.R#L194

https://github.com/MRC-CSO-SPHSU/simpaths_parallel/blob/efb4cfa21d1aeb13c506996766a063004a5189e4/R/convert_parquet.R#L121

https://github.com/MRC-CSO-SPHSU/simpaths_parallel/blob/efb4cfa21d1aeb13c506996766a063004a5189e4/R/convert_parquet.R#L198

https://github.com/MRC-CSO-SPHSU/simpaths_parallel/blob/efb4cfa21d1aeb13c506996766a063004a5189e4/R/convert_parquet.R#L7

I've noted some of the above suggestions as tick-boxes. Check off any you don't think need actioned and we can discuss/patch others as needed.

vkhodygo commented 1 year ago

I've fixed some of the issues, I also replaced a hardcoded postfix with a parameter, however, other methods do not incorporate this change yet.

We could employ furrr::future_map, but I have a few concerns. arrow is inherently parallel itself, and I don't want to create a cluster of threads that spawn threads of their own. This approach leads to potential bottlenecks and crashes, better safe than sorry. Besides, this code is reasonably performant as it is, at the moment it's about an hour per scenario.

Regarding your last point, I'm not sure how it works in R. Any suggestions?

P.S. I also would like you to take a look at convert.single_simulation. It's in need of code consolidation (switch?), could you please give it a go?

andrewbaxter439 commented 1 year ago

Hi Vlad, thanks for update. I was working on this the other week and hit some barriers to progress at the same time as my simpaths parallel runs stopped working all over again as I attempted to generate a test dataset to work on. Spent hours watching it repeatedly crash after a tiny number of runs, but then noticed that the home drive is 99% full?? Ran out of time to give this to solve that week. Will pick this up again in the coming weeks.

vkhodygo commented 1 year ago

but then noticed that the home drive is 99% full

That's my fault, I apologize. For some reason quotas don't work, I also need a lot of space to store ML models and produced synthetic results. Meanwhile, try using /storage for that.

andrewbaxter439 commented 1 year ago

Finally got round to testing this after a lot of problems. I think with the tweaks in #12 it works perfectly in giving useable parquet files for baseline/reform runs.