vkhodygo opened 1 year ago
This looks great! I'll aim to test it out ASAP, and it can likely be pulled. In previous attempts at converting to parquet, the three problems faced were:
As far as I can see your code seems to address all the above.
Suggestions:
- [ ] `file.path(scenario_id, "output", path_string, "csv")` as a safer, system-agnostic path specification? (See the sketch after this list.)
- [ ] `"^20\\d+_\\d+$"` will still catch all directories of runs (until we run it in the year 2100) without any danger of missing runs with either higher seeds or seeds not ending in `"00"`.
- [ ] If the `-b` passed to `do_full_run.sh` changes from 20 at all, this will break the hard-coded numbering system. Might be too niche a problem to worry about for now.
- [ ] `furrr::future_map` might be useful here to parallelise? Haven't tried running this yet; my own implementations used the annoying `read_csv_arrow`, which loaded the whole dataset (but didn't need schemas), whereas `open_csv_arrow` should, I presume, be faster/smoother, so parallelisation might not be possible/needed.
- [ ] `source("R/schemas/schemas.R")` if running from the project root?

I've noted some of the above suggestions as tick-boxes. Check off any you don't think need actioning and we can discuss/patch the others as needed.
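To make the first couple of suggestions concrete, a minimal sketch (the values, and the assumption that `path_string` names a run directory, are illustrative rather than taken from the actual code):

```r
# Illustrative values; the real ones come from the conversion script
scenario_id <- "baseline"
path_string <- "2029_100"  # assumed to be a run-directory name

# System-agnostic path construction instead of pasting "/" by hand
csv_dir <- file.path(scenario_id, "output", path_string, "csv")

# "^20\\d+_\\d+$" matches run directories such as "2029_1" or "2043_12345",
# so higher seeds, or seeds not ending in "00", are still picked up
run_dirs <- list.dirs(file.path(scenario_id, "output"),
                      recursive = FALSE, full.names = FALSE)
run_dirs <- run_dirs[grepl("^20\\d+_\\d+$", run_dirs)]
```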
I've fixed some of the issues and also replaced a hardcoded postfix with a parameter; however, the other methods don't incorporate this change yet.
We could employ `furrr::future_map`, but I have a few concerns. `arrow` is inherently parallel itself, and I don't want to create a cluster of threads that spawn threads of their own: that approach invites bottlenecks and crashes, so better safe than sorry. Besides, this code is reasonably performant as it is; at the moment it's about an hour per scenario.
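If we ever did combine the two, capping `arrow`'s own pool inside each worker should keep the thread counts sane. A sketch only; the worker count, scenario list, and the call signature are assumed:

```r
library(furrr)

scenario_ids <- c("baseline", "reform")  # hypothetical
n_workers <- 2
plan(multisession, workers = n_workers)

converted <- future_map(scenario_ids, function(id) {
  # Give each worker an equal share of the cores for arrow's internal pool,
  # so furrr workers and arrow threads don't oversubscribe the machine
  arrow::set_cpu_count(max(1L, parallel::detectCores() %/% n_workers))
  convert.single_simulation(id)  # signature assumed
})
```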
Regarding your last point, I'm not sure how it works in R. Any suggestions?
P.S. I'd also like you to take a look at `convert.single_simulation`. It's in need of code consolidation (`switch`?); could you please give it a go?
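Something along these lines, perhaps? A rough sketch only: the output names, CSV file names, and schema objects below are invented, and the real `convert.single_simulation` will differ.

```r
library(arrow)
source("R/schemas/schemas.R")  # assumed to define person_schema, household_schema, ...

convert_one_output <- function(run_dir, output) {
  # One switch collapses the per-output branches into a single lookup
  spec <- switch(output,
    persons    = list(csv = "Person.csv",    schema = person_schema),
    households = list(csv = "Household.csv", schema = household_schema),
    stop("unknown output type: ", output)  # default branch
  )
  ds <- open_dataset(file.path(run_dir, spec$csv),
                     format = "csv",
                     schema = spec$schema,
                     skip = 1)  # header row is already described by the schema
  write_dataset(ds, file.path(run_dir, "parquet", output), format = "parquet")
}
```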
Hi Vlad, thanks for the update. I was working on this the other week and hit some barriers to progress, at the same time as my simpaths parallel runs stopped working all over again while I attempted to generate a test dataset to work on. Spent hours watching it repeatedly crash after a tiny number of runs, but then noticed that the home drive is 99% full?? Ran out of time that week to get to the bottom of it. Will pick this up again in the coming weeks.
> but then noticed that the home drive is 99% full
That's my fault, I apologize. For some reason the quotas don't work, and I also need a lot of space to store ML models and the synthetic results produced. Meanwhile, try using `/storage` for that.
Finally got round to testing this after a lot of problems. I think that with the tweaks in #12 it works perfectly, giving usable parquet files for baseline/reform runs.
This should close #5.
I'd like to know what you think about this.