Definition of Done / Acceptance Criteria
OMOP ES now outputs partitioned parquet datasets. Arrow can write the data in partitions and makes it easy to recombine those partitions when reading.
Testing
@stefpiatek to add an example dataset here so that we can use it for testing. If it works on one partition then that should be fine. All of our parquet tests should be moved over to use this dataset.
Documentation
Update documentation in the CLI and its README
Dependencies
No response
Details and Comments
I think it should be as simple as this. If the partitioned_dataset_directory has two partitions, it would have a 1 and a 2 directory (e.g. 1/PERSON.parquet..., 2/PERSON.parquet...).
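
A minimal sketch of how the CLI could discover that layout, using only the standard library. The function name group_partitioned_files is hypothetical, and the sketch assumes that partitions are the numerically named subdirectories and that the parquet files sit directly inside each one; the real export may nest them differently. The resulting per-table file lists could then be handed to an Arrow reader (for example a list of paths passed to pyarrow.dataset.dataset with format="parquet") to recombine the partitions.

```python
from collections import defaultdict
from pathlib import Path


def group_partitioned_files(root: Path) -> dict[str, list[Path]]:
    """Map each table name (e.g. "PERSON") to its parquet files, one per partition.

    Hypothetical helper: assumes a layout like <root>/1/PERSON.parquet,
    <root>/2/PERSON.parquet, with one numbered directory per partition.
    """
    grouped: dict[str, list[Path]] = defaultdict(list)

    # Sort numerically so that partition 10 comes after partition 2.
    partitions = sorted(
        (p for p in root.iterdir() if p.is_dir() and p.name.isdigit()),
        key=lambda p: int(p.name),
    )

    for partition in partitions:
        # Assumption: parquet files live directly in the partition directory.
        for parquet_file in sorted(partition.glob("*.parquet")):
            grouped[parquet_file.stem].append(parquet_file)

    return dict(grouped)
```

A dataset with a single partition goes through the same code path, so a one-partition example dataset should be enough to exercise it in tests.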