datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading data into and unloading data from Apache Cassandra(R), DataStax Astra, and DataStax Enterprise (DSE).
Apache License 2.0

Support URL files with up to millions of lines #457

Open adutra opened 1 year ago

adutra commented 1 year ago

This came up while reviewing #399: some users are using giant urlfiles with millions of URLs inside.

Urlfiles were not designed for files of this size: currently, when a urlfile is parsed, all the parsed URL instances are held in memory at once. See AbstractFileBasedConnector#loadURLs.

We should modify that method to return a Flux instead, and merge it with other fluxes.
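
To illustrate the idea, here is a minimal sketch of reading a urlfile lazily instead of collecting every URL up front. It is not the DSBulk implementation: `LazyUrlFile`, `lazyUrls` and `toUrl` are hypothetical names, and it uses a plain `java.util.stream.Stream` rather than a Reactor `Flux` so that it runs with the JDK alone. Skipping blank lines and lines starting with `#` is also an assumption about the urlfile format.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

public class LazyUrlFile {

  // Hypothetical helper (not DSBulk code): returns a lazy stream of URLs.
  // Lines are read from disk only as the stream is consumed, so memory use
  // does not grow with the number of lines in the urlfile.
  // The caller must close the returned stream to release the file handle.
  static Stream<URL> lazyUrls(Path urlfile) throws IOException {
    return Files.lines(urlfile, StandardCharsets.UTF_8)
        .map(String::trim)
        // Assumption: blank lines and '#' comment lines are skipped.
        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
        .map(LazyUrlFile::toUrl);
  }

  // Converts one line to a URL, failing with a message that names the line.
  static URL toUrl(String line) {
    try {
      return URI.create(line).toURL();
    } catch (MalformedURLException | IllegalArgumentException e) {
      throw new IllegalArgumentException("Invalid URL in urlfile: " + line, e);
    }
  }

  public static void main(String[] args) throws IOException {
    // Build a small throwaway urlfile for demonstration purposes.
    Path tmp = Files.createTempFile("urlfile", ".txt");
    Files.write(tmp, Arrays.asList("# comment", "file:///tmp/a.csv", "", "file:///tmp/b.csv"));
    try (Stream<URL> urls = lazyUrls(tmp)) {
      urls.forEach(System.out::println);
    } finally {
      Files.delete(tmp);
    }
  }
}
```

With Reactor on the classpath, a stream like this could presumably be wrapped as `Flux.using(() -> lazyUrls(path), Flux::fromStream, Stream::close)`, which ties the file handle's lifetime to the subscription and yields a `Flux` that can be merged with the others.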
