NorthConcepts / DataPipeline-Examples

DataPipeline Examples
Apache License 2.0

Choosing only selected columns from my csv file in my reader #14

Closed ishansrivastava closed 4 years ago

ishansrivastava commented 4 years ago

Hello sir, is there a way to choose only selected columns in my reader of a CSV file? I also want to rename those columns to match the table columns in the DB. But I don't want to do this record by record, because that would degrade performance.

DeleTaylor commented 4 years ago

Hi, you can select fields using:

reader = new TransformingReader(reader)
            .add(new SelectFields("FirstName", "LastName", "Rating"));

And rename fields using:

reader = new TransformingReader(reader)
        .add(new RenameField("FirstName", "fname"))
        .add(new RenameField("LastName", "lname"));

You'll want to try setting the JDBC writer's batch size to 50, 100, or 1000 records for performance: JdbcWriter.setBatchSize(100).
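Wired into a full pipeline, the batch size setting looks roughly like this (a minimal sketch; the JDBC URL, credentials, file name, and table name are placeholders, and this assumes DataPipeline's JdbcWriter(Connection, tableName) constructor):

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;

import com.northconcepts.datapipeline.core.DataReader;
import com.northconcepts.datapipeline.csv.CSVReader;
import com.northconcepts.datapipeline.jdbc.JdbcWriter;
import com.northconcepts.datapipeline.job.Job;

// placeholder connection details -- use your own DB URL and credentials
Connection connection = DriverManager.getConnection("jdbc:...", "user", "password");

DataReader reader = new CSVReader(new File("input.csv"))
        .setFieldNamesInFirstRow(true);

// batch inserts in groups of 100 instead of one statement per record
JdbcWriter writer = new JdbcWriter(connection, "my_table")
        .setBatchSize(100);

Job.run(reader, writer);
```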

If you're doing a lot of processing in your pipeline, you can put reading (or writing or any parts of the pipeline) in a separate thread. AsyncReader is available in the Small Business edition and higher.

// buffers up to 10 MB using another thread;
// asyncReader.read() will pull from this buffer
AsyncReader asyncReader = new AsyncReader(reader)
    .setMaxBufferSizeInBytes(1024 * 1024 * 10);

Job.run(asyncReader, writer);

I normally suggest not worrying about performance unless you run into a problem or have specific up-front requirements you need to meet.


ishansrivastava commented 4 years ago

Actually I do have up-front requirements, sir. And I am also doing validations on each record, so what would you suggest to make the performance really efficient?

DeleTaylor commented 4 years ago

What up-front performance numbers do you need to meet and what are you seeing instead when you run your pipeline?

ishansrivastava commented 4 years ago

Well, I have to upload 100,000 records from CSV to the DB in under 10 minutes, and there need to be some validations on each record. As of now, 1,000 records are taking around 1.5 minutes.
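At the reported rate, those numbers imply roughly a 15x gap. A quick sanity check in plain Java, using only the figures from the comment above:

```java
public class ThroughputCheck {

    // projects total runtime from a measured sample: (total / sample) * sampleTime
    static double projectedMinutes(int totalRecords, int sampleRecords, double sampleMinutes) {
        return (double) totalRecords / sampleRecords * sampleMinutes;
    }

    public static void main(String[] args) {
        double projected = projectedMinutes(100_000, 1_000, 1.5);
        System.out.println("projected minutes: " + projected);   // 150.0 at the current rate
        System.out.println("speedup needed: " + projected / 10); // 15.0x to hit the 10-minute target
    }
}
```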

DeleTaylor commented 4 years ago

What JdbcWriter batch size are you using, from my recommendation above?

ishansrivastava commented 4 years ago

Hi Dele, actually now there is a bigger problem. As far as I understood, CSVReader needs a file path, but in my case I have deployed the code on a central server, not locally, so I can't give a local file path there. Since I am choosing a file to be uploaded, is there a way I can get the file path, or can I no longer use CSVReader?

DeleTaylor commented 4 years ago

Take a look at CSVReader's constructors; passing a File is not the only option.

See: CSVReader JavaDoc
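For example, if the file arrives as an upload stream rather than a path on disk, it can be wrapped in a java.io.Reader. This sketch assumes CSVReader has a Reader-based constructor (check the JavaDoc above), and uploadStream stands in for the InputStream your server gets from the file upload:

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import com.northconcepts.datapipeline.csv.CSVReader;

// uploadStream: the InputStream from the HTTP file upload (placeholder)
InputStream uploadStream = request.getInputStream();

CSVReader reader = new CSVReader(
        new InputStreamReader(uploadStream, StandardCharsets.UTF_8))
        .setFieldNamesInFirstRow(true);
```

This way the CSV is read directly from the upload without ever writing it to a server-side path.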

ishansrivastava commented 4 years ago

Hi Dele, thank you so much. As you advised, I used batchSize() and it's inserting 10,000 records in just 4 seconds now.

DeleTaylor commented 4 years ago

That's great to hear.