Widen / tabitha

Tabular data reading, writing, and processing library for JVM languages.
MIT License

Tabitha starting halfway through CSV instead of at the beginning #20

Closed by twilco 6 years ago

twilco commented 6 years ago

When I iterate through the attached CSV with Tabitha (version 0.2.0), it appears to start halfway through the file rather than at the beginning. Here's my code:

try (InputStream delimFileStream = fileUrl.openStream();
     RowReader reader = RowReaderFactory.open(delimFileStream)
                                        .orElseThrow(() -> new OpeningReaderException(String.format(
                                            "Could not open reader for URL (%s), MIME type of %s.",
                                            url,
                                            fileMime))
                                        )
                                        .withInlineHeaders()) {

    // Read the first row to log its header; guard against an empty file instead of risking an NPE.
    reader.read().ifPresent(firstRow ->
        firstRow.header().ifPresent(header -> log.info("Header: " + Arrays.toString(header.toArray()))));

    reader.forEach(row -> {
        List<String> strs = new ArrayList<>();
        row.forEach(cell -> strs.add(cell.getString().orElse("empty")));
        log.info(strs.toString());
    });
}

This prints the following:

Header: [, CA, United States, 12/1/03 19:13, 2/5/09 22:11, 35.36583, -120.84889]
[1/4/09 16:59, Product1, 1200, Visa, Amy, Parramatta, New South Wales, Australia, 1/3/09 22:35, 2/5/09 22:44, -33.8166667, 151]
[1/30/09 11:56, Product1, 1200, Mastercard, Whitney, Dumbleton, England, United Kingdom, 7/31/08 13:46, 2/6/09 0:04, 52.0166667, -1.9666667]
[1/6/09 5:10, Product1, 1200, Visa, Astrid, Altlengbach, Lower Austria, Austria, 6/24/08 0:49, 2/6/09 0:37, 48.15, 15.9166667]
[1/14/09 3:39, Product1, 1200, Visa, jo, Ballincollig, Cork, Ireland, 12/10/08 7:41, 2/6/09 2:36, 51.8833333, -8.5833333]
.... roughly 500 more items...
[1/8/09 11:55, Product1, 1200, Diners, julie, Haverhill, England, United Kingdom, 11/29/06 13:31, 3/1/09 7:28, 52.0833333, 0.4333333]
[1/12/09 21:30, Product1, 1200, Visa, Julia , Madison                     , WI, United States, 11/17/08 22:24, 3/1/09 10:14, 43.07306, -89.40111]

The row Tabitha registers as the header (the first row it finds) is actually line 526 of the CSV.

Turning this CSV into an XLSX via Microsoft Excel and then using that as the input to the code above works correctly (ignore the "empty" strings - I realize that's my problem and not a Tabitha problem):

Header: [Transaction_date, Product, Price, Payment_Type, Name, City, State, Country, Account_Created, Last_Login, Latitude, Longitude]
[empty, Product1, empty, Visa, Betina, Parkville                   , MO, United States, empty, empty, empty, empty]
[empty, Product1, empty, Mastercard, Federica e Andrea, Astoria                     , OR, United States, empty, empty, empty, empty]
...all the rest of the items...

The CSV is attached as a .txt because GitHub does not allow CSV files as attachments.

SalesJan2009.txt salesjan2009xl.xlsx

sagebind commented 6 years ago

I am able to reproduce this behavior. Oddly enough, using a File instead of an InputStream fixes the issue. This works:

import com.widen.tabitha.RowReaderFactory

RowReaderFactory.open(new File("SalesJan2009.csv")).get().withInlineHeaders().withCloseable { reader ->
    def header = false

    reader.forEach {
        if (!header) {
            header = true
            println("header: " + it.header().get())
        }
        println("row: " + it)
    }
}

This does not:

import com.widen.tabitha.RowReaderFactory

RowReaderFactory.open(new FileInputStream("SalesJan2009.csv")).get().withInlineHeaders().withCloseable { reader ->
    def header = false

    reader.forEach {
        if (!header) {
            header = true
            println("header: " + it.header().get())
        }
        println("row: " + it)
    }
}
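
Since the File overload behaves correctly, one possible workaround while the InputStream path is broken is to spool the stream to a temporary file first and open that instead. The following is only a sketch under that assumption: StreamSpooler and spoolToTempFile are hypothetical names, not part of Tabitha, and the helper uses nothing but the JDK.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper (not part of Tabitha): copies a stream to a temp file
// so the file-based open path can be used instead of the stream-based one.
final class StreamSpooler {
    private StreamSpooler() {}

    static File spoolToTempFile(InputStream in, String suffix) {
        try {
            // createTempFile already creates an empty file, so the copy must replace it.
            Path tmp = Files.createTempFile("tabitha-input-", suffix);
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            File file = tmp.toFile();
            // Best-effort cleanup when the JVM exits.
            file.deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException("Could not spool stream to a temporary file", e);
        }
    }
}
```

The returned File could then be passed to RowReaderFactory.open(File), the call that works in the first example above. Keeping a file extension such as ".csv" as the suffix may matter if format detection looks at the file name, though that is a guess rather than something confirmed in this thread.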