Appendium / flatpack

CSV/Tab Delimited and Fixed Length Parser and Writer
http://flatpack.sf.net
Apache License 2.0

Wrong values for mapped columns #53

Closed dmitryallen closed 4 years ago

dmitryallen commented 4 years ago

Hello, I'm trying to parse a large file using a mapping XML. Once the file is parsed, I try to get values by column name, but it returns wrong values. For example, for column "Program" it returns a date, which is the value from the previous column "RefDate", and "PRIMARY_PHONE_NUM" returns the value of MEMBERID.

Thanks Dmitry

POM:

<dependency>
    <groupId>net.sf.flatpack</groupId>
    <artifactId>flatpack</artifactId>
    <version>4.0.4</version>
</dependency>

Code:

    try {
        File file = ResourceUtils.getFile("classpath:ColumnMapping.xml");
        mappingFileName = file.getAbsolutePath();
    }
    catch (Throwable e)
    {
        throw new RuntimeException("Cannot read mapping", e);
    }

    try (FileReader pzmap = new FileReader(mappingFileName);
         FileReader fileToParse = new FileReader(fileName);
         BuffReaderDelimParser pzparse = (BuffReaderDelimParser) BuffReaderParseFactory.getInstance()
                 .newDelimitedParser(pzmap, fileToParse,',', FPConstants.NO_QUALIFIER , true)) {
        // delimited by a comma
        // no text qualifier (FPConstants.NO_QUALIFIER)
        // ignore first record (the header row)
        pzparse.setHandlingShortLines(true);
        pzparse.setIgnoreParseWarnings(true);
        pzparse.setIgnoreExtraColumns(true);

        final DataSet ds = pzparse.parse();

        colNames = ds.getColumns();

        while (ds.next()) {
            for (final String colName : colNames) {
                System.out.println("COLUMN NAME: " + colName + " VALUE: " + ds.getString(colName));
            }

            System.out.println("===========================================================================");
        }

        if (ds.getErrors() != null && !ds.getErrors().isEmpty()) {
            System.out.println("FOUND ERRORS IN FILE");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

Mapping:

<?xml version='1.0'?>
<!DOCTYPE PZMAP SYSTEM
        "flatpack.dtd" >
<PZMAP>
    <COLUMN name="Program" />
    <COLUMN name="MEMBERID" />
    <COLUMN name="PRIMARY_PHONE_NUM" />
    <COLUMN name="SECONDARY_PHONE_NUM" />
    <COLUMN name="CHANGE_INDICATOR" />
</PZMAP>

CSV:

RefDate,Program,MEMBERID,LNAME,FNAME,DOB,GENDER_CD,ADDR_LINE_1,ADDR_LINE_2,CITY_NM,ST_CD,ZIP_CD,PRIMARY_PHONE_NUM,SECONDARY_PHONE_NUM,LANG_NM,Hearing,PLan_Level,ENROLLED_GR,GR_TOTAL_REWARD_AMOUNT,GR_ACTIVITIES_ HRA1_$10,GR_ACTIVITIES_ HRA2_$10,GR_ACTIVITIES_ AWV_$15,GR_ACTIVITIES_ BONUS_$50,GR_ACTIVITIES_ BCS_$75,GR_ACTIVITIES_ DSC_$100,GR_ACTIVITIES_ CO_$50,TRANSPORTATION,OTC_AMOUNT_Q,OTC_AMOUNT_Y,Flu_shot,MEMBERID2,CHANGE_INDICATOR
2/12/2020,12,22548000*01,P,J L,8/13/1972,F,707 LOUCKS RD,,YORK,PA,17404,7177181215,,ENGLISH,Y,Diamond,N,$75.00 ,N,Y,Y,Y,N,N,N,50,$300 (Diamond),$1200 (Diamond),Y,22548000,
2/12/2020,12,22548000*01,N,L K,9/17/1979,F,5621 HAYS ST,,PITTSBURGH,PA,15206,4125033775,,ENGLISH,N,Diamond,N,$75.00 ,N,Y,Y,Y,N,N,N,50,$300 (Diamond),$1200 (Diamond),N,22548000,

benoitx commented 4 years ago

Hi Dmitry

Thank you for taking the time to provide that much information. I will have a look in a few days, but from a quick scan of your email I can see that the CSV file has a first column which is not in the PZMap.

You can also parse a CSV file without any PZMap; the header row would then supply the column names.

Can you add that first column to the PZMap and see if that works?

Alternatively, use CsvParserFactory.newXXX and then record.getString("your column name");

Let me know.

Benoît


benoitx commented 4 years ago

Hi

Has my suggestion fixed your issue? I will try to look at the code this weekend.

Benoit

dmitryallen commented 4 years ago

Thanks Benoit. Unfortunately I did not have time to try your solution; I have switched to SuperCSV. Please close this issue.

Your library caught my attention because of the mapping, and I was planning to use it for data ingestion into a large database. The headers in my case can vary, except for a small number of columns, which can be in different positions in different files.

Best regards, Dmitry

benoitx commented 4 years ago

Thanks for getting back to me Dmitry.

Interesting. One of the powerful features of Flatpack is that the column order is not important and you do not need to know the columns in a 'static' way (e.g. in an XML mapping). It discovers the columns itself, and it can even ensure that they are unique if you have two "name" columns, for instance. The column names can be case insensitive too, so "name" or "NaMe" would both be handled in the code with dataSet.getString("name"). It also handles multi-line CSV, which is quite rare.
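To make the idea of header discovery with case-insensitive lookup concrete, here is a minimal plain-Java sketch. It is not Flatpack's implementation and does not use Flatpack's API: the class `HeaderLookupSketch` and its methods are hypothetical names invented for illustration, the comma split is naive (no qualifiers, no multi-line values), and a repeated header name simply keeps its first position rather than being made unique.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: resolves column names from a CSV header row, ignoring case. */
final class HeaderLookupSketch {

    /** Builds a lower-cased name -> position index from the header row. */
    static Map<String, Integer> indexHeader(String headerLine) {
        Map<String, Integer> index = new HashMap<>();
        String[] names = headerLine.split(",", -1);
        for (int i = 0; i < names.length; i++) {
            // putIfAbsent keeps the first position when a name repeats (assumption of this sketch)
            index.putIfAbsent(names[i].trim().toLowerCase(Locale.ROOT), i);
        }
        return index;
    }

    /** Returns the cell for the named column, or null if the name is unknown or the row is short. */
    static String getString(Map<String, Integer> index, String row, String column) {
        Integer pos = index.get(column.toLowerCase(Locale.ROOT));
        String[] cells = row.split(",", -1);
        return (pos == null || pos >= cells.length) ? null : cells[pos];
    }

    public static void main(String[] args) {
        // First three header names and cells from the CSV in the report
        Map<String, Integer> index = indexHeader("RefDate,Program,MEMBERID");
        String row = "2/12/2020,12,22548000*01";
        System.out.println(getString(index, row, "program"));  // 12
        System.out.println(getString(index, row, "MemberId")); // 22548000*01
    }
}
```

Because the index is built from the file's own header, the lookup keeps working when columns move between files, which is the situation described earlier in the thread.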

Anyhow, thanks for the test case, I will improve Flatpack.

I regularly process multi-GB files as a stream() of Record via Flatpack, and it works very well.

Kind regards

Benoit

dmitryallen commented 4 years ago

Thanks Benoit, I hope my bug report gives you useful information for improving the product.

This shift of column data looks like a bug, or maybe I missed something in the configuration.

Best regards, Dmitry

benoitx commented 4 years ago

Hi Dmitry

Just an update for the record. I think that the issue is due to a misunderstanding of the interface. Allow me to explain.

If you specify the PZMap, you are actually specifying the columns in sequential order; Flatpack will NOT use the headers that are in the file. So, in your example, you have said that the first column is Program even though the first column in the file is 'RefDate'.
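The effect of naming columns by position can be reproduced with a few lines of plain Java. This is only a sketch of the behaviour described above, not Flatpack's code: `PositionalMappingSketch` and `bindByPosition` are hypothetical names, and the row is split naively on commas. The mapped names and the cells are taken from the mapping and CSV in the report.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: shows what happens when mapped names are applied by position. */
final class PositionalMappingSketch {

    /** Binds the i-th mapped name to the i-th cell of the row, ignoring the file's own header. */
    static Map<String, String> bindByPosition(List<String> mappedNames, String row) {
        String[] cells = row.split(",", -1);
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < mappedNames.size() && i < cells.length; i++) {
            record.put(mappedNames.get(i), cells[i]);
        }
        return record;
    }

    public static void main(String[] args) {
        // Column names in the order given by the mapping in the report
        List<String> mapped = Arrays.asList("Program", "MEMBERID", "PRIMARY_PHONE_NUM",
                "SECONDARY_PHONE_NUM", "CHANGE_INDICATOR");
        // Leading cells of the first data row in the report
        String row = "2/12/2020,12,22548000*01,P,J L,8/13/1972";
        Map<String, String> record = bindByPosition(mapped, row);
        System.out.println(record.get("Program"));           // 2/12/2020 (the RefDate cell)
        System.out.println(record.get("MEMBERID"));          // 12 (the Program cell)
        System.out.println(record.get("PRIMARY_PHONE_NUM")); // 22548000*01 (the MEMBERID cell)
    }
}
```

The output matches the symptoms in the original report: "Program" yields the date and "PRIMARY_PHONE_NUM" yields the member ID, because five names were laid over the first five of thirty-two columns.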

One could argue that the factory method should not allow you to specify a PZMap AND whether to skip the first row or not. But there might be cases where the header row will always be in the file and the header NAMES might change but not the order. You would then use the program as you have defined it, BUT you must specify every column, or at least the sequence of columns up to the last one you are interested in.
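As a rough aid for writing such a mapping, the following plain-Java sketch lists the header names from the first column through the last one of interest and prints them as COLUMN elements in the same shape as the mapping in the report. `MappingPrefixSketch` and `columnsUpToLastWanted` are hypothetical names, not part of Flatpack, and the header is split naively on commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

/** Illustrative only: derives the positional column list a PZMap would need. */
final class MappingPrefixSketch {

    /** Returns the header names from the first column through the last wanted one. */
    static List<String> columnsUpToLastWanted(String headerLine, Collection<String> wanted) {
        List<String> header = Arrays.asList(headerLine.split(",", -1));
        int last = -1;
        for (int i = 0; i < header.size(); i++) {
            if (wanted.contains(header.get(i))) {
                last = i; // remember the right-most column of interest
            }
        }
        // Empty when none of the wanted names appear in the header
        return new ArrayList<>(header.subList(0, last + 1));
    }

    public static void main(String[] args) {
        // Leading header names from the CSV in the report
        String header = "RefDate,Program,MEMBERID,LNAME,FNAME";
        List<String> needed = columnsUpToLastWanted(header, Arrays.asList("Program", "MEMBERID"));
        for (String name : needed) {
            System.out.println("    <COLUMN name=\"" + name + "\" />");
        }
    }
}
```

For the file in the report, CHANGE_INDICATOR is the last column, so this approach would list every header name; in that case parsing without a PZMap is likely the simpler option.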

I trust that the explanation makes sense.

No bug here but I will add a test case with your data and example.

Thank you

Benoit