Appendium / flatpack

CSV/Tab Delimited and Fixed Length Parser and Writer
http://flatpack.sf.net
Apache License 2.0
57 stars 20 forks source link

Issues when parsing CSV files that extensively use text qualifiers #60

Open mvlakh opened 3 years ago

mvlakh commented 3 years ago

Hi,

I am using Camel to process CSV files and as I understand Camel utilizes Flatpack to parse CSV content. It looks like there are several defects in Flatpack that do not allow to parse CSV files properly if they use text qualifier a lot, looks like there are several edge cases when the library cannot handle content properly:

benoitx commented 3 years ago

Hi

Thanks for your report. I will gratefully accept a PR for a unit test (and a fix as well, if you can but test is good).

As far as I can see, the second example is not a valid CSV due to space before the double quote: Smith,"""Test....

Does Flatpack work if you remove that space?

Thanks

Benoit

On Mon, 26 Apr 2021 at 11:18, mvlakh @.***> wrote:

Hi,

I am using Camel to process CSV files and as I understand Camel utilizes Flatpack to parse CSV content. It looks like there are several defects in Flatpack that do not allow to parse CSV files properly if they use text qualifier a lot, looks like there are several edge cases when the library cannot handle content properly:

  • if there multiline string like this one the library handles it incorrectly:

@.***,"This is a long fragment of text that should be processed as a single field", 1988, 111-222-33,"another field with new line character that should be considered as a field of the same data row"

It looks like it tries to consume it as separate CSV rows and not as one row

  • if string starts with or contains escaped text qualifier characters that are part of the string value, the library tries to consume one string as several separate cells:

Bob, Smith, """Test"" , 2, Some string, still string, also part of the string.",11111111

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Appendium/flatpack/issues/60, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB542OGYCA6MM2XJGMHN7DTKU4WVANCNFSM43SSKJKQ .

mvlakh commented 3 years ago

Hello Benoit,

That space was a typo, in my case it was not present so I undated my past comment. Prepared a pull request with 2 unit tests that replicate the issues and one fix suggestion

Thank you, Mykhailo

benoitx commented 3 years ago

Hi Mykhailo

I'm sorry if I misunderstood but I only looked at the merged code now in more details and it seems that the test testCsvDocumentWithMultilineString if failing?

Didn't you provide a fix at the same time with the Pull Request?

Could you kindly double check?

Many thanks

Benoit

On Mon, 26 Apr 2021 at 15:13, mvlakh @.***> wrote:

Hello Benoit,

That space was a typo, in my case it was not present so I undated my past comment. Prepared a pull request with 2 unit tests that replicate the issues and one fix suggestion

Thank you, Mykhailo

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Appendium/flatpack/issues/60#issuecomment-826869096, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB542JZBMW5WWXWGQC3EUDTKVYIJANCNFSM43SSKJKQ .

mvlakh commented 3 years ago

Hello Benoit,

Yes it looks like you misunderstood me a bit. My pull requested contained 2 tests ,that reproduce 2 issues, and a fix only for one of these issues. So it is expected that the other test testCsvDocumentWithMultilineString still fails.

As a workaround for this issue with multiline string, that does not have a fix as of now, I implemented my own version of the function fetchNextRecord and injected it via inheritance. I do not know if my alternative implementation is completely correct but it solved the issues I had.

Attaching the class here for you to take a look. If this code makes sense you can try to use it instead of the existing impl to see how well it works for all the cases.

I do not want to create a pull request because this is an experimental change and I do not know if you can consume it as whole, most likely you will need to completely revisit this impl.

Thank you, Mykhailo

DelimiterParser.txt

benoitx commented 3 years ago

Thanks for getting back to me.

I have modified the base code following your suggestion but the CsvParserTest is still failing on the getRowCount() (L88) it is strange... does it pass for sure with your code?

I must have done something stupid...

Thanks

Benoit

On Thu, 27 May 2021 at 08:22, mvlakh @.***> wrote:

Hello Benoit,

Yes it looks like you misunderstood me a bit. My pull requested contained 2 tests ,that reproduce 2 issues, and a fix only for one of these issues. So it is expected that the other test testCsvDocumentWithMultilineString still fails.

As a workaround for this issue with multiline string, that does not have a fix as of now, I implemented my own version of the function fetchNextRecord and injected it via inheritance. I do not know if my alternative implementation is completely correct but it solved the issues I had.

Attaching the class here for you to take a look. If this code makes sense you can try to use it instead of the existing impl to see how well it works for all the cases.

I do not want to create a pull request because this is an experimental change and I do not know if you can consume it as whole, most likely you will need to completely revisit this impl.

Thank you, Mykhailo

DelimiterParser.txt https://github.com/Appendium/flatpack/files/6551895/DelimiterParser.txt

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Appendium/flatpack/issues/60#issuecomment-849399787, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAB542LLQ2WHZWG2AB4C5LDTPXXJPANCNFSM43SSKJKQ .

mvlakh commented 3 years ago

Hi,

Sorry for a delay, had a very busy week.

I made a mistake in my test and this is the reason it does not work for you. I have created a pull request that contains fixes for that test and my custom fetch next record function impl that fixes that test. Please take a look at it and see if that can help you to clarify everything.

The alternative fetch function implementation seems work fine, at least it solves my issues and I have not issues with it. But I cannot be sure that it covers all the cases.

Thank you, Mykhailo