fubar-coder / beanio

Automatically exported from code.google.com/p/beanio
Apache License 2.0

Add support for filtering rows in a delimited file (wish list) #48

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
I have a few very large files where I'm only interested in parsing a small 
subset of the rows into beans. Parsing all the rows using BeanIO then filtering 
them out using Java has higher memory requirements than filtering the rows as 
they're being parsed. My request is for expression-based filtering on 
individual fields like:

<field name="mappingType" position="1" filter="INTERESTING_RECORD"/>

In this case, only rows with the string literal "INTERESTING_RECORD" as the 
second column would be parsed; all other rows would be skipped.

Original issue reported on code.google.com by d...@daveboden.com on 3 Jan 2013 at 10:15

GoogleCodeExporter commented 9 years ago
This is already partly supported.  Any matched records (identified by fields 
where rid="true") that are not explicitly bound to a class are skipped.

However, the fact that you must declare record mappings for ignored records is 
cumbersome (but still recommended if you are at all worried about the integrity 
of the file).  It is also difficult to match records based on a field that does 
NOT match a particular regex pattern or literal.  So there are two things I'm 
going to look into:

1.  A stream level setting for ignoring unmatched records.
2.  A field level setting to invert the matched literal or regex pattern, 
something like not="true".
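
To make the second item concrete, here is a minimal sketch of what an inverted match could mean. FieldMatcher and its methods are hypothetical names invented for this illustration; they are not part of the BeanIO API, and the not="true" attribute is only a proposal at this point in the thread.

```java
import java.util.regex.Pattern;

/** Hypothetical field matcher; not part of the BeanIO API. */
final class FieldMatcher {
    private final Pattern pattern;
    private final boolean negate; // plays the role of the proposed not="true"

    private FieldMatcher(Pattern pattern, boolean negate) {
        this.pattern = pattern;
        this.negate = negate;
    }

    /** Matches the exact literal text (regex metacharacters are quoted). */
    static FieldMatcher literal(String literal, boolean negate) {
        return new FieldMatcher(Pattern.compile(Pattern.quote(literal)), negate);
    }

    /** Matches the whole field text against a regular expression. */
    static FieldMatcher regex(String regex, boolean negate) {
        return new FieldMatcher(Pattern.compile(regex), negate);
    }

    /** Returns true when the record should be identified by this field. */
    boolean matches(String fieldText) {
        boolean hit = pattern.matcher(fieldText).matches();
        return negate != hit; // negate flips the result of the plain match
    }

    public static void main(String[] args) {
        FieldMatcher isNot = FieldMatcher.literal("INTERESTING_RECORD", true);
        System.out.println(isNot.matches("INTERESTING_RECORD"));
        System.out.println(isNot.matches("SOMETHING_ELSE"));
    }
}
```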

And one point of clarification, filtering unmarshalled objects as opposed to 
records does NOT increase memory requirements assuming you are processing the 
file one record at a time (which is recommended for large files).  A better 
argument would be an increase in CPU utilization, although probably not 
substantial.
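
The memory point can be illustrated with a plain-Java sketch that deliberately does not use BeanIO. It assumes a comma-delimited input and a hypothetical Mapping bean. Every row is turned into a bean and then filtered, yet only one bean is referenced at a time, so memory use does not grow with file size; the extra cost is the work of building beans that are then discarded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.function.Consumer;

final class StreamingFilter {
    /** Hypothetical bean holding the first two columns of a row. */
    static final class Mapping {
        final String id;
        final String mappingType;

        Mapping(String id, String mappingType) {
            this.id = id;
            this.mappingType = mappingType;
        }
    }

    /**
     * Reads one row at a time, builds a bean for each row with at least two
     * columns, and passes only the wanted beans to the sink.
     * Returns the total number of rows read.
     */
    static int process(Reader source, String wantedType, Consumer<Mapping> sink) {
        int rows = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                rows++;
                String[] cols = line.split(",", -1);
                if (cols.length < 2) {
                    continue; // too short to carry a mapping type
                }
                Mapping bean = new Mapping(cols[0], cols[1]); // "unmarshal" the row
                if (wantedType.equals(bean.mappingType)) {
                    sink.accept(bean);
                }
                // A rejected bean is unreachable after this iteration.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String data = "1,INTERESTING_RECORD\n2,OTHER\n3,INTERESTING_RECORD\n";
        int rows = process(new StringReader(data), "INTERESTING_RECORD",
                bean -> System.out.println("kept " + bean.id));
        System.out.println(rows + " rows read");
    }
}
```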

Original comment by kevin.s...@gmail.com on 4 Jan 2013 at 2:45

GoogleCodeExporter commented 9 years ago

Original comment by kevin.s...@gmail.com on 4 Jan 2013 at 2:46

GoogleCodeExporter commented 9 years ago
Maybe you could expose an API that calls a user-defined callback during 
marshalling/unmarshalling.  That way filtering, encryption/decryption, or 
other processing could be implemented by the user.
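
One possible shape for such a callback, as a sketch only: RecordCallback, afterUnmarshal and CallbackDemo are hypothetical names, and BeanIO does not define them. Returning an Optional lets a single hook either drop a bean (filtering) or replace it with a transformed one (for example, after decryption).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical callback; BeanIO does not define this interface. */
interface RecordCallback<T> {
    /** Returns the bean to keep (possibly transformed), or empty to drop it. */
    Optional<T> afterUnmarshal(T bean);
}

final class CallbackDemo {
    /** Applies the callback to each bean and collects the ones it keeps. */
    static <T> List<T> applyAll(Iterable<T> beans, RecordCallback<T> callback) {
        List<T> kept = new ArrayList<>();
        for (T bean : beans) {
            callback.afterUnmarshal(bean).ifPresent(kept::add);
        }
        return kept;
    }

    public static void main(String[] args) {
        // Drops values starting with '#'; upper-casing stands in for any transformation.
        RecordCallback<String> callback = value -> value.startsWith("#")
                ? Optional.<String>empty()
                : Optional.of(value.toUpperCase());
        System.out.println(applyAll(Arrays.asList("a", "#skip", "b"), callback));
    }
}
```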

Original comment by cw10...@gmail.com on 23 Jan 2013 at 7:58

GoogleCodeExporter commented 9 years ago
Proceeding with functionality to ignore unidentified records for 2.0.4.

@cw10025 - Your request is kind of vague.  Could you provide a sample callback 
interface that you would like to see if you're still interested?

Original comment by kevin.s...@gmail.com on 2 Feb 2013 at 5:04

GoogleCodeExporter commented 9 years ago
Thanks Kevin; we got it working by:
  * Calling setIgnoreUnexpectedRecords(true) on the Camel BeanIO Data Format;
  * Suppressing the SLF4J WARN logs for the BeanIODataFormat class.

Looking forward to 2.0.4 where we'll potentially be able to remove those 
workarounds.

Original comment by d...@daveboden.com on 22 Feb 2013 at 1:42

GoogleCodeExporter commented 9 years ago

Original comment by kevin.s...@gmail.com on 6 Mar 2013 at 3:13

GoogleCodeExporter commented 9 years ago
I've upgraded to BeanIO 2.0.4 but am still seeing WARN-level messages, for example:

[main] WARN org.apache.camel.dataformat.beanio.BeanIODataFormat - BeanIO: 
Unexpected record 'header' at line 56642: 498108      Alt Code        ABCDEFG

Please consider reopening this issue. Thanks.

Original comment by d...@daveboden.com on 12 Mar 2013 at 12:53

GoogleCodeExporter commented 9 years ago
Can you provide a mapping file and sample input that reproduces the issue?  
Note that an unexpected record exception is typically a record out of order and 
is not the same thing as an unidentified record.
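
The distinction can be sketched with a simplified model. This is not BeanIO's actual matching algorithm; the RecordClassifier class and the header/detail/trailer layout are assumptions made for illustration. A record whose type matches no declared record is unidentified, while a record of a known type that arrives out of order, such as a header after detail rows, is unexpected.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the two cases; not BeanIO's implementation. */
final class RecordClassifier {
    enum Outcome { OK, UNIDENTIFIED, UNEXPECTED }

    // Hypothetical layout: record types in the order they may appear.
    private static final List<String> ORDER = Arrays.asList("header", "detail", "trailer");

    private int position = 0; // index in ORDER of the latest accepted type

    Outcome classify(String recordType) {
        int index = ORDER.indexOf(recordType);
        if (index < 0) {
            return Outcome.UNIDENTIFIED; // matches no declared record
        }
        if (index < position) {
            return Outcome.UNEXPECTED; // known type, but out of order
        }
        position = index;
        return Outcome.OK;
    }

    public static void main(String[] args) {
        RecordClassifier classifier = new RecordClassifier();
        for (String type : Arrays.asList("header", "detail", "footnote", "header")) {
            System.out.println(type + " -> " + classifier.classify(type));
        }
    }
}
```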

Thanks,
Kevin

Original comment by kevin.s...@gmail.com on 12 Mar 2013 at 3:05