fubar-coder / beanio

Automatically exported from code.google.com/p/beanio
Apache License 2.0

How to handle variable field length in fixed-length file? #82

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Yes, I realize how ridiculous that sounds :) My problem is I have a 
fixed-length file with many optional fields at the end of some records. 
Sometimes a file provider just trims whitespace at the end of a line - so for a 
simple example I may have a record with just a rid and a field:

<field name="recordPrefix" length="3" rid="true" literal="PER" />
<field name="artistName" length="45" />

But instead of getting the line I'd want with whitespace:
"PERMadonna             ...."
I get:
"PERMadonna"

This results in the exception:
org.beanio.InvalidRecordGroupException: Invalid 'workGroup' record group at line 58
 ==> Invalid 'performingArtists' record at line 75
     - Invalid 'artistName':  Invalid field length, expected 45 characters

So I'd like to just process that record as if it were configured:
<field name="artistName" minlength="1" maxlength="45" />
but that seems to (rightly) be disallowed for fixedlength streams.

I'm new to BeanIO, so I haven't explored all the exception handling options in 
detail... but from what I see an exception handler would only allow me to 
continue processing the file - not to successfully process this record. Any 
suggestions? Thanks in advance...

Original issue reported on code.google.com by matthewe...@gmail.com on 17 Jun 2013 at 6:17

GoogleCodeExporter commented 9 years ago
You can try setting length="unbounded".

From the reference guide: "The length of the last field in a fixed length 
record may be set to 'unbounded' to disable padding and allow a single variable 
length field at the end of the otherwise fixed length record."

Original comment by kevin.s...@gmail.com on 17 Jun 2013 at 7:44

GoogleCodeExporter commented 9 years ago
Thanks... yeah, I won't always know which is the last field. I could probably 
just implement a stream reader to pass to the BeanReader constructor so I can 
pad each line on the way in, so the short lines don't raise exceptions to begin 
with. I really should ask the provider(s) to not do that - but thought I'd try 
out some solutions first.

Original comment by matthewe...@gmail.com on 18 Jun 2013 at 1:36
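
The "pad on the way in" idea could be sketched roughly as below. This is only an illustration using the plain JDK, not BeanIO code: the class name LinePaddingReader and the single fixed width are assumptions, and the sketch normalizes every line terminator to "\n". The padded reader would then be handed to BeanIO in place of the original one.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper (not part of BeanIO): right-pads every line of the
// wrapped reader with spaces up to a fixed width before it is consumed.
public class LinePaddingReader extends Reader {
    private final BufferedReader in;
    private final int width;
    private String current = ""; // padded line (plus '\n') currently being served
    private int pos = 0;         // next character of 'current' to hand out

    public LinePaddingReader(Reader source, int width) {
        this.in = new BufferedReader(source);
        this.width = width;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        if (pos >= current.length()) {
            String line = in.readLine();
            if (line == null) {
                return -1; // end of input
            }
            StringBuilder sb = new StringBuilder(line);
            while (sb.length() < width) {
                sb.append(' '); // restore the trimmed trailing whitespace
            }
            sb.append('\n'); // line terminators are normalized to '\n'
            current = sb.toString();
            pos = 0;
        }
        int n = Math.min(len, current.length() - pos);
        current.getChars(pos, pos + n, buf, off);
        pos += n;
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Lines that are already at least as long as the width pass through unchanged, so a width equal to the longest record type is the natural choice when record types differ in length.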

GoogleCodeExporter commented 9 years ago
Why don't you know the last field in the record?  I'm assuming it's because you 
have one or more optional fields...?

In that case, this post might help you -> 
https://groups.google.com/forum/#!searchin/beanio/padding$20a$20record/beanio/Tvf23G1eAr8/-IMO2JrGxpsJ

Let me know if it does, and perhaps I can incorporate that enhancement into the 
code using a 'forceLength' parser setting.

Thanks,
Kevin

Original comment by kevin.s...@gmail.com on 19 Jun 2013 at 1:44

GoogleCodeExporter commented 9 years ago
That's exactly right... several optional fields. The link you gave looks 
exactly like what I need - I was actually wondering if I could extend a BeanIO 
reader or parser to handle this, so that seems to answer my question. I think 
that will probably work - although I have a multi-schema file with about 10 
record types of different lengths, so I'll need to either define a separate 
recordLength property for each record type or calculate the record length 
based on the field definitions in the config. I'll put together a quick PoC 
defining each recordLength separately and then determine which one to use 
within the pad() method based on the record text. 

Thanks so much for your help. 

Original comment by matthewe...@gmail.com on 19 Jun 2013 at 10:06

GoogleCodeExporter commented 9 years ago
Extending the parser won't get you any access to the record declarations, so 
you might have to just pad for the worst case scenario (i.e. the longest 
record).

I'll investigate for 2.1.x whether I could add a new record attribute that 
would enable this sort of thing, since you're not the first to ask.

Original comment by kevin.s...@gmail.com on 20 Jun 2013 at 2:36

GoogleCodeExporter commented 9 years ago
That worked perfectly... for now I just defined a Hash in my custom parser with 
the expected lengths of each record type - then I look up the appropriate one 
within the pad method. All my record ids are just the first 3 chars of the line 
so that's not too bad. Would be convenient in this case to have access to the 
record declarations but the parser shouldn't have knowledge of that anyway so 
that's fine. 

Thanks so much for your help. If you put in a new record attribute in a future 
release we would definitely use it. Very impressed with BeanIO so far, and I 
think we're going to use it pretty extensively going forward. If so I'll follow 
up to find out about becoming a contributor (probably more likely I'll have one 
of my senior devs be the contributor) if you could use more committers on the 
project. 

Original comment by matthewe...@gmail.com on 21 Jun 2013 at 6:32
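
The per-record-type lookup described above might look something like the following Java sketch. It is an illustration under stated assumptions rather than the poster's actual code: the class name RecordPadder and its methods are hypothetical, and the length 48 for "PER" is just 3 + 45 from the simplified example earlier in the thread. How pad() is wired into a custom BeanIO parser is not shown, since that depends on the approach in the linked post.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: pads a line to the expected length of its record type,
// where the record type is identified by the first few characters of the line.
public class RecordPadder {
    private final Map<String, Integer> expectedLengths = new HashMap<>();
    private final int idLength;

    public RecordPadder(int idLength) {
        this.idLength = idLength;
    }

    // Registers the full fixed length of one record type.
    public RecordPadder register(String recordId, int length) {
        expectedLengths.put(recordId, length);
        return this;
    }

    // Returns the line right-padded with spaces to its record type's length.
    // Unknown record ids and lines that are already long enough are returned as-is.
    public String pad(String line) {
        if (line.length() < idLength) {
            return line;
        }
        Integer expected = expectedLengths.get(line.substring(0, idLength));
        if (expected == null || line.length() >= expected) {
            return line;
        }
        StringBuilder sb = new StringBuilder(expected).append(line);
        while (sb.length() < expected) {
            sb.append(' ');
        }
        return sb.toString();
    }
}
```

For example, new RecordPadder(3).register("PER", 48).pad("PERMadonna") returns the original text followed by 38 spaces, while a line with an unregistered prefix is left untouched so that BeanIO can still report it as invalid.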

GoogleCodeExporter commented 9 years ago
Added the 'lenientPadding' field attribute for release 2.1.0.M2.

Sample usage:
<stream name="s" format="fixedlength" strict="true">
    <record name="record" class="org.beanio.beans.Bean">
        <field name="field1" length="3" />
        <field name="field2" length="3" minOccurs="0" lenientPadding="true" />
        <field name="field3" length="3" minOccurs="0" lenientPadding="true" />
    </record>
</stream>

Allows "aaabb", "aaabb c" and "aa ", but not "aa".

Snapshot JAR attached.

Original comment by kevin.s...@gmail.com on 15 Jul 2013 at 2:04

GoogleCodeExporter commented 9 years ago

Original comment by kevin.s...@gmail.com on 16 Nov 2013 at 5:29