RMLio / dataio

Library to handle data input and output for e.g. RML processors
MIT License
2 stars 1 forks source link

CSVRecord failing when field/header is missing #5

Open enriquedelpino opened 1 week ago

enriquedelpino commented 1 week ago

Hi all,

Us in Graph.Build have been users of your project for quite a long time already. We are working on a use case, where a mapping transformation for a CSV file expects to find a determinate field (defined by a column header), but in practicality, we cannot assume all the CSV files being transformed are complete in definition. They might have a column missing, as this column might be optional.

In these scenarios, we would like to carry on with the transformation even if it's partial, but under the current code, we see the following lines in the get method on CSVRecord, cause the whole transformation to break.

if (!this.data.containsKey(toDatabaseCase)) { throw new IllegalArgumentException(String.format("Mapping for %s not found, expected one of %s", toDatabaseCase, data.keySet())); }

We believe this code should be more resilient, log out the missing field, but still return an empty list, to avoid stopping the transformation running.

` @Override public List get(String reference) { String toDatabaseCase; if (this.data.containsKey(reference.toUpperCase())) { toDatabaseCase = reference.toUpperCase(); } else if (this.data.containsKey(reference.toLowerCase())) { toDatabaseCase = reference.toLowerCase(); } else { toDatabaseCase = reference; } if (!this.data.containsKey(toDatabaseCase)) { logger.warn(String.format("Mapping for %s not found, expected one of %s", toDatabaseCase, data.keySet())); return List.of(); } String obj = this.data.get(toDatabaseCase);

    if (obj == null) return List.of();
    return List.of(obj);
}`

I've been comparing the current definitions of the different Record implementations (JSON, XML, Excel, etc.) to the older class definitions, for instance in your old RML Mapper v 5.0, and it seems this same error was wide spread among the other record types, but it has been addressed in the new dataio project, which leads me to believe, my proposition actually aligns to your general position in the rest of record types when it comes to a missing field on the input file.

What are your thoughts on this?

Kind Regards,
Enrique
ghsnd commented 1 week ago

Hi Enrique,

The fact that different record implementations behave differently when encountering an 'unknown' reference is an issue that we'll address by the next release. Thanks for pointing this out!

The "fuzzy" reference matching is something we'll have a closer look at, because also here consistent behaviour over every record type is desired.

Best regards, Gerald

enriquedelpino commented 3 days ago

Thank you very much for your answer, I appreciate it very much.

I have created a PR that addresses this issue I mentioned above in case you want to consider my contribution in order to address this if that actually helps you in this process.

https://github.com/RMLio/dataio/pull/8

Please, let me know your thoughts on this.