USEPA / emf

Emissions Modeling Framework (EMF)

handle line breaks in description when exporting files #131

Open cseppan opened 1 year ago

cseppan commented 1 year ago

Exported a dataset with a line break in the description. The exported file starts with the lines below.

#Unit-level control for the emissions units in the cement, glass, iron and steel industries, as well as the boilers and engines.
For 2026 Transport Rule#EXPORT_DATE=Thu Apr 06 12:57:49 EDT 2023
#EXPORT_VERSION_NAME=Change reductions at 3986511
#EXPORT_VERSION_NUMBER=2
...

Trying to import this into another EMF system gave the error:

Exception: Number of columns in the column header doesn't match the file format (expected:33 but was:1). Hint: correct header typos or set "Dataset Type" keyword, EXPORT_COLUMN_LABEL, to false if there is no column header

Probably should just remove any line breaks in the description when exporting datasets.

ddelvecchio commented 6 months ago

Made changes to the GenericExporter class to strip newline characters from within a header item.

cseppan commented 3 months ago

I tested this updated code using a dummy dataset with a description like this:

#FORMAT=FF10_POINT
abc
#COUNTRY=US
#YEAR     2030

The exported output file looks like this:

#FORMAT=FF10_POINT ab#COUNTRY=US
#YEAR     2030

Looking through the revised code:

    protected void writeHeaders(PrintWriter writer, Dataset dataset, DataFormatFactory localDataFormatFactory) throws SQLException {
        String header = dataset.getDescription();
        String cr = System.getProperty("line.separator");

        if (header != null && !header.trim().isEmpty()) {
            StringTokenizer st = new StringTokenizer(header, "#");
            String lasttoken = "";
            while (st.hasMoreTokens()) {
                lasttoken = st.nextToken();
                if (!(StringUtils.isNotBlank(lasttoken) && lasttoken.substring(0, lasttoken.length() - 2).contains(cr))) {
                    writer.print("#" + lasttoken);
                } else {
                    writer.print("#" + lasttoken.substring(0, lasttoken.length() - 2).replace(cr, " ") + lasttoken.substring(lasttoken.length() - 1, lasttoken.length() - 1));
                }
            }

            if (lasttoken.indexOf(cr) < 0)
                writer.print(cr);
        }

        printExportInfo(writer, localDataFormatFactory);
    }

It looks like the updated code expects line breaks to be two characters long, which is the case when the server runs on Windows (CR+LF) but not on Linux or macOS (LF only).

lasttoken.substring(0, lasttoken.length() - 2)

Also, this expression appears to be trying to output the trailing line break, but it always returns a zero-length string because endIndex equals beginIndex.

lasttoken.substring(lasttoken.length() - 1, lasttoken.length() - 1)
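A small standalone sketch of both substring behaviors, using the first token from my test description with LF-only line breaks. The class name `SubstringDemo` and the helper `lastChar` are hypothetical and not part of the EMF code.

```java
final class SubstringDemo {
    /** Returns the last character of s as a string, or "" if s is empty. */
    static String lastChar(String s) {
        return s.isEmpty() ? "" : s.substring(s.length() - 1);
    }

    public static void main(String[] args) {
        // First "#"-delimited token of the test description, LF line breaks only.
        String token = "FORMAT=FF10_POINT\nabc\n";
        int n = token.length();

        // Equal begin and end indexes always yield an empty string.
        System.out.println(token.substring(n - 1, n - 1).isEmpty());

        // The one-argument form returns the trailing character instead.
        System.out.println(lastChar(token).equals("\n"));

        // Dropping two characters assumes CR+LF; with a bare LF the 'c' is lost too.
        System.out.println(token.substring(0, n - 2).endsWith("ab"));
    }
}
```

All three checks should print `true`, which matches the truncated `ab` in the exported output above.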

One other potential issue: I'm not sure if line breaks are converted to a consistent value before being stored in the database. For instance, if someone sets a dataset description on Windows, do the line breaks get saved as CR+LF? If so, working with the system's line separator property wouldn't be enough.
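One way to sidestep the platform question might be to normalize every line break style before touching the header, rather than relying on `line.separator`. Below is a minimal sketch of that idea, not a drop-in replacement for `writeHeaders`. The class and method names are hypothetical, and it assumes every header item starts with `#`.

```java
final class HeaderLineBreaks {
    /** Converts CR+LF and bare CR line breaks to LF. */
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /**
     * Joins any line that does not start with '#' onto the preceding line,
     * separated by a single space. Assumes every header item starts with '#'.
     */
    static String joinContinuationLines(String description) {
        StringBuilder out = new StringBuilder();
        for (String line : normalize(description).split("\n")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            if (out.length() > 0) {
                // A '#' line starts a new header item; anything else continues the previous one.
                out.append(line.startsWith("#") ? "\n" : " ");
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Mixed CR+LF and LF breaks, as might happen with descriptions saved from different clients.
        String description = "#FORMAT=FF10_POINT\r\nabc\r\n#COUNTRY=US\n#YEAR     2030";
        System.out.println(joinContinuationLines(description));
    }
}
```

The result uses LF between header items. The exporter could still write each item with whatever separator it normally uses.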

cseppan commented 3 months ago

Apart from the line ending handling, another issue is the column header row, which shouldn't get modified. If the dataset description looked like this:

# DESC some text
and more text
"country_cd","region_cd","tribal_code"

Ideally it would be output as

# DESC some text and more text
"country_cd","region_cd","tribal_code"

This seems like something only a user would be able to accurately identify. For now, I'm going to revert commit 525cb15 for the v4.3 release.
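For reference, here is a rough sketch of what an automated guess could look like. It treats any line starting with a double quote as a column header row, which is purely an assumption on my part: a continuation line that happens to begin with a quote would be misidentified, and an unquoted column header row would be merged into the description. The class and method names are hypothetical.

```java
final class ColumnHeaderAwareJoin {
    /** Heuristic (assumption): a column header row starts with a double quote. */
    static boolean looksLikeColumnHeader(String line) {
        return line.startsWith("\"");
    }

    /** Joins continuation lines, leaving '#' lines and suspected column header rows alone. */
    static String join(String description) {
        StringBuilder out = new StringBuilder();
        // \R matches any line break sequence (CR+LF, LF, or CR) in Java 8 and later.
        for (String line : description.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            boolean startsNewLine = line.startsWith("#") || looksLikeColumnHeader(line);
            if (out.length() > 0) {
                out.append(startsNewLine ? "\n" : " ");
            }
            out.append(line);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String description = "# DESC some text\nand more text\n"
                + "\"country_cd\",\"region_cd\",\"tribal_code\"";
        System.out.println(join(description));
    }
}
```

This handles the example above, but since it depends on how the column header row happens to be quoted, it doesn't change the conclusion that the user is in the best position to identify that row.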