GSA / ckanext-geodatagov

data.gov extension

transformation error causing USGS records to not get harvested #91

Open amilan17 opened 9 years ago

amilan17 commented 9 years ago

Many USGS records fail to harvest with a transformation error caused by a bug in the FGDC-to-ISO XSLT.

This is one of the CSDGM records causing this error: http://data.usgs.gov/metadata/Mineral_Resources_On-Line_Spatial_Data/535e99ace4b08e65d60f8e2b.xml

This is the error message:

Transformation to ISO failed The transformation service returned an error for object {0}: [409] net.sf.saxon.trans.XPathException: A sequence of more than one item is not allowed as the first argument of normalize-space() ("http://pubs.usgs.gov/of/1997/o...", "http://pubs.usgs.gov/of/1997/o...", ...)

This is where the transform puts the URLs from all networkr elements into ONE CI_OnlineResource/linkage/URL element: https://github.com/GSA/ckanext-geodatagov/blob/master/conversiontool/fgdc2iso/fgdcrse2iso19115-2.xslt#L4308
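
The failure condition can be checked outside the transform. The following is a minimal Python sketch, not part of the harvester, that lists every networka block holding more than one networkr, which is the case where normalize-space() receives a sequence instead of a single item. The function name `multi_url_blocks` and the example.com URLs are hypothetical and used only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down FGDC fragment: only the elements relevant to the
# error are kept, and the URLs are placeholders rather than real USGS links.
FGDC_SNIPPET = """
<digform>
  <digtopt><onlinopt><computer><networka>
    <networkr>http://example.com/data/a.gz</networkr>
    <networkr>http://example.com/data/b.gz</networkr>
  </networka></computer></onlinopt></digtopt>
</digform>
"""


def multi_url_blocks(fgdc_xml):
    """Return the URL list of every networka that has more than one networkr."""
    root = ET.fromstring(fgdc_xml)
    blocks = []
    for networka in root.iter("networka"):
        urls = [(n.text or "").strip() for n in networka.findall("networkr")]
        if len(urls) > 1:
            # More than one URL here is what a single-value XPath function
            # such as normalize-space() cannot accept as its argument.
            blocks.append(urls)
    return blocks


blocks = multi_url_blocks(FGDC_SNIPPET)
```

Running this kind of check over a batch of source records would show how many are affected before any change is made to the XSLT.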

Thanks in advance for looking into this. I'm available for questions or further testing if needed.

kvuppala commented 9 years ago

@amilan17 In FGDC metadata, is it allowed to have multiple URLs under one tag, as in the example below? The ISO transformation appears to expect only one value here. Should the transformation convert each of these links into a separate resource in the CKAN catalog, and apply the same name and description to all of those resources?

<digform>
    <digtinfo>
        <formname>Arc/Info export</formname>
        <formvern>7.x</formvern>
        <formcont> Gridded files for the Alaska composite (akc*) and merged (akm*) aeromagnetic data. New versions of the grids were added to the web site in February 1999. These grids are akc_msat* and akm_msat*.  The new grids contain a regional surface correction based on a satellite magnetic model of the long wavelengths of the Earth's magnetic field (see March 1999 issue of GSA Today for more information).  The original grids contained a questionable long-wavelength trend which caused the NW portion of the grids to be tipped downward (there was also a spurious trend with a different slope in SE Alaska). </formcont>
        <filedec>gzip -d</filedec>
        <transize>8.7</transize>
    </digtinfo>
    <digtopt>
        <onlinopt>
            <computer>
                <networka>
                    <networkr>http://pubs.usgs.gov/of/1997/ofr-97-0520/data/akc_e00.gz</networkr>
                    <networkr>http://pubs.usgs.gov/of/1997/ofr-97-0520/data/akc_msat_e00.gz</networkr>
                    <networkr>http://pubs.usgs.gov/of/1997/ofr-97-0520/data/akm_e00.gz</networkr>
                    <networkr>http://pubs.usgs.gov/of/1997/ofr-97-0520/data/akm_msat_e00.gz</networkr>
                </networka>
            </computer>
        </onlinopt>
    </digtopt>
</digform>

amilan17 commented 9 years ago

@kvuppala Yes, this XML structure is completely valid FGDC XML. I don't like identical names and descriptions for different URLs in the resulting ISO, and technically it's not really an accurate mapping, because those names and descriptions are for the format, not the URL. I think it would be more correct to reuse the URL as the name of the CI_OnlineResource and leave the description field unpopulated.
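
The mapping proposed here can be sketched as follows. This is a rough Python illustration, not the project's XSLT: it assumes the standard ISO 19139 gmd and gco namespaces, and the function name `online_resources` and the example.com URLs are hypothetical. Each networkr becomes its own gmd:CI_OnlineResource, the URL is reused as the name, and no description is written.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces for the gmd and gco prefixes.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"


def online_resources(networka):
    """Build one gmd:CI_OnlineResource element per networkr child."""
    resources = []
    for networkr in networka.findall("networkr"):
        url = (networkr.text or "").strip()
        if not url:
            continue  # skip empty networkr elements
        res = ET.Element(f"{{{GMD}}}CI_OnlineResource")
        linkage = ET.SubElement(res, f"{{{GMD}}}linkage")
        ET.SubElement(linkage, f"{{{GMD}}}URL").text = url
        # Reuse the URL as the resource name, as suggested in the comment.
        name = ET.SubElement(res, f"{{{GMD}}}name")
        ET.SubElement(name, f"{{{GCO}}}CharacterString").text = url
        # gmd:description is deliberately left out: the FGDC format
        # description describes the format, not the individual URL.
        resources.append(res)
    return resources


# Hypothetical input with placeholder URLs.
networka = ET.fromstring(
    "<networka>"
    "<networkr>http://example.com/data/a.gz</networkr>"
    "<networkr>http://example.com/data/b.gz</networkr>"
    "</networka>"
)
resources = online_resources(networka)
```

In the real transform, the equivalent change would be to iterate over the networkr elements in the XSLT rather than passing them all to a single linkage template.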

amilan17 commented 9 years ago

@kvuppala @FuhuXia I think these errors were introduced by this commit: https://github.com/GSA/ckanext-geodatagov/commit/ef33815d57200191e3fa3ee20654ecd721ab517d

kvuppala commented 9 years ago

@amilan17 Thank you. We are looking at this to see how we can accommodate both requirements: harvesting all the links provided in the tags, and feature #86 (the commit above addresses issue #86).

kvuppala commented 9 years ago

This issue is similar to #90.

kvuppala commented 9 years ago

More documentation and a proposed solution (option 2) are available at https://docs.google.com/document/d/1wOHSA2RNwjsgDuqDzFTKifQceTBxxvMBmDssnRLivLA/edit