ckan / ckanext-spatial

Geospatial extension for CKAN
http://docs.ckan.org/projects/ckanext-spatial

WAF Harvester parsing issues #309

Open benjwadams opened 1 year ago

benjwadams commented 1 year ago

WAF harvesting can fail to parse numerous listings that are de facto WAFs, such as this one: https://gcoos4.tamu.edu/erddap/metadata/iso19115/xml/

Because the harvester looks explicitly for "a href", any link that doesn't follow exactly that string ordering (for example, an anchor tag with another attribute before "href") will fail to harvest. Is there any reason why a proper XML parsing library isn't used to find links, instead of the current string-matching approach, which has known pitfalls when parsing XML?
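To illustrate the kind of link extraction meant here, below is a minimal sketch using only Python's standard-library `html.parser`. It is not the harvester's code: `LinkCollector` and `extract_xml_links` are hypothetical names, and filtering on a `.xml` suffix is an assumption about what a WAF listing should yield.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, whatever the attribute order or case."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the listing URL.
                self.links.append(urljoin(self.base_url, value))


def extract_xml_links(html, base_url):
    """Return absolute URLs of all links in `html` that end in .xml (assumed filter)."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [link for link in parser.links if link.lower().endswith(".xml")]
```

Because the parser works on tags and attributes rather than on a literal string, an anchor such as `<A class="x" HREF="record.xml">` is picked up the same way as `<a href="record.xml">`.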

Also, for the above link, the "apache" parser is selected because of the "Server" header, even though this is clearly not an Apache directory listing but a reverse-proxied application. This was difficult to track down when I had to create custom logic for the "other" parser to account for some of the shortcomings of the WAF parser mentioned above.
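A rough sketch of why header-only selection misfires, and one possible guard. Both functions are hypothetical and are not taken from the extension. The `Index of` title check is only a heuristic, based on the page title that Apache's autoindex module normally emits.

```python
def pick_scraper_from_header(server_header):
    """Hypothetical header-only selection, mirroring the behaviour described above."""
    if server_header and "apache" in server_header.lower():
        return "apache"
    return "other"


def pick_scraper(server_header, body):
    """Hypothetical alternative: trust the header only if the body also
    looks like an Apache autoindex page (heuristic title check)."""
    looks_like_autoindex = "<title>Index of" in body
    if looks_like_autoindex and pick_scraper_from_header(server_header) == "apache":
        return "apache"
    return "other"
```

With a check like this, an application sitting behind an Apache reverse proxy would fall through to the "other" parser instead of being treated as an Apache listing.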

amercader commented 1 year ago

@benjwadams you are right that the parser used in the WAF harvester is very brittle. Any improvements on that front would be a great contribution.