GSA / data.gov

Main repository for the data.gov service
https://data.gov

Fail to harvest IIS WAF #4827

Closed FuhuXia closed 3 weeks ago

FuhuXia commented 1 month ago

Harvesting fails for the source https://hazards.fema.gov/filedownload/metadata/. This is related to https://github.com/ckan/ckanext-spatial/issues/319. We need to examine the IIS parser and make it more inclusive.

How to reproduce

Harvest an IIS WAF that contains subfolders, like this:

Tuesday, June 30, 2015  3:41 PM        <dir> mydir
Tuesday, June 30, 2015  3:30 PM        13867 one.xml

Expected behavior

The harvester traverses into mydir and harvests the files under the subfolder.

Actual behavior

The harvester ignores the files under mydir.
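
To illustrate what "more inclusive" could mean for the listing above, here is a minimal sketch of a line parser that accepts both file entries and `<dir>` entries. It is not the ckanext-spatial parser: the `LISTING_LINE` pattern and the `parse_listing` helper are hypothetical names, and the sketch assumes the plain-text rendering shown in this issue rather than the raw HTML that IIS serves.

```python
import re

# Hypothetical pattern for the plain-text listing shown in this issue:
# "<weekday>, <month> <day>, <year>  <time> <AM|PM>  <size or <dir>> <name>"
LISTING_LINE = re.compile(
    r"^(?P<date>\w+, \w+ \d{1,2}, \d{4})\s+"
    r"(?P<time>\d{1,2}:\d{2} [AP]M)\s+"
    r"(?P<size><dir>|\d+)\s+"
    r"(?P<name>.+?)\s*$"
)


def parse_listing(text):
    """Return (name, is_dir) for each line that looks like a listing entry."""
    entries = []
    for line in text.splitlines():
        match = LISTING_LINE.match(line.strip())
        if match:
            # A "<dir>" in the size column marks a subfolder to traverse.
            is_dir = match.group("size") == "<dir>"
            entries.append((match.group("name"), is_dir))
    return entries


sample = """\
Tuesday, June 30, 2015  3:41 PM        <dir> mydir
Tuesday, June 30, 2015  3:30 PM        13867 one.xml
"""
entries = parse_listing(sample)
```

With entries flagged this way, a harvester could queue every `is_dir` entry for another listing request instead of dropping it.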

FuhuXia commented 1 month ago

A PR has been submitted upstream: https://github.com/ckan/ckanext-spatial/pull/337

However, we want to stay on the current release version and cherry-pick this fix, so we should use our fork in the catalog repo's requirements.in.

FuhuXia commented 4 weeks ago

We still have an issue with the harvest source https://hazards.fema.gov/filedownload/metadata/: the harvester does not traverse into folders. It turns out the harvester expects a relative path for each folder. For example, for a WAF like this:

Sunday, April 14, 2024  7:55 PM        <dir> R01
Sunday, April 14, 2024  7:55 PM        <dir> R02

The harvester expects R01/ and R02/, but this IIS WAF uses full paths: /filedownload/metadata/R01/ and /filedownload/metadata/R02/. The harvester is designed to ignore any path starting with /, as in this code.

Nginx and Apache servers are fine. We need to research whether IIS uses full paths by default or whether this is a custom setting on this particular IIS server, and then come up with a fix accordingly.
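
One possible direction, sketched here under assumptions and not the fix that was eventually merged: resolve every folder link against the WAF base URL and follow it only if the result lies strictly below the base folder. That would accept both R01/ and /filedownload/metadata/R01/ while still skipping parent-directory and site-root links. The helper name `resolve_subfolder` is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def resolve_subfolder(base_url, href):
    """Return the absolute URL of a subfolder link, or None if it should be skipped."""
    if not base_url.endswith("/"):
        base_url += "/"
    # urljoin handles both relative ("R01/") and root-relative ("/a/b/R01/") links.
    candidate = urljoin(base_url, href)
    base, cand = urlparse(base_url), urlparse(candidate)
    same_host = (base.scheme, base.netloc) == (cand.scheme, cand.netloc)
    # Only follow links strictly below the base folder, never the base itself
    # or anything above it (for example a "[To Parent Directory]" link).
    below_base = cand.path.startswith(base.path) and cand.path != base.path
    if same_host and below_base and cand.path.endswith("/"):
        return candidate
    return None


base = "https://hazards.fema.gov/filedownload/metadata/"
relative = resolve_subfolder(base, "R01/")
full_path = resolve_subfolder(base, "/filedownload/metadata/R01/")
parent = resolve_subfolder(base, "/filedownload/")
```

Here `relative` and `full_path` resolve to the same URL, and `parent` is None, so the Nginx and Apache behavior would be unchanged.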

FuhuXia commented 3 weeks ago

A further fix has been made to handle IIS folder URLs.