VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

harvest seems to leave \n in content of last field in row #113

Closed by tucotuco 10 years ago

tucotuco commented 10 years ago

The symptom appeared in VN portal downloads (as a broken line following namepublishedinyear) until pull request https://github.com/VertNet/webapp/pull/410. It still appears in record details, where namepublishedinyear shows in the list of fields on the all terms tab, but without an apparent value. The actual value is '\n'.

Suspect the problem might be harvest-fields processing in https://github.com/VertNet/gulo/blob/develop/src/clj/gulo/fields.clj#L64.

robinkraft commented 10 years ago

Here's output for one small resource:

https://www.dropbox.com/s/tmsna3h50ckbn5a/msbobs_mamm.txt

That looks ok to me. The only issue is that the last field is empty, followed by a line break. That's not strictly an issue, since the dwca-indexer should be splitting on line breaks, and then tabs.

I see this at the end of a record:

\tICZN\t\t\t\t\t\t\t\t\t\t\t\t\t\tsex: female\t\n

robinkraft commented 10 years ago

p.s. that's prior to splitting on linebreaks.
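
To make the order of operations concrete, here is a minimal Python sketch of why splitting on line breaks before tabs matters. The record is a shortened, hypothetical stand-in for the row tail shown above, and the variable names are illustrative only; this is not the indexer's actual code.

```python
# Hypothetical, shortened record: the last field is empty and the row
# ends with a line break, as in the tail shown above.
raw = "\tICZN\t\t\tsex: female\t\n"

# Splitting on tabs alone leaves the line break inside the last field.
tabs_only = raw.split("\t")

# Splitting on line breaks first, then on tabs, leaves an empty last field.
rows = [line.split("\t") for line in raw.splitlines()]

print(repr(tabs_only[-1]))  # last field when only tabs are split
print(repr(rows[0][-1]))    # last field when line breaks are split first
```

If a consumer only splits on tabs, the trailing line break ends up as the value of the last field, which would match the '\n' seen for namepublishedinyear.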

tucotuco commented 10 years ago

Gulo is correct. Solution implemented in dwc-indexer by replacing non-printing characters with a space, then trimming.
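
For illustration, a rough Python sketch of that kind of cleanup. It assumes "non-printing" means ASCII control characters (0x00-0x1F and 0x7F); the function name `clean_field` is hypothetical and the actual dwc-indexer change may differ.

```python
import re

# Assumption: "non-printing" is read here as ASCII control characters
# (0x00-0x1F and 0x7F). The real dwc-indexer code may define it differently.
_NON_PRINTING = re.compile(r"[\x00-\x1f\x7f]")


def clean_field(value):
    """Replace each non-printing character with a space, then trim."""
    return _NON_PRINTING.sub(" ", value).strip()


# A field whose only content is a line break becomes an empty string.
print(repr(clean_field("\n")))
# Embedded control characters become single spaces; the ends are trimmed.
print(repr(clean_field("sex:\tfemale\n")))
```

Replacing before trimming means a stray control character in the middle of a value turns into a space instead of gluing two words together, while one at either end disappears.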