VertNet / gulo

Shredding Darwin Core Archives with ferocity, strength, and Cascalog.

line breaks and/or quotes in field causing queries to fail #59

Closed: robinkraft closed this issue 11 years ago

robinkraft commented 11 years ago

The fm_birds... data (and possibly others) is causing stats queries to fail. The number of fields per record is wrong: instead of 196, it comes out as 186, 11, 7, or the like:

https://gist.github.com/robinkraft/5448436

My hunch is that it has to do with the quotes and/or EOL character in the new, possibly duplicated ?rights field (now called ?rights-extra per this comment):

"Copyright © 2012 The Field Museum of Natural History
Full details may be found at http://fieldmuseum.org/about/copyright-information"

Yes, that's a \n line break in the middle of the field. And the quotes appear in the plaintext harvested data.
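To illustrate why an embedded line break can throw off the field count, here is a minimal Python sketch. It is not gulo code: it assumes a tab-delimited record, and the three-field layout and the `record-1` and `Aves` values are made up for the example. Splitting on newlines first breaks the quoted field in two, while a quote-aware parser keeps the record whole.

```python
import csv
import io

# The rights value quoted above, with its embedded "\n".
RIGHTS = (
    "Copyright © 2012 The Field Museum of Natural History\n"
    "Full details may be found at "
    "http://fieldmuseum.org/about/copyright-information"
)

# Hypothetical three-field, tab-delimited record (real records have 196 fields).
record = "\t".join(["record-1", '"' + RIGHTS + '"', "Aves"]) + "\n"

# Naive parsing: split into lines first, then split each line on tabs.
# The embedded "\n" turns one record into two short "records".
naive_counts = [len(line.split("\t")) for line in record.splitlines()]

# Quote-aware parsing: the csv module treats the quoted field as one value,
# so the line break stays inside the field and the record stays intact.
quoted_rows = list(csv.reader(io.StringIO(record, newline=""), delimiter="\t"))
quoted_counts = [len(row) for row in quoted_rows]

print(naive_counts)   # field counts per "line" under naive parsing
print(quoted_counts)  # field counts per record under quote-aware parsing
```

Under these assumptions the naive split reports two records with too few fields each, which matches the kind of short field counts described above.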

robinkraft commented 11 years ago

Pull request #61 now handles this by stripping line breaks while parsing EML files.
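For readers who want to see the general idea, here is a rough Python sketch of stripping line breaks from a field value. It is an illustration only, not the code from the pull request; the function name `strip_linebreaks` and the choice to replace each break with a single space are assumptions.

```python
import re


def strip_linebreaks(value: str) -> str:
    """Replace each run of CR/LF characters, plus any whitespace around it,
    with a single space, then trim the ends."""
    return re.sub(r"\s*[\r\n]+\s*", " ", value).strip()


rights = (
    "Copyright © 2012 The Field Museum of Natural History\n"
    "Full details may be found at "
    "http://fieldmuseum.org/about/copyright-information"
)

cleaned = strip_linebreaks(rights)
print(cleaned)
```

Replacing the break with a space rather than deleting it keeps the words on either side from running together.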