BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

InvalidArchive: The descriptor references a non-existent field (index=17) #88

Closed MattStata closed 2 years ago

MattStata commented 3 years ago

Hello, I have downloaded a number of simple archives from GBIF, mostly per-genus, and I'm trying to parse them with a Python script. Most of them open and parse properly but some throw an InvalidArchive error, referencing index 17. This occurs at the following point in my script:

    with DwCAReader(genus + '.zip') as dwca:
        for core_row in dwca:

I've tried downloading these again, I've tried unzipping them (which works fine) and rezipping as zip or tgz, and the same error still persists. Can you please suggest anything that I might do to fix this?

niconoe commented 3 years ago

Thanks for the report, @MattStata!

Oh, that looks like a bug (either in python-dwca-reader or at GBIF).

Could you send me such a problematic file (for example by posting the GBIF download link here)? I'll investigate ASAP.

MattStata commented 3 years ago

Sure, so, two that fail are based on the "simple" archive download option for these two genera:

https://www.gbif.org/occurrence/search?taxon_key=2708034
https://www.gbif.org/occurrence/search?taxon_key=2704744

I was able to fix this issue myself by having my main script output the row number where it failed, and having another script that simply drops the bad line (just reading the CSV as text). I then just repeat until no rows throwing this exception are present anymore. Doing this I was able to fix the Fimbristylis archive, which only had a small number of bad rows, but the Sporobolus archive had LOTS of bad rows and I eventually ended up just cutting the end of the file off altogether, since this one was less critical for what I was trying to do.
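The drop-the-bad-rows workaround described above can be sketched roughly as follows. This is not the script used in this thread: it is a minimal illustration that assumes a bad row can be recognized by parsing each line on its own with Python's `csv` module and comparing its field count with the header's. The function name `split_rows` and the sample data are hypothetical.

```python
import csv


def split_rows(lines, delimiter="\t"):
    """Return (kept_lines, dropped_line_numbers).

    A line is dropped when csv.reader, with its default quote handling,
    splits it into a different number of fields than the header has.
    """
    lines = list(lines)
    # The header is split as plain text to get the expected field count.
    expected = len(lines[0].rstrip("\r\n").split(delimiter))
    kept, dropped = [lines[0]], []
    for number, line in enumerate(lines[1:], start=2):
        # Parse each line in isolation so one bad row cannot swallow the next.
        fields = next(csv.reader([line], delimiter=delimiter))
        if len(fields) == expected:
            kept.append(line)
        else:
            dropped.append(number)
    return kept, dropped


# Hypothetical sample: line 3 has a stray double quote at the start of a field.
sample = [
    "id\tgenus\tlocality\tyear\n",
    "1\tFimbristylis\tRoadside\t2015\n",
    "2\tSporobolus\t\"Near river\t2018\n",
    "3\tSporobolus\tDune\t2019\n",
]
kept, dropped = split_rows(sample)
print(dropped)
```

Parsing line by line means a single pass is enough, instead of re-running the reader after each failure. The dropped records are still lost, so this only suits cases where a few missing rows are acceptable.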

When I open the bad rows in Excel, it looks like tab characters aren't being recognized as delimiters, potentially because of some odd character that occurs before the tab. In the offending rows there always seems to be a cell where the splitting stops, and all the remaining text that should be spread across the rest of the row ends up in that single cell, tabs included. Hope this makes sense; it's a bit odd to describe.

niconoe commented 3 years ago

Thanks!

I'll investigate deeper later, but I think a more robust approach would be to download the data as Darwin Core Archives rather than using the "simple" option provided by GBIF. python-dwca-reader is specifically built to work with full Darwin Core Archives. The Darwin Core Archive specifications also allow "simple archives" (which should work here too, but a bug is always possible), but I'm not sure how compatible GBIF's simple downloads are with those simple archive specifications.

Sorry for the confusion, those standards aren't as clearly defined as I'd like. Don't hesitate to tell me if it works better with the Darwin Core Archive download!

MattStata commented 3 years ago

The only reason I didn't download the larger Darwin Core Archive files was that I had so many to download and those took quite a bit longer, and the simple archives seemed to contain everything I needed. I'm simply trying to parse all of these to identify recent herbarium collections of key species in these groups and sort them by institution so that I can see which institutions have the best collections to request loans/samples from. I think that my approach of dropping the problem rows that I developed after my initial post is probably fine for my purposes, but hopefully the info I've provided here can help you figure out why this error is occurring and improve things! I will make an attempt with one of the problem ones as a Darwin Core Archive when I have some time later and let you know what happens, also.

niconoe commented 2 years ago

I've finally had a proper look, and it appears to be a complex/messy case caused by double-quote characters appearing randomly in data fields. Because it's a simple CSV file (as opposed to a proper DwC-A), we have to "sniff" the details of the file using Python facilities. For some reason Python concludes that the double quote is used to quote fields, so the corresponding line is not split properly.

I'm not sure there's a better solution TBH, those sniffing methods are always imperfect, but it's the best we can do with that kind of input.
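To illustrate the quoting problem in isolation, here is a small sketch using only Python's standard `csv` module. It does not reproduce python-dwca-reader's sniffing logic, which isn't shown in this thread; the sample data is made up, and passing `quoting=csv.QUOTE_NONE` is shown only as a way to see the difference, on the assumption that the file never uses quotes to wrap fields.

```python
import csv
import io

# Hypothetical tab-delimited data: the second line has a stray double quote
# at the start of the "locality" field and no closing quote.
data = (
    "id\tlocality\tcollector\tyear\n"
    "1\t\"Near the old bridge\tA. Smith\t2015\n"
    "2\tDune\tB. Jones\t2019\n"
)

# Default behaviour: '"' opens a quoted field, so the tabs (and even the
# newline) that follow are treated as part of that field.
default_rows = list(csv.reader(io.StringIO(data), delimiter="\t"))

# With QUOTE_NONE the double quote is an ordinary character and every
# tab is a delimiter again.
literal_rows = list(
    csv.reader(io.StringIO(data), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print([len(row) for row in default_rows])
print([len(row) for row in literal_rows])
```

In the default case the row with the stray quote ends up with too few fields, which is consistent with a descriptor pointing at a field index that the row doesn't have. Whether disabling quoting is safe depends on the file: it would break input that does use quotes to protect embedded delimiters.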