BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License

Terrible performance with large extensions data files #38

Closed niconoe closed 9 years ago

niconoe commented 9 years ago

With GBIF Downloads, looping over the archive is incredibly slow when there's a large verbatim.txt data file in addition to the main file. This continues even if we truncate the main occurrence.txt file to 10 records or so.

The reason is easy to identify: there's a design problem in CoreRow's constructor, where an _EmbeddedCSV instance is created for each CoreRow. Creating an _EmbeddedCSV is pretty expensive (mainly because of its _line_offsets attribute), so it should be done only once per archive.
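
To illustrate the idea, here is a minimal sketch of the "build once, share with every row" pattern. It is not the library's actual code: `LineIndexedCSV`, `Row`, and `Archive` are hypothetical stand-ins for `_EmbeddedCSV`, `CoreRow`, and the archive reader, and the assumption is that the expensive step is a full scan of the extension file to record line offsets.

```python
class LineIndexedCSV:
    """Hypothetical stand-in for an expensive-to-build indexed data file."""

    build_count = 0  # how many times the line index has been built

    def __init__(self, text):
        LineIndexedCSV.build_count += 1
        self._text = text
        # Expensive step: scan the whole file to record where each line starts.
        self._line_offsets = []
        offset = 0
        for line in text.splitlines(keepends=True):
            self._line_offsets.append(offset)
            offset += len(line)

    def line(self, i):
        """Return line i without its trailing newline, using the offsets."""
        start = self._line_offsets[i]
        if i + 1 < len(self._line_offsets):
            end = self._line_offsets[i + 1]
        else:
            end = len(self._text)
        return self._text[start:end].rstrip("\r\n")


class Row:
    """Hypothetical core row: receives the shared index, never builds one."""

    def __init__(self, core_id, extension):
        self.core_id = core_id
        self._extension = extension


class Archive:
    """Hypothetical archive: builds the extension index once, up front."""

    def __init__(self, core_ids, extension_text):
        self._core_ids = core_ids
        self._extension = LineIndexedCSV(extension_text)  # built only once

    def __iter__(self):
        for core_id in self._core_ids:
            # Each row gets a reference to the same index object.
            yield Row(core_id, self._extension)
```

With this layout, the cost of indexing the extension file is paid once when the archive is opened, and iterating over rows only passes a reference around, regardless of how many core rows there are.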