BelgianBiodiversityPlatform / python-dwca-reader

🐍 A Python package to read Darwin Core Archive (DwC-A) files.
BSD 3-Clause "New" or "Revised" License
43 stars 21 forks

Add function to check duplicated names in DWCA fields and rename them #81

Open zedomel opened 4 years ago

zedomel commented 4 years ago

Hi @niconoe,

While using this great Python package I ran into a problem reading some DwC-A files that contain duplicated fields. I know this is a formatting error in those archives, but I would still like to be able to read them. For reference, here is an example of one such archive with duplicated fields: Allan Herbarium (CHR).

The solution I found was to adapt the pandas _maybe_dedup_names method from pandas/pandas/io/parsers.py to rename the duplicated field names.

The pandas.read_csv method accepts a names argument and, as stated in the documentation, the names parameter must not contain duplicated values:

Duplicates in this list are not allowed. Documentation

So the adapted _maybe_dedup_names method just checks for duplicates and renames them by appending a sequential number ("X.1, X.2, ... X.N"), which is the naming expected from pandas read_csv when the parameter mangle_dupe_cols is set to True and no names argument is provided.
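A minimal sketch of this kind of renaming, to illustrate the idea. It is not the pandas implementation nor code from python-dwca-reader; the function name dedup_names and the sample field names are hypothetical.

```python
def dedup_names(names):
    """Return a new list in which repeated names get '.1', '.2', ... suffixes.

    Hypothetical helper: the first occurrence keeps its name, later
    occurrences become 'X.1', 'X.2', and so on. A suffix is skipped if
    the resulting name is already taken by another column.
    """
    counts = {}    # highest suffix tried so far for each original name
    used = set()   # names already emitted
    result = []
    for name in names:
        candidate = name
        n = counts.get(name, 0)
        # Keep bumping the suffix until the candidate is unique.
        while candidate in used:
            n += 1
            candidate = f"{name}.{n}"
        counts[name] = n
        used.add(candidate)
        result.append(candidate)
    return result


fields = ["id", "locality", "country", "locality", "locality"]
renamed = dedup_names(fields)
print(renamed)
```

The returned list contains no duplicates, so it satisfies the uniqueness requirement that pandas places on the names argument of read_csv.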

However, since python-dwca-reader ignores any names passed in kwargs and uses the field names from the DwC-A meta file, it throws an exception when those names (i.e. the DwC-A fields) are not unique:

Traceback (most recent call last):
  File "dwca-reader.py", line 39, in <module>
    ext_df = dwca.pd_read(e.file_descriptor.file_location, parse_dates = False, mangle_dupe_cols = True)
  File "/home/jose/.local/lib/python3.6/site-packages/dwca/read.py", line 198, in pd_read
    df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    _validate_names(kwds.get("names", None))
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 421, in _validate_names
    raise ValueError("Duplicate names are not allowed.")
ValueError: Duplicate names are not allowed.

Maybe a better solution would be to merge the duplicated fields, but for now mangle_dupe_cols = False is not supported by pandas, and supporting it is considered complex to implement in pandas (please see pandas issue 13262). For the scope of this package, though, a merge solution may be easier to implement.

Please let me know if I can help develop a merging solution, if you agree that it would be better than just renaming the duplicated fields.

Thanks.

Best regards.

csbrown commented 8 months ago

Probably there should be pytest tests for this?