Hi @niconoe,
While using this great Python package, I ran into a problem when reading some DWC-A files with duplicated fields. I know this is a formatting error in those archives, but I would still like to be able to read them.
Just so you know, here is an example of one of these archives with duplicated fields: Allan Herbarium (CHR).
The solution I found was to adapt the pandas `_maybe_dedup_names` method from pandas/pandas/io/parsers.py to rename the duplicated field names.
The `pandas.read_csv` method accepts a `names` argument and, as stated in the documentation, the `names` parameter must not contain duplicated values:
"Duplicates in this list are not allowed." (pandas documentation)
So the adapted `_maybe_dedup_names` method just checks for duplicates and renames them by appending a sequential number ("X.1, X.2, ... X.N"), which is what pandas itself does when the parameter `mangle_dupe_cols` is set to `True` and no `names` argument is passed to `read_csv`.
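To make the renaming rule concrete, here is a minimal standalone sketch of the idea. It is not the code from pandas nor the exact code in this change: the function name `dedup_names` is hypothetical, and the handling of a suffixed name that already exists in the list (for example a real field called "X.1") is my own assumption.

```python
def dedup_names(names):
    """Return a copy of `names` where repeated entries get a ".N" suffix.

    Hypothetical helper for illustration only; it follows the
    "X, X.1, X.2, ..." pattern described above.
    """
    counts = {}    # last suffix number tried for each original name
    used = set()   # names already emitted, to avoid creating new clashes
    deduped = []
    for name in names:
        candidate = name
        n = counts.get(name, 0)
        # Keep bumping the suffix until the candidate is unique.
        while candidate in used:
            n += 1
            candidate = f"{name}.{n}"
        counts[name] = n
        used.add(candidate)
        deduped.append(candidate)
    return deduped


fields = ["id", "locality", "locality", "locality"]
print(dedup_names(fields))  # ['id', 'locality', 'locality.1', 'locality.2']
```

The first occurrence keeps its original name, so archives without duplicated fields are left untouched.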
However, since python-dwca-reader ignores the `names` kwarg and uses the field names from the DWC-A meta file instead, it throws an exception when those names (i.e. the DWC-A fields) are not unique:
Traceback (most recent call last):
  File "dwca-reader.py", line 39, in <module>
    ext_df = dwca.pd_read(e.file_descriptor.file_location, parse_dates = False, mangle_dupe_cols = True)
  File "/home/jose/.local/lib/python3.6/site-packages/dwca/read.py", line 198, in pd_read
    df = read_csv(self.absolute_temporary_path(relative_path), **kwargs)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 685, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 454, in _read
    _validate_names(kwds.get("names", None))
  File "/home/jose/.local/lib/python3.6/site-packages/pandas/io/parsers.py", line 421, in _validate_names
    raise ValueError("Duplicate names are not allowed.")
ValueError: Duplicate names are not allowed.
Maybe a better solution would be to merge the duplicated fields, but for now `mangle_dupe_cols = False` is not supported by pandas, and the fix is considered complex to implement in pandas (please see pandas issue 13262). Within the scope of this package, though, a merge solution might be easier to implement.
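As a starting point for discussion, a merge could look roughly like the sketch below. Everything in it is an assumption on my side: the function name `merge_duplicate_fields`, the choice to join distinct non-empty values with a separator, and the plain list-of-rows input (no pandas involved). Other merge rules, such as keeping only the first non-empty value, would be just as plausible.

```python
def merge_duplicate_fields(names, rows, sep=" | "):
    """Collapse columns that share a field name into a single column.

    Hypothetical sketch: distinct non-empty values found in the
    duplicated columns of a row are joined with `sep`.
    """
    # Map each unique name to the column indexes where it appears,
    # keeping the order in which names are first seen.
    positions = {}
    for index, name in enumerate(names):
        positions.setdefault(name, []).append(index)

    merged_rows = []
    for row in rows:
        merged = []
        for indexes in positions.values():
            values = []
            for i in indexes:
                value = row[i]
                # Skip empty cells and values already collected.
                if value != "" and value not in values:
                    values.append(value)
            merged.append(sep.join(values))
        merged_rows.append(merged)
    return list(positions), merged_rows


names = ["id", "locality", "locality"]
rows = [
    ["1", "Lincoln", ""],
    ["2", "Lincoln", "Canterbury"],
    ["3", "", "Otago"],
]
merged_names, merged_rows = merge_duplicate_fields(names, rows)
print(merged_names)  # ['id', 'locality']
print(merged_rows)   # [['1', 'Lincoln'], ['2', 'Lincoln | Canterbury'], ['3', 'Otago']]
```

Unlike renaming, this changes the data itself, so the merge rule would need to be agreed on before it goes into the package.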
Please let me know if I can help develop a merging solution, if you agree that it would be better than just renaming the duplicated fields.
Thanks.
Best regards.