CatalogueOfLife / data

Repository for COL content

Authorship strings truncated to 100 characters #145

Open aoern opened 4 years ago

aoern commented 4 years ago

Hi @yroskov and @gdower! This is an old issue. It still seems to be relevant in CoL+, because the new Species Fungorum Plus update did not help.

In the Jun 2020 edition there are 321 long authorship strings that are truncated to 100 characters. The top five source datasets are:

  1. Species Fungorum Plus: 184
  2. Collembola.org: 47
  3. Microsporidia: 24
  4. World Plants: 23
  5. ReptileDB: 21

The longest one I found belongs to Aspergillus hongkongensis (208 characters):

C.C. Tsang, T.W.S. Hui, K.C. Lee, J.H.K. Chen, A.H.Y. Ngan, E.W.T. Tam, J.F.W. Chan, A.L. Wu, M. Cheung, B.P.H. Tse, A.K.L. Wu, C.K.C. Lai, D.N.C. Tsang, T.L. Que, C.W. Lam, K.Y. Yuen, S.K.P. Lau & P.C.Y. Woo
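A count like the one above could be reproduced with a check along these lines. This is only a minimal sketch: the record layout (dicts with `dataset` and `authorship` keys) and the function names are assumptions for illustration, not the actual CoL export format. A length of exactly 100 characters is also only a hint of truncation, so every hit still needs a manual look.

```python
from collections import Counter

LIMIT = 100  # truncation length reported in this issue


def find_truncated(records, limit=LIMIT):
    """Return records whose authorship is exactly `limit` characters long.

    `records` is a hypothetical list of dicts with "dataset" and
    "authorship" keys. An exact-length match is a heuristic only: a
    genuine authorship can happen to be 100 characters long as well.
    """
    return [r for r in records if len(r["authorship"]) == limit]


def count_by_dataset(records):
    """Count suspect records per source dataset, most affected first."""
    return Counter(r["dataset"] for r in records).most_common()
```

Calling `count_by_dataset(find_truncated(records))` would then give the per-dataset totals, which can be compared against the figures listed above.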

Where does the 100 char limit come from? In my opinion, the limit should be removed in CoL+.

I have assembled a list of required fixes to repair the strings. If applicable, I can deliver it to you for manual fixing.
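Such a list of fixes could also be applied mechanically instead of by hand. The sketch below assumes a hypothetical layout: each record has an `id` and an `authorship` key, and `fixes` maps a record id to the full authorship string. None of these names come from the actual CoL tooling.

```python
def apply_fixes(records, fixes):
    """Return new records with truncated authorships replaced by full ones.

    `fixes` is a hypothetical mapping of record id -> full authorship.
    A fix is applied only if the full string starts with the stored
    (truncated) one, so an entry assigned to the wrong record is skipped.
    """
    repaired = []
    for rec in records:
        full = fixes.get(rec["id"])
        if full is not None and full.startswith(rec["authorship"]):
            # Copy the record so the input list is left untouched.
            rec = {**rec, "authorship": full}
        repaired.append(rec)
    return repaired
```

The prefix check is a deliberate safety net: a fix that does not extend the existing string is more likely a mismatch than a repair, so the record is left unchanged.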

mdoering commented 4 years ago

There is no limit in CoL+; it must be present in the sources already. All text in the Postgres database can be as long as it likes; there is never any limit.

mdoering commented 4 years ago

And of course, for data that has not been updated, the truncated strings have already been copied from the previous MySQL db into the ACEF files.

yroskov commented 4 years ago

Authorstring truncation is a problem of the converter from Assembly_Global to production database.

Example from Species Fungorum Plus looks OK in the clearinghouse:

image

Well, the portion "Kew Mycology 2020" appended to the authorstring is a known bug in CoL+.

aoern commented 4 years ago

Hi @yroskov and @gdower!

I don't quite understand why issue #148 and this one are regarded as duplicates. They look totally unrelated to me, except for the same kind of symptoms.

  1. Truncation to 100 characters: The statement by Yury above, "Authorstring truncation is a problem of the converter from Assembly_Global to production database", makes sense to me. That would explain the whole thing. The statement by Geoff, "The data truncation issue is really a problem with the old infrastructure which produces the DarwinCore archive that you are using", does not seem correct to me. The truncation is seen in the public UI as well, and I guess the UI does not use DwC files. Furthermore, these truncated authorships were inherited from CoL to CoL+.

  2. Truncation to 50 characters: OK, re-importing will fix this one.

gdower commented 4 years ago

The legacy databases and converter that are causing the truncation issues provide data to both the legacy web portal and the legacy DarwinCore exporter. Once we re-import the data into the new infrastructure, and the new web portal and new DarwinCore exporter are released, the truncation issue should be fixed. I put this on the annual edition project to-do list with the aim of updating the affected databases.