NIAID-Data-Ecosystem / nde-crawlers

Harvesting infrastructure to collect and standardize dataset and computational tool metadata
Apache License 2.0

Fix parsing and display of author information in BD2K-LINCS #107

Closed gtsueng closed 10 months ago

gtsueng commented 11 months ago

The display of author information in dataset records from BD2K-LINCS looks bad for at least 1 record.

It appears that the record may need to be split on commas during parsing, as multiple names are appearing in a single field:

  "author": [
    { "name": "Caitlin E. Mills" },
    { "name": "Marc Hafner, Kartik Subramanian" }
  ],
  "date": "2018-10-18",

Please check how frequently "," shows up in the "author.name" field for BD2K-LINCS records so we understand the extent of the problem before addressing it. We don't want to inadvertently create a bunch of records with 'author.name = Jr.' if the majority of commas in the name field come from names that legitimately contain commas rather than from lists of names.

flaneuse commented 11 months ago

@Dylan I think this may be happening because we're concatenating two separate author types together. Can you check whether that's true and whether we need to split the author field?

As @gtsueng notes, though, this could have unintended consequences if the authors are provided as a string.

DylanWelzel commented 11 months ago

Unfortunately, with the LINCS API down (see dataset page and API), there's no way to be sure, but looking at the code, I believe the issue is with the way we split on the comma when there are multiple authors. As it's set up now, we split on a comma + 2 spaces. In the dataset Ginger linked, the delimiter is a comma + 1 space, so the split didn't happen on that string and both names were saved to a single author.name field (lines 54/56 for reference). What I'll do is split on just the comma and strip the surrounding whitespace from each name. We won't know until the website is back up whether that's the actual solution, but I believe it is.
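For illustration, the proposed change can be sketched roughly as follows. This is a minimal sketch, not the crawler's actual code: the function name `split_author_names` is hypothetical, and the "old" behavior is reconstructed only from the description above (a split on a comma followed by two spaces).

```python
from typing import Dict, List


def split_author_names(raw: str) -> List[Dict[str, str]]:
    """Split a comma-delimited author string into one entry per name.

    Hypothetical helper: splits on the bare comma and strips the
    whitespace around each piece, so ", " and ",  " behave the same.
    """
    names = [part.strip() for part in raw.split(",")]
    # Drop empty pieces left behind by trailing or doubled commas.
    return [{"name": name} for name in names if name]


raw = "Marc Hafner, Kartik Subramanian"

# Splitting on a comma + 2 spaces (as described above) leaves the string whole.
print(raw.split(",  "))

# Splitting on the comma alone and stripping whitespace separates the names.
print(split_author_names(raw))
```

Note that this sketch still splits names that legitimately contain a comma (for example a "Jr." suffix), which is why the frequency check requested earlier in the thread matters before rolling it out.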

DylanWelzel commented 11 months ago

Here is the dataset's metadata that Ginger found:

  "author": [
    {
      "name": "Caitlin E. Mills"
    },
    {
      "name": "Marc Hafner, Kartik Subramanian"
    }
  ],

gtsueng commented 11 months ago

@DylanWelzel -- can you see how often a comma appears in the author.name field in our current build/dump of BD2K-LINCS data? This should give us an idea of the extent of the problem (is it a parser issue, or a one-off dataset issue?). To be on the safe side, also check for the frequency of the terms "Jr" or "Sr". I didn't categorize it as a bug at this point because it was unclear to me whether our crawler is not handling the data properly or whether the field has inconsistent values in need of more extensive cleanup.

DylanWelzel commented 11 months ago

Using this query, 55 out of 424 datasets are affected. There are no Jr or Sr terms.
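The query itself is not reproduced in this thread. As a rough offline equivalent, a check like this could be run over a local dump of the records. Everything here is an assumption for illustration: the helper names, the `_id` values, and the sample records (only the author names are taken from the metadata quoted earlier).

```python
import re
from typing import Any, Dict, List

# Matches "Jr" or "Sr" as standalone words, with or without a trailing period.
SUFFIX = re.compile(r"\b(Jr|Sr)\b")


def author_names(record: Dict[str, Any]) -> List[str]:
    """Return all author.name values, tolerating a missing or single author."""
    authors = record.get("author") or []
    if isinstance(authors, dict):
        authors = [authors]
    return [a.get("name", "") for a in authors if isinstance(a, dict)]


def summarize(records: List[Dict[str, Any]]) -> Dict[str, int]:
    """Count records whose author names contain a comma or a Jr/Sr suffix."""
    with_comma = sum(
        1 for r in records if any("," in n for n in author_names(r))
    )
    with_suffix = sum(
        1 for r in records if any(SUFFIX.search(n) for n in author_names(r))
    )
    return {
        "total": len(records),
        "with_comma": with_comma,
        "with_suffix": with_suffix,
    }


# Hypothetical sample records standing in for the real dump.
sample = [
    {"_id": "example-1", "author": [
        {"name": "Caitlin E. Mills"},
        {"name": "Marc Hafner, Kartik Subramanian"},
    ]},
    {"_id": "example-2", "author": [{"name": "Caitlin E. Mills"}]},
    {"_id": "example-3"},
]
print(summarize(sample))
```

Counting per record rather than per name matches how the result above is reported (affected datasets out of the total).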

gtsueng commented 11 months ago

Thanks @DylanWelzel! Based on your query, I can see that splitting on the comma may cause issues for three of the records, which are formatted as "LastName, FirstInitial; LastName, FirstInitial", but it should otherwise work. It looks like there are some records with extra quotation marks around the names, so we may want to check them after any change is implemented to ensure that it worked.

Semi-colon delimited records: LINCS_LDS-1508, LINCS_LDS-1523, LINCS_LDS-1512

~Extra quotation mark records: LINCS_LDS-1586, LINCS_EDS-1016, LINCS_LDS-1585, LINCS_EDS-1015~ (not reproducible)
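One way to cover the semicolon-delimited records is to prefer the semicolon as the delimiter whenever it is present and fall back to the comma otherwise. The sketch below assumes exactly that rule; the function name `split_authors` and the example names are hypothetical, and the quote stripping is purely defensive, since the extra-quotation-mark records were not reproducible.

```python
from typing import Dict, List

QUOTES = "\"'"


def split_authors(raw: str) -> List[Dict[str, str]]:
    """Split an author string, preferring ';' over ',' as the delimiter.

    With "LastName, FirstInitial; LastName, FirstInitial" the commas belong
    to the names, so only the semicolons are treated as separators.
    """
    delimiter = ";" if ";" in raw else ","
    names = []
    for part in raw.split(delimiter):
        # Strip whitespace, then any stray surrounding quotes, then
        # whitespace again in case the quotes wrapped padded text.
        name = part.strip().strip(QUOTES).strip()
        if name:
            names.append({"name": name})
    return names


# Hypothetical names in the semicolon-delimited format described above.
print(split_authors("Smith, J; Doe, A"))

# Comma-delimited list, as in the record that prompted this issue.
print(split_authors("Marc Hafner, Kartik Subramanian"))
```

A single author written as "LastName, FirstInitial" with no semicolon would still be split in two by this rule, so the three listed records are worth rechecking by hand after any change.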

gtsueng commented 11 months ago

An email notifying BD2K-LINCS of the 502 Proxy error on their site (that is hampering our investigations) was sent on 2023.07.26.

Asiyah-NIH commented 11 months ago

The LINCS team is fixing this. I think this should be counted as a "bug" that Scripps will resolve. (Context: per Wilbert's email of 7/25/2023.)

gtsueng commented 11 months ago

On 2023.08.04, the LINCS team sent a notice that the issue has been fixed.

gtsueng commented 10 months ago

As of 2023.08.09, a fix is available but is awaiting deployment and testing in staging.

gtsueng commented 10 months ago

As of 2023.08.14, the fix has been implemented.