clevercanary / hca-atlas-tracker

Apache License 2.0

Determine why certain HCA Data Repository DOIs don't match #302

Open NoopDog opened 4 months ago

NoopDog commented 4 months ago

The HCA Repository wrangler team reviewed their backlog in the tracker and found eight studies that are ingested in the HCA Data Repository but show as ingest TODO in the tracker.

The studies and their HCA IDs are:

| DOI | HCA Project ID |
| --- | --- |
| 10.1038/s41591-018-0096-5 | 453d7ee2-319f-496c-9862-99d397870b63 |
| 10.1038/s41591-019-0733-7 | 9d97f01f-9313-416e-9b07-560f048b2350 |
| 10.1038/s41467-021-24607-6 | 7f351a4c-d24c-4fcd-9040-f79071b097d0 |
| 10.1038/s42255-022-00531-x | daef3fda-2620-45ae-a3f7-1613814a35bf |
| 10.1038/s43018-020-00121-4 | bcdf233f-9246-4c0c-9843-0514120b7e3a |
| 10.1126/science.abg0928 | c211fd49-d980-4ba1-8c6a-c24254a3cb52 |
| 10.1038/s41586-021-04044-7 | 031980e6-9f2b-433a-8f6e-081bd9aad0a3 |
| 10.1177/00220345221147908 | ccc3b786-1da0-427f-a45f-76306d6143b6 |

The tracker entries for these DOIs can be found here.

NoopDog commented 4 months ago

@hunterckx can you see why these DOIs do not seem to match on the HCA Data Repo? Thanks! D

hunterckx commented 4 months ago

Here's what I've found:

NoopDog commented 4 months ago

I wonder if there is whitespace in the database entry causing the mismatch. For different DOIs, is one a preprint and the other the referenced journal? I will check on the project IDs that don't exist.

NoopDog commented 4 months ago

The project IDs that don't exist do indeed seem not to exist. I will check with the EBI team. @hunterckx

hunterckx commented 4 months ago

> I wonder if there is whitespace in the database entry causing the mismatch.

Good call -- I considered that and also noticed that 10.1038/s41591-018-0096-5 had a + on the end in the URL, but didn't make the connection until now. 10.1038/s41591-018-0096-5 seems to be the only one that's coming back from the API with whitespace, though.
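For reference, a minimal sketch of how the comparison could be made robust to this kind of whitespace. It is only an illustration: `normalize_doi` is a hypothetical helper, not a function from the tracker or the HCA API, and the set of prefixes it strips is an assumption. It also shows why the URL had a `+` on the end: in a form-encoded query string, `+` stands for a space.

```python
from urllib.parse import unquote_plus

# Resolver prefixes that are sometimes stored along with the DOI (assumed set).
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw: str) -> str:
    """Return a canonical form of a DOI for comparison (hypothetical helper).

    Trims surrounding whitespace, lowercases (DOIs are case-insensitive),
    and drops a leading resolver prefix if one is present.
    """
    doi = raw.strip().lower()
    for prefix in _PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi


# A trailing "+" in a form-encoded URL decodes to a trailing space.
decoded = unquote_plus("10.1038/s41591-018-0096-5+")

# The raw strings differ, but the normalized forms compare equal.
tracker_doi = "10.1038/s41591-018-0096-5"
print(decoded == tracker_doi)
print(normalize_doi(decoded) == normalize_doi(tracker_doi))
```

Normalizing both sides at comparison time avoids depending on the stored value being clean, although fixing the entry at the source would still be the better long-term option.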

> For different DOIs, is one a preprint and the other the referenced journal?

Looks like they probably are for 10.1126/science.abg0928, but for 10.1038/s41467-021-24607-6, both DOIs are for journal articles (one of which has a third work as a preprint), and the two don't look obviously related other than both being about systemic sclerosis.
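The three causes discussed in this thread (stray whitespace, a different DOI on the repository side, and a project ID that does not exist) could be told apart with a small triage script along these lines. This is a sketch under assumptions: `classify`, `MatchStatus`, and the `example_repo` mapping are hypothetical, the placeholder IDs and the `10.0000/...` DOI are made up, and only the trailing-whitespace entry reflects something actually observed above.

```python
from enum import Enum


class MatchStatus(Enum):
    EXACT = "exact match"
    NORMALIZED = "matches only after trimming whitespace and lowercasing"
    DIFFERENT_DOI = "project found, but with a different DOI"
    PROJECT_MISSING = "project ID not found"


def classify(
    tracker_doi: str,
    project_id: str,
    repo_dois_by_project: dict[str, list[str]],
) -> MatchStatus:
    """Explain why a tracker DOI does or does not match a repository project."""
    repo_dois = repo_dois_by_project.get(project_id)
    if repo_dois is None:
        return MatchStatus.PROJECT_MISSING
    if tracker_doi in repo_dois:
        return MatchStatus.EXACT
    wanted = tracker_doi.strip().lower()
    if any(doi.strip().lower() == wanted for doi in repo_dois):
        return MatchStatus.NORMALIZED
    return MatchStatus.DIFFERENT_DOI


# Illustrative data only: project ID -> DOIs as returned by the repository.
example_repo = {
    # Trailing space, as observed for this DOI in the API response.
    "453d7ee2-319f-496c-9862-99d397870b63": ["10.1038/s41591-018-0096-5 "],
    # Placeholder project whose repository DOI differs from the tracker's.
    "placeholder-project-a": ["10.0000/placeholder-other-doi"],
}

cases = [
    ("10.1038/s41591-018-0096-5", "453d7ee2-319f-496c-9862-99d397870b63"),
    ("10.0000/placeholder-tracker-doi", "placeholder-project-a"),
    ("10.0000/placeholder-tracker-doi", "placeholder-project-b"),
]
for doi, project_id in cases:
    print(doi, project_id, classify(doi, project_id, example_repo).value)
```

Keeping the statuses separate matters because each one needs a different fix: normalization on the tracker side, a decision about which DOI is authoritative, or a follow-up with the team that owns the project IDs.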