biocaddie / prototype_issues

Used to report and track bioCADDIE prototype issues

GigaDB authors appearing oddly in DataMed #261

Open only1chunts opened 7 years ago

only1chunts commented 7 years ago

While checking fixes for another ticket ( #260 ), I noticed that the Creators (what we call authors) are a bit messed up in some of our datasets. For example: https://datamed.org/display-item.php?repository=0038&id=5823aa4f5152c679aa0a8a26

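image: https://cloud.githubusercontent.com/assets/6037145/25329081/3b83580a-28d2-11e7-8838-984fe1f3ee3a.png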

In datacite (https://data.datacite.org/10.5524/100223 ) these are correct, so where are you pulling the creator information from? The XML for creators in datacite looks like this:

<creators>
  <creator>
    <creatorName>Eric Earl</creatorName>
  </creator>
  <creator>
    <creatorName>Damion V Demeter</creatorName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-1931-3587</nameIdentifier>
  </creator>
  <creator>
    <creatorName>Kate Mills</creatorName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-6463-186X</nameIdentifier>
  </creator>
  <creator>
    <creatorName>Glad Mihai</creatorName>
  </creator>
  <creator>
    <creatorName>Luka Ruzic</creatorName>
  </creator>
  <creator>
    <creatorName>Nick Ketz</creatorName>
  </creator>
  <creator>
    <creatorName>Andrew Reineberg</creatorName>
  </creator>
  <creator>
    <creatorName>Marianne C Reddan</creatorName>
  </creator>
  <creator>
    <creatorName>Anne-Lise Goddings</creatorName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-4779-8956</nameIdentifier>
  </creator>
  <creator>
    <creatorName>Javier Gonzalez-Castillo</creatorName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0002-6520-5125</nameIdentifier>
  </creator>
  <creator>
    <creatorName>Krzysztof J Gorgolewski</creatorName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-3321-7583</nameIdentifier>
  </creator>
</creators>
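
For anyone comparing the two sources, here is a minimal sketch of how the creator names and ORCIDs could be read out of a `<creators>` element like the one above. It assumes the element has no XML namespace (the full DataCite record may declare one, in which case the tag lookups would need a namespace prefix), and `extract_creators` is a hypothetical helper name, not part of any DataMed or DataCite code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <creators> element quoted above.
# Assumption: no XML namespace on the tags.
CREATORS_XML = """
<creators>
  <creator>
    <creatorName>Eric Earl</creatorName>
  </creator>
  <creator>
    <creatorName>Damion V Demeter</creatorName>
    <nameIdentifier nameIdentifierScheme="ORCID" schemeURI="http://orcid.org/">0000-0003-1931-3587</nameIdentifier>
  </creator>
</creators>
"""


def extract_creators(xml_text):
    """Return a list of (name, orcid_or_None) tuples, in document order."""
    root = ET.fromstring(xml_text.strip())
    creators = []
    for creator in root.findall("creator"):
        name = creator.findtext("creatorName")
        orcid = None
        # A creator may carry zero or more identifiers; keep the ORCID one.
        for ident in creator.findall("nameIdentifier"):
            if ident.get("nameIdentifierScheme") == "ORCID":
                orcid = ident.text
        creators.append((name, orcid))
    return creators


if __name__ == "__main__":
    for name, orcid in extract_creators(CREATORS_XML):
        print(name, orcid or "(no ORCID)")
```

Note that `creatorName` here is a single string, so any split into first and last name has to happen downstream.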
bozyurt commented 7 years ago

Using the following datacite API URL

http://api.datacite.org/dats?publisher-id=bl.bgi

The data returned (converted to JSON from the returned XML without any transformation) has a different structure for the creators tag than the structure you have on the web site. I sampled this example from the DataCite API around 10:30 am PST today.

"attributes": {
  "identifiers": [{ "identifier": "doi:10.6084/M9.FIGSHARE.C.3750728_D19", "identifier-source": "DataCite" }],
  "title": "Additional file 8: Figure S2. of Small RNA sequencing reveals a role for sugarcane miRNAs and their targets in response to Sporisorium scitamineum infection",
  "types": [{"information": {"value": { "id": "dataset", "title": "Dataset", "updated_at": "2016-09-21T00:00:00Z" }}}],
  "creators": [
    { "first-name": null, "last-name": null },
    { "first-name": null, "last-name": null },
    { "first-name": "Ning", "last-name": "Huang" },
    { "first-name": "Feng", "last-name": "Liu" },
    { "first-name": null, "last-name": null },
    { "first-name": null, "last-name": null },
    { "first-name": "Waqar", "last-name": "Ahmad" },
    { "first-name": null, "last-name": null },
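
To make the pattern in that sample easier to see, the following is a rough sketch that separates the named creators from the entries where both name fields are null. The data is only the eight creator entries visible in the excerpt above (the original is cut off after them), and `split_creators` is a hypothetical helper name used for illustration.

```python
import json

# The eight "creators" entries visible in the excerpt above;
# the real response continues beyond this point.
SAMPLE = json.loads("""
{"creators": [
  {"first-name": null, "last-name": null},
  {"first-name": null, "last-name": null},
  {"first-name": "Ning", "last-name": "Huang"},
  {"first-name": "Feng", "last-name": "Liu"},
  {"first-name": null, "last-name": null},
  {"first-name": null, "last-name": null},
  {"first-name": "Waqar", "last-name": "Ahmad"},
  {"first-name": null, "last-name": null}
]}
""")


def split_creators(attributes):
    """Return (named, empty_positions) for a record's "creators" list.

    named is a list of "First Last" strings; empty_positions holds the
    indexes of entries whose first-name and last-name are both null.
    """
    named, empty_positions = [], []
    for index, creator in enumerate(attributes.get("creators", [])):
        first = creator.get("first-name")
        last = creator.get("last-name")
        if first is None and last is None:
            empty_positions.append(index)
        else:
            named.append(" ".join(part for part in (first, last) if part))
    return named, empty_positions


if __name__ == "__main__":
    named, empty_positions = split_creators(SAMPLE)
    print("named creators:", named)
    print("null entries at positions:", empty_positions)
```

In this excerpt five of the eight entries are empty, which would explain blank or odd-looking author names if they are displayed as-is.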

Burak

only1chunts commented 7 years ago

Hi @bozyurt, I'm not sure where you got that API URL, but it's not valid: you can put anything, or even nothing, in place of bl.bgi and get exactly the same results returned (which have nothing to do with GigaDB/BGI). Perhaps DataCite have updated their API since you last used it? I don't know what info you need returned, but it looks like replacing publisher-id with data-center-id might work, e.g. https://api.datacite.org/dats?data-center-id=bl.bgi

The returned data does indeed seem to have null values in some of the author names. It also doesn't return a large proportion of our datasets, so clearly there is a problem with the DataCite API. I will chase them from my side as a member; can you contact them about it from your side as a user too? That way we should get a fairly swift response. Thanks.

Update - It appears the default is to return only 25 rows; using https://api.datacite.org/dats?data-center-id=bl.bgi&rows=1000 returns all our datasets (I think).

Update 2 - Looking more closely, the term "dats" doesn't appear in their API documentation, so perhaps it is an outdated term? Anyway, if you replace it with "works" it returns everything correctly: https://api.datacite.org/works?data-center-id=bl.bgi&rows=1000

So unless you think there is still a problem with DataCite (let me know), I'll not contact them.
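
The URL variants tried in this comment differ only in the resource name ("dats" vs "works") and the query parameters, so they can be sketched as one small builder. This only constructs the URLs and does not call the API; the endpoint names and the `data-center-id` and `rows` parameters are taken from the URLs quoted in this thread and may not match the current DataCite API, and `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL as quoted in this thread (assumption: unchanged since then).
BASE = "https://api.datacite.org"


def build_query_url(resource, data_center_id, rows=None):
    """Build a query URL for one data center.

    resource is the path segment, e.g. "dats" or "works".
    rows is optional; per the comment above, the default page size
    appeared to be 25 when it was left out.
    """
    params = {"data-center-id": data_center_id}
    if rows is not None:
        params["rows"] = rows
    return f"{BASE}/{resource}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_query_url("dats", "bl.bgi"))
    print(build_query_url("works", "bl.bgi", rows=1000))
```

Keeping the row limit explicit makes it harder to mistake a truncated first page for the full set of datasets.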

bozyurt commented 7 years ago

Reindexed using the revised transformation script

http://biocaddie.scicrunch.io/datacite_bgi_20170426/dataset

naturalbeau commented 7 years ago

@bozyurt I did not see the new index in the endpoint. Can you double-check?

bozyurt commented 7 years ago

I have deleted it since the change was not reflected in it. I will send an email once I have the correct index.

bozyurt commented 7 years ago

Reindexed using the revised transformation script. Now the dataRepository subdocument points to GigaDB.

The authors appearing oddly is an artifact of the DataCite API and can only be fixed at DataCite.

http://biocaddie.scicrunch.io/datacite_bgi_20170426/dataset

Burak

naturalbeau commented 7 years ago

DataMed has been updated using this new index.

only1chunts commented 7 years ago

Sorry, I'm not quite sure I follow the last few posts. To be clear: there IS or is NOT a problem with something at DataCite? If there IS, please can you explain exactly what that problem is so I can chase them on it? Thanks.

jgrethe commented 7 years ago

Hi Chris,

We have contacted Martin @DataCite as there are two issues we have seen:

1) 500 errors from their DATS services
2) Some of the authors contain NULL, so we need to see if they are supposed to be NULL or if something is wrong.

Cheers, Jeff
