cernopendata / opendata.cern.ch

Source code for the CERN Open Data portal
http://opendata.cern.ch/
GNU General Public License v2.0

Author list for the CMS 2011 release #848

Closed katilp closed 8 years ago

katilp commented 8 years ago

The author list for the 2011 release is in preparation and will be approved at the CB on the 11th Dec.

pherterich commented 8 years ago

@jalavik do you think you can do some MARCXML magic on that again mid December?

jalavik commented 8 years ago

@pherterich Sure! Just say the word. I assume it is about converting the authorlist XML to MARCXML, correct?

pherterich commented 8 years ago

@jalavik yes, same business as last year

katilp commented 8 years ago

@pherterich For the author list records in general, would it be possible to have a downloadable PDF or something similar attached to the record? Looking at the record http://opendata.cern.ch/record/450, one can only go through all 50 pages one page at a time, which is not very practical.

pherterich commented 8 years ago

I'm pretty sure we can attach xml/csv/pdf files there. We can discuss the best solution with @tiborsimko next week.

salmele commented 8 years ago

Would there be a dynamic option to update the record itself (and therefore produce on the fly updated xml/csv) as we learn more ORCID IDs of the authors themselves?

salmele commented 8 years ago

Also, while we are at it, how are we going to display the existence of an ORCID ID next to an author name on the page? We've done something discreet, e.g. http://repo.scoap3.org/record/12614, with a clickable logo next to the names for which we have an ORCID ID.

tiborsimko commented 8 years ago

@salmele Yes, it's possible. Currently the record metadata update process goes via source code repository, since the content is mostly "static". (So it's possible, but somewhat less curator-friendly and more programmer-friendly.) We can also amend formats to display ORCIDs.

katilp commented 8 years ago

The author list is now being prepared. This is not the final list (list_authors1113.txt), but maybe you can confirm whether this kind of format is OK for you? Only the name and the affiliation are relevant.

pherterich commented 8 years ago

@jalavik can you give it a look and a first run through the conversion script and let us know about the outcome? Thanks!

katilp commented 8 years ago

The final list is list_authors1113_complete_with_doubles.txt. It has all authors, but also some duplicates. The first part of the file has all 2011-2013 authors ordered by country and institute, but the special author list from our Higgs publication is then appended to the end of the file, so many authors appear twice. Is it difficult for you to order it properly and remove the duplicates?
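
Removing the duplicates while keeping the country/institute order could be sketched roughly as follows. This is only an illustration, assuming the file has already been parsed into (name, affiliation) pairs; the function name and the placeholder entries are hypothetical and not taken from any existing conversion script. Matching on names alone would also merge two different authors who happen to share a name, so the result would still need a manual check.

```python
def drop_duplicate_authors(entries):
    """Return entries with repeated author names removed, keeping the first occurrence.

    `entries` is assumed to be an iterable of (name, affiliation) pairs in file order.
    """
    seen = set()
    unique = []
    for name, affiliation in entries:
        # Normalise whitespace and case so trivially different spellings compare equal.
        key = " ".join(name.split()).casefold()
        if key in seen:
            continue
        seen.add(key)
        unique.append((name, affiliation))
    return unique


# Placeholder data: the last entry mimics an author repeated at the end of the file.
entries = [
    ("First Author", "INSTITUTE-A"),
    ("Second Author", "INSTITUTE-B"),
    ("first  author", "INSTITUTE-A"),
]
unique_entries = drop_duplicate_authors(entries)
```

Because the first occurrence wins, the ordering of the first part of the file is preserved and only the repeated entries appended at the end are dropped.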

jalavik commented 8 years ago

@pherterich @katilp I see that this format is a bit different than the standard collaboration authorlist XML format (http://inspirehep.net/info/HepNames/tools/authors_xml/index). Any chance it is possible to provide the authorlist format, like in #402? If not, we can create a new script to convert the data if this is gonna be the way forward. Thanks!

katilp commented 8 years ago

@jalavik We have had some changes in the personnel taking care of this in CMS, and that may have caused some changes, my apologies. In any case, we now have the list without duplicates and special characters, hopefully in a better format: authorslist12011_2013.txt

jalavik commented 8 years ago

Thanks @katilp. Unfortunately, after a second look I do notice some potential issues with the format and data provided:

See example from file:

ASRT-ENHEP
Yasser Assran, Sherif Elgammal, Ali Ellithi Kamel, Alaa Metwaly Kuotb Awad, Mohammed Mahmoud, Amr Radi, Nady Bakhet, Shaaban Khalil, Ayman Aly, Adel Awad, Ayman Mahrous, 

ATHENS
Loukas Gouskos, Apostolos Panagiotou, Niki Saoulidou, Efstathios Stiliaris, Theodoros Mertzimekis, 

ATOMKI
Noemi Beni, Sandor Czellar, János Karancsi, Jozsef Molnar, Jozsef Palinkas, Zoltan Szillasi, Andras Fenyvesi, 

We can convert this to something like:

<datafield tag="700" ind1=" " ind2=" ">
  <subfield code="a">Gouskos, Loukas</subfield>
  <subfield code="u">ATHENS</subfield>
</datafield>
...

To me it feels like a little step back from the last one, but I'll let content curators comment on that:

  <datafield tag="700" ind1=" " ind2=" ">
    <subfield code="a">Khachatryan, Vardan</subfield>
    <subfield code="u">Yerevan Phys. Inst.</subfield>
    <subfield code="i">INSPIRE-00314584</subfield>
    <subfield code="h">CCID-667227</subfield>
  </datafield>

https://github.com/cernopendata/opendata.cern.ch/blob/9c852f51b1335f6dc11b18e404cf1ab0ac25ca85/invenio_opendata/testsuite/data/cms/cms-author-list.xml

Let me know if we should continue along this path.
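
For illustration, the conversion from the plain-text blocks quoted above to the short MARCXML form could look roughly like this. It is a minimal sketch and not the actual conversion script: the function names are hypothetical, and it assumes each block is an affiliation line followed by comma-separated "First Last" names. The name inversion is deliberately naive (the last word is taken as the surname), so multi-word surnames would come out wrong.

```python
import re
import xml.etree.ElementTree as ET


def parse_author_blocks(text):
    """Yield (full_name, affiliation) pairs from blank-line separated blocks."""
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if len(lines) < 2:
            continue  # no names under this affiliation
        affiliation = lines[0]
        for name in " ".join(lines[1:]).split(","):
            name = name.strip()
            if name:  # skip the empty item left by the trailing comma
                yield name, affiliation


def invert_name(full_name):
    """Naive 'First Last' -> 'Last, First'; inaccurate for multi-word surnames."""
    first, _, last = full_name.rpartition(" ")
    return f"{last}, {first}" if first else last


def to_datafield(full_name, affiliation):
    """Build a MARCXML 700 datafield with name (a) and affiliation (u) subfields."""
    field = ET.Element("datafield", {"tag": "700", "ind1": " ", "ind2": " "})
    ET.SubElement(field, "subfield", {"code": "a"}).text = invert_name(full_name)
    ET.SubElement(field, "subfield", {"code": "u"}).text = affiliation
    return field


sample = """ATHENS
Loukas Gouskos, Apostolos Panagiotou,

ATOMKI
Noemi Beni, János Karancsi,
"""

fields = [to_datafield(name, aff) for name, aff in parse_author_blocks(sample)]
marcxml = [ET.tostring(field, encoding="unicode") for field in fields]
```

Note that nothing in this input carries the INSPIRE or CCID identifiers, so the `i` and `h` subfields cannot be produced from the text file alone.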

suenjedt commented 8 years ago

Thanks @katilp. The arXiv submissions are usually handled that way and come with such an author XML file (as Jan shows). I am not sure with whom you have already been in touch about that, apologies. Those persons might be able to pull that info from the CMS databases, though. Otherwise, we will try to process this here.

katilp commented 8 years ago

@suenjedt Were the inspire and ccid fields provided by us (CMS) last time or was it something you were able to add?

pherterich commented 8 years ago

@jalavik was that done by the script? I think so, but you might know better...

katilp commented 8 years ago

Here is the new list from Gilles and comments: newauthorslist12011_2013.txt

the special characters are a bit funny (see Karancsi below)
    ------------      display in unicode is working fine
the affiliations are a bit short (unless there is a mapping somewhere?)
   ------------       i added ( institute id - spireICN ) but what for italy, spireICNa, b or c ?
the names are not structured so lastname, firstname conversions would be inaccurate.
----------------    done

Let me know if this is OK

Update for the experts: the same, with spireICNa where it exists (Italy): newauthorslist12011_2013(1).txt

suenjedt commented 8 years ago

@katilp it occurred to me during the night that you mentioned yesterday that a HEPnote will be submitted to arXiv. I presume that submission will have the same author list as the one discussed here? If so, both should follow the authors XML "standard", and it would be good to double-check the generation of the author XML with the "submitter" on the CMS side. Or have I misunderstood this?

katilp commented 8 years ago

We need a record for this (similar to http://opendatadev.cern.ch/record/450). Both author list records should then also have the author list attached as a PDF (this applies to the old list as well). @AnxhelaDani ?

pherterich commented 8 years ago

@jalavik can you give the latest file a look and just let @AnxhelaDani know what comes out of it so she has something to start working with? Thanks! @tiborsimko is the PDF something you can easily create or is that something we should play around with on our side?

suenjedt commented 8 years ago

Thanks all!

jalavik commented 8 years ago

Here is the initial authorxml conversion from the last text file. I merged the author information from the 2010 list and updated the affiliations according to the new file. Authors not found in the 2010 list were added at the bottom.

Please verify that the number of authors is correct etc.

authorlist2011_2013_marcxml.xml.zip
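
The merge described above can be pictured with the following sketch. It is not the code that produced the attached file: the record layout (plain dicts with `name`, `affiliation` and an identifier key), the function name and the `YEREVAN` short code are assumptions made for illustration, and dropping old authors who are absent from the new list is likewise an assumption.

```python
def merge_author_lists(old_authors, new_authors):
    """Keep identifiers from the old list, take affiliations from the new one.

    Authors present only in the new list are appended at the bottom.
    """
    def key(name):
        return " ".join(name.split()).casefold()

    new_by_name = {key(author["name"]): author for author in new_authors}
    merged = []
    matched = set()
    for old in old_authors:
        k = key(old["name"])
        new = new_by_name.get(k)
        if new is None:
            continue  # assumption: authors missing from the new list are dropped
        merged.append({**old, "affiliation": new["affiliation"]})
        matched.add(k)
    # Authors not found in the old list go to the bottom, in new-list order.
    merged.extend(a for a in new_authors if key(a["name"]) not in matched)
    return merged


old_list = [
    {"name": "Khachatryan, Vardan", "affiliation": "Yerevan Phys. Inst.",
     "inspire_id": "INSPIRE-00314584"},
]
new_list = [
    {"name": "Khachatryan, Vardan", "affiliation": "YEREVAN"},  # hypothetical short code
    {"name": "Gouskos, Loukas", "affiliation": "ATHENS"},
]
merged = merge_author_lists(old_list, new_list)
```

Comparing `len(merged)` with the number of names in the text file is one quick way to check the author count.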

AnxhelaDani commented 8 years ago

Thank you @jalavik !

katilp commented 8 years ago

@pamfilos For the pdf, margins to the top and the bottom of the pages would be needed. Thanks!