fiduswriter / fiduswriter-citation-api-import

A Fidus Writer plugin for importing citations from external sources via API
GNU Affero General Public License v3.0

Adding the Zenon-database as source #9

Open nmueller18 opened 3 years ago

nmueller18 commented 3 years ago

I would like to see the Zenon-Database added to the possible sources. Taking especially the pubmed- and CrossRef-importers as template, this should not be too difficult. Each Zenon-entry is identified by a unique identifier, and this entry is accessible via a BibTeX-entry. I have modified the files citation_api_import/index.js and citation_api_import/templates.js accordingly and added a new file, citation_api_import/zenon.js. However, at the moment I am struggling with how to parse the records. An example output could look like this:

<div>
    <a href="/Record/001219271" class="title getFull" data-view="full">
        Die Nutzung baltischen Feuersteins an der Schwelle zur Bronzezeit, Krise oder Konjunktur der Feuersteinverarbeitung?
    </a>
</div>

<div>
    von
    <a href="  /Author/Home?author=Rassmann%2C+Knut.">Rassmann, Knut.</a>
    <br/>

    Veröffentlicht in
    <a href="/Record/000644412">
        Bericht der Römisch-Germanischen Kommission, 81 (2000)
    </a>
    <br/>

    2000.
    <br/>

    Umfang/Format: 5-36 : Abb. Taf.
    <br/>
</div>

Because it is possible that more than one Zenon id is included in a reference (if this is part of another referenced item), the querySelector needs to cater for this possibility. Would something like this work: const zenonid = el.querySelector('input[<a href="/Record/" class="]').value? Then the rest of the record needs to be parsed to get the three components Author, Title and Published. This should be possible as there are lots of <br/>s and <a>s. But I do not know how to modify the code snippet const descriptionParts = el.innerHTML.split('<br>\n')[1].split(/ <b>\(|\)<\/b>\. /g). Why, for example, is the string split twice?

johanneswilm commented 3 years ago

I would like to see the Zenon-Database added to the possible sources. Taking especially the pubmed- and CrossRef-importers as template, this should not be too difficult. Each Zenon-entry is identified by a unique identifier, and this entry is accessible via a BibTeX-entry.

I agree, this should not be too difficult to achieve.

[...]

Because it is possible that more than one Zenon id is included in a reference (if this is part of another referenced item), the querySelector needs to cater for this possibility. Would something like this work: const zenonid = el.querySelector('input[<a href="/Record/" class="]').value?

The querySelector needs to receive a valid CSS selector. Briefly looking at the source code here, it looks like there are three links in any record:

<a href="/Record/000644412">...</a>
<a href=" /Author/Home?author=Rassmann%2C+Knut.">...</a>
<a href="/Record/001219271" class="title getFull" data-view="full">...</a>

It is the last link we want, right? In that case it is simple, because it can be distinguished by its class attribute like this:

const zenonLink = el.querySelector('a.getFull')

or, even better, to get the id from the link:

const zenonid = parseInt(el.querySelector('a.getFull').getAttribute('href').split('/').pop())
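One thing to watch with parseInt: it turns "001219271" into 1219271, so the leading zeros are lost. If the zero-padded form is needed later (for example to build a /Record/ URL again, which is an assumption based on the sample markup), the id can be kept as a string. The sketch below also separates the entry's own link from the link to the work it was published in. The helper names readRecordLink and readRecordLinks are made up for illustration and are not part of the plugin:

```javascript
// Hypothetical helpers, not part of the plugin.

// Read the id and visible text from a link such as <a href="/Record/001219271">.
function readRecordLink(link) {
    const href = (link.getAttribute('href') || '').trim()
    return {
        // Keep the id as a string: parseInt('001219271') would give 1219271.
        id: href.split('/').pop(),
        // The markup pads the text with whitespace, so collapse it.
        text: link.textContent.replace(/\s+/g, ' ').trim()
    }
}

// Split the /Record/ links of one result into the entry itself (class "getFull")
// and, if present, the work it was published in (no such class).
function readRecordLinks(el) {
    const links = Array.from(el.querySelectorAll('a[href^="/Record/"]'))
    const entry = links.find(a => a.classList.contains('getFull'))
    const container = links.find(a => !a.classList.contains('getFull'))
    return {
        entry: entry ? readRecordLink(entry) : null,
        publishedIn: container ? readRecordLink(container) : null
    }
}
```

For the sample record above, readRecordLinks(el).entry.id would be "001219271" and readRecordLinks(el).publishedIn.id would be "000644412".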

Then the rest of the record needs to be parsed to get the three components Author, Title and Published. This should be possible as there are lots of <br/>s and <a>s. But I do not know how to modify the code snippet const descriptionParts = el.innerHTML.split('<br>\n')[1].split(/ <b>\(|\)<\/b>\. /g). Why, for example, is the string split twice?

This simply has to do with the structure of the HTML used by one of the other citation database sites. The text wrangling will be very specific to each site (and will need to be updated whenever the website changes). In this case, I am guessing we need to fetch the authors from the links leading to author pages. Those links have no special class, so instead we need to filter all links in the entry, for example like this:

const authors = Array.from(el.querySelectorAll('a')).filter(a => a.getAttribute('href').includes('?author=')).map(a => a.innerText)

which will return:

["Rassmann, Knut."]

If we also want the period gone at the end, we could modify it like this:

Array.from(el.querySelectorAll('a')).filter(a => a.getAttribute('href').includes('?author=')).map(a => a.innerText.replace(/\.$/g,''))

which returns:

["Rassmann, Knut"]
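As for why the existing snippet splits twice: the first split separates the lines of the entry at "<br>\n" and keeps the second one, and the second split cuts that line into fields at the bold markup around one of them. The input string below is invented to match the shape the two splits expect and is not real output from any database:

```javascript
// Invented markup: line 1 is a title, line 2 is "journal <b>(year)</b>. pages".
const innerHTML = 'Some title<br>\nSome Journal <b>(2000)</b>. 81: 5-36'

// First split: cut on "<br>\n" and keep the second chunk (the description line).
const descriptionLine = innerHTML.split('<br>\n')[1]

// Second split: cut that line at either " <b>(" or ")</b>. ",
// i.e. at the markup on both sides of the year.
const descriptionParts = descriptionLine.split(/ <b>\(|\)<\/b>\. /g)

console.log(descriptionParts)
// → [ 'Some Journal', '2000', '81: 5-36' ]
```

Since the Zenon markup has no such bold block, the second split would not carry over as it is; the links and <br/> tags are the usable separators there.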