DOAJ / doaj

The Directory of Open Access Journals - website and directory software
Apache License 2.0

OJS export for files 'journal-1.xml' produces Date added to DOAJ error #160

Closed dommitchell closed 10 years ago

dommitchell commented 10 years ago

I have feedback from the user about this issue of the Date added to DOAJ. I think my next step is to ascertain whether this is an OJS DOAJ plugin bug or a DOAJ XML import bug. I have a call with OJS tomorrow so would appreciate your thoughts before then.


I uploaded the JSI issue on March 21 2014 (including this article, 'Is social, cultural and recreational participation a luxury for people living in poverty? An analysis of policy intentions and measures', as mentioned) to add the 10 articles to the latest issue, published March 19 2014. All ten articles in the ToC link you sent me [http://doaj.org/toc/673edd8db38949e38a1a523f981f0861] are from that March 2014 issue; they cannot have been added to DOAJ in 2010, because they did not exist at the time.

In the DOAJ exportfile these are the details for this particular article:

<record>
  <language>eng</language>
  <publisher>Utrecht University of Applied Sciences</publisher>
  <journalTitle>Journal of Social Intervention: Theory and Practice</journalTitle>
  <issn>1876-8830</issn>
  <eissn>1876-8830</eissn>
  <publicationDate>2014-03-19</publicationDate>
  <volume>23</volume>
  <issue>1</issue>
  <startPage>53</startPage>
  <endPage>71</endPage>
  <publisherRecordId>329</publisherRecordId>
  <title language="eng">Is social, cultural and recreational participation a luxury for people living in poverty? An analysis of policy intentions and measures</title>
  <authors>
    <author><name>Lode Vermeersch</name><email>lode.vermeersch@kuleuven.be</email></author>
    <author><name>Anneloes Vandenbroucke</name><email>lode.vermeersch@kuleuven.be</email></author>
  </authors>
  <fullTextUrl format="html">http://www.journalsi.org/index.php/si/article/view/395</fullTextUrl>
  <keywords>
    <keyword>Social participation</keyword>
    <keyword>cultural participation</keyword>
    <keyword>poverty</keyword>
    <keyword>exclusion</keyword>
    <keyword>social policy</keyword>
  </keywords>
</record>

Clearly there is something wrong with the 'date added to DOAJ'.

Journals whose export file [from the OJS plugin] has a different number in its name, such as Studium ('journal-18.xml'), do not seem to have this problem; there the date added to DOAJ is correct: http://doaj.org/search?source={"query":%20{"bool":%20{"must":%20[{"term":%20{"_type":%20"article"}},%20{"query_string":%20{"query":%20"Christiaan%20Huygens\u2019%20gedachten%20over%20God%20in%20zijn%20Cosmotheoros%20en%20andere%20geschriften"}}]}}}

Journals with an exportfile named 'journal-1.xml' all have this 'date added' problem. They all have a Date added to DOAJ: 2010-04-28.

And some more feedback:

I uploaded the Commons file again yesterday (journal-1.xml).

https://doaj.org/admin/admin_site_search?source={%22query%22:{%22filtered%22:{%22query%22:{%22query_string%22:{%22query%22:%22%20International%20Journal%20of%20the%20Commons%22,%22default_operator%22:%22AND%22}},%22filter%22:{%22bool%22:{%22must%22:[{%22term%22:{%22_type%22:%22article%22}},{%22term%22:{%22bibjson.year.exact%22:%222014%22}},{%22term%22:{%22index.country.exact%22:%22Netherlands%22}}]}}}}}

It took some time, but this time it worked and the articles are now in the DOAJ. But here too the date added to DOAJ is wrong, so the system does not say how many new articles it uploaded. For instance, the first article of this issue, 'The role of agri-environmental contracts in saving biodiversity in the post-socialist Czech Republic', also has a Date added to DOAJ of 2010-04-28, while the article was published in 2014 (March 6):

From the exportfile:

<record>
  <language>eng</language>
  <journalTitle>International Journal of the Commons</journalTitle>
  <eissn>1875-0281</eissn>
  <publicationDate>2014-03-06</publicationDate>
  <volume>8</volume>
  <issue>1</issue>
  <startPage>1</startPage>
  <endPage>25</endPage>
  <publisherRecordId>198</publisherRecordId>
  <title language="eng">The role of agri-environmental contracts in saving biodiversity in the post-socialist Czech Republic</title>
  <authors>
    <author><name>Jaroslav Prazan</name><email>prazan.jaroslav@uzei.cz</email><affiliationId>0</affiliationId></author>
    <author><name>Insa Theesfeld</name><email>theesfeld@iamo.de</email><affiliationId>1</affiliationId></author>
  </authors>
  <affiliationsList>
    <affiliationName affiliationId="0">Department of Rural Development, Institute of Agricultural Economics and Information, the Czech Republic</affiliationName>
    <affiliationName affiliationId="1">Leibniz Institute of Agricultural Development in Central and Eastern Europe (IAMO), Germany</affiliationName>
  </affiliationsList>
  <abstract language="eng">Agri-Environmental Schemes are a voluntary policy measure of the Common Agricultural Policy of the European Union. Since 2004, these have been implemented in the post-socialist new Member States. Agri-Environmental Schemes could help to achieve a higher level of biodiversity in protected landscapes. In particular, we analyse whether such types of contract between farmers and state organisations represent a useful tool in the protection of shared natural resources, such as biodiversity. We analyse the determinants that allow for such a policy to be implemented more successfully. In addition, the administrative structure of such a policy measure is very complex since responsibilities overlap among various administrative units, and transactions between farmers and government need to be regulated. Therefore, institutional cooperation among so many parties is challenging.
We analyse why implementation has been easier in some Protected Landscape Areas (PLAs) than in others. The research focuses on selected factors which showed differences in performance. In particular, these factors are trust and reciprocity between farmers and state administrative bodies, information spreading and the availability of advisory services. Despite the demanding process, we find an indication that trust tends to grow following a previous good experience. The case study was carried out in two large and two small PLAs in the Czech Republic.</abstract>
  <fullTextUrl format="html">http://www.thecommonsjournal.org/index.php/ijc/article/view/400</fullTextUrl>
  <keywords>
    <keyword>agri-environmental measures</keyword>
    <keyword>biodiversity</keyword>
    <keyword>policy implementation</keyword>
    <keyword>governance structure</keyword>
    <keyword>trust</keyword>
    <keyword>coordination problem</keyword>
  </keywords>
</record>

richard-jones commented 10 years ago

I believe I've identified the problem, and it's to do with Publisher Record IDs.

Essentially, an article in the DOAJ has the same publisher record id as one that is being uploaded, but they are NOT the same article from the publisher's perspective (despite the publisher having assigned them both the same id).

I think we have to take publisher record id matching as unreliable, and drop support for it. I will do that now ...

richard-jones commented 10 years ago

In the XML they provided us, the killer line is:

<publisherRecordId>329</publisherRecordId>

richard-jones commented 10 years ago

@emanuil-tolev I've pushed a fix for this to the phase2b branch - if you could roll it out when you get a chance that would be great.

richard-jones commented 10 years ago

@dommitchell the problem here is that we assumed that publisher record ids would be unique record ids maintained by the publisher (not unreasonable!). It turns out that they are not, so the fix I've just pushed removes support for de-duplicating based on that criterion. This leaves us with de-duplication based only on DOI and fulltext URL.
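As a rough illustration of matching on DOI and fulltext URL only, here is a minimal sketch. The `Article` class and `find_duplicate` function are hypothetical names made up for this example and are not taken from the DOAJ codebase; the URLs are placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Article:
    title: str
    doi: Optional[str] = None
    fulltext_url: Optional[str] = None
    # Kept as plain metadata in this sketch; never used as a match key.
    publisher_record_id: Optional[str] = None


def find_duplicate(incoming: Article, existing: Iterable[Article]) -> Optional[Article]:
    """Return the stored article that matches on DOI or fulltext URL, if any."""
    for candidate in existing:
        if incoming.doi and incoming.doi == candidate.doi:
            return candidate
        if incoming.fulltext_url and incoming.fulltext_url == candidate.fulltext_url:
            return candidate
    return None


stored = [
    Article("Old 2010 article",
            fulltext_url="http://example.org/article/view/100",
            publisher_record_id="329"),
]
new = Article("New 2014 article",
              fulltext_url="http://example.org/article/view/395",
              publisher_record_id="329")

# Same publisher record id, different fulltext URL: not treated as a duplicate.
match = find_duplicate(new, stored)
```

With this kind of matcher, two records that share a publisher record id but point at different URLs are stored side by side instead of one replacing the other.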

The consequences of things so far are that any publisher which pushed a record with a publisher id which matched an existing publisher id will have overwritten that old article. If the publisher record id has changed its association with an article in the meantime, this will mean that old articles will get overwritten with incorrect new ones.

This accounts for the odd created date in this case. The created date was completely correct as far as the system was concerned: the publisher record id caused a record created in 2010 to be overwritten with a record from 2014, and since created dates are preserved across versions, the new article kept an "old" created date.
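To make the mechanism concrete, here is a minimal sketch of an overwrite keyed on publisher record id that preserves the created date. The in-memory `index` and the function `upsert_by_publisher_record_id` are assumptions for illustration only, not the actual DOAJ ingest code; the titles are placeholders, while the id and dates are the ones reported in this issue.

```python
from datetime import date

# Hypothetical index keyed by publisher record id (the old, unsafe match key).
index = {
    "329": {"title": "An article from 2010", "created_date": date(2010, 4, 28)},
}


def upsert_by_publisher_record_id(index, record_id, title, today):
    """Overwrite on id match, carrying the original created date forward."""
    existing = index.get(record_id)
    created = existing["created_date"] if existing else today
    index[record_id] = {"title": title, "created_date": created}
    return index[record_id]


result = upsert_by_publisher_record_id(
    index, "329", "An article published in March 2014", today=date(2014, 3, 21)
)
print(result["created_date"])  # 2010-04-28
```

The 2014 article ends up stored under the 2010 created date, and the 2010 article's metadata is gone from the index.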

Not many publishers are using publisher record ids, and those that are will only be affected if they have uploaded data where those record ids have changed their article association, which I suspect is a result of an OJS re-install or some other import/export operation which reassigned database row numbers. Hopefully, therefore, the impact is small. We just need to monitor whether anyone else has had this problem, and see if there's any data recovery which needs to be done.
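One possible way to monitor for other affected records is to flag any article whose created date is earlier than its publication date. This is only a heuristic sketch: the record layout and the function name `suspicious_records` are assumptions, the second sample record is invented, and articles legitimately uploaded ahead of their publication date would show up as false positives.

```python
from datetime import date


def suspicious_records(records):
    """Yield records whose created date is earlier than their publication date."""
    for rec in records:
        if rec["created_date"] < rec["publication_date"]:
            yield rec


records = [
    # Dates as reported in this issue: published 2014-03-19, "added" 2010-04-28.
    {"id": "a", "publication_date": date(2014, 3, 19), "created_date": date(2010, 4, 28)},
    # Invented healthy record for comparison.
    {"id": "b", "publication_date": date(2014, 3, 6), "created_date": date(2014, 4, 8)},
]

flagged = [rec["id"] for rec in suspicious_records(records)]
print(flagged)  # ['a']
```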

We do not yet have historical metadata implemented, which is a shame, but it underlines the importance of us doing that work in the near future too.

emanuil-tolev commented 10 years ago

> @emanuil-tolev I've pushed a fix for this to the phase2b branch - if you could roll it out when you get a chance that would be great.

Cherry-picked into master and released.

dommitchell commented 10 years ago

Firstly, thank you for the quick response! Some questions:

> The consequences of things so far are that any publisher which pushed a record with a publisher id which matched an existing publisher id will have overwritten that old article. If the publisher record id has changed its association with an article in the meantime, this will mean that old articles will get overwritten with incorrect new ones.

OK, so I need to go back to the publisher and ask her to check articles from 2010 and reupload them? Do they need to reupload the ones from March 2014 issue too?

> Not many publishers are using publisher record ids

Do we know that for sure?

> will only be affected if they have uploaded data where those record ids have changed their article association, which I suspect is a result of an OJS re-install or some other import/export operation which reassigned database row numbers.

This is something I want to discuss with Alec this evening then

> We just need to monitor whether anyone else has had this problem, and see if there's any data recovery which needs to be done.

Do you mean that this is something CL can do or something else?

> We do not yet have historical metadata implemented

What is this?

dommitchell commented 10 years ago

Notes from call with Alec: Publisher record ID is a database ID internally. Numbers can only change from one version of OJS to another with an uninstall/reinstall. Otherwise the ID should be unique. Ask the publisher if they have uninstalled and reinstalled a version of OJS.

emanuil-tolev commented 10 years ago

> Publisher record ID is a database ID internally.

OK, so that means that this is certainly not a technically reliable thing to use. The publisher will almost certainly not know the plugin is producing IDs which are not suitable for external systems. And there could well be non-OJS publishers doing this.

A shame actually, it would have given technically adept publishers a lot of easy control over their content.

emanuil-tolev commented 10 years ago

> > We do not yet have historical metadata implemented

> What is this?

Each record keeps a timestamped copy of its old metadata (or the changes which occurred) when a change is made, so you can see the record changing over time. It's not trivial to implement but would make mistakes like the above easier to fix. It can have many other advantages too, depending on what site functions are implemented to make use of it beyond data recovery.

EDIT: This was discussed back in December I believe.
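As a rough idea of what that could look like, here is a minimal sketch of a record that snapshots its metadata before every change. `VersionedRecord` and its methods are hypothetical names for illustration and say nothing about how it would actually be implemented in DOAJ.

```python
import copy
from datetime import datetime, timezone


class VersionedRecord:
    """Hypothetical record wrapper that snapshots its metadata before each change."""

    def __init__(self, metadata):
        self.metadata = dict(metadata)
        self.history = []  # list of (timestamp, previous metadata) tuples

    def update(self, new_metadata, now=None):
        """Store a timestamped copy of the current metadata, then replace it."""
        timestamp = now or datetime.now(timezone.utc)
        self.history.append((timestamp, copy.deepcopy(self.metadata)))
        self.metadata = dict(new_metadata)

    def restore_previous(self):
        """Roll back to the most recent snapshot, if there is one."""
        if not self.history:
            raise ValueError("no earlier version to restore")
        _, previous = self.history.pop()
        self.metadata = previous


record = VersionedRecord({"title": "Article from 2010"})
record.update({"title": "Article from 2014"})  # the accidental overwrite
record.restore_previous()                      # recover the 2010 metadata
```

With something like this in place, an accidental overwrite could be undone from the record's own history instead of from a backup.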

dommitchell commented 10 years ago

I'm going to have to query @richard-jones's and Alec's theory on what is going on here. Suppose we go by the theory that the Journal of Social Intervention's db on OJS had somehow been reset, so that the publisher record IDs had also been reset and are therefore no longer unique, and that 8 articles uploaded by the publisher for the March 2014 issue already had IDs in DOAJ, 4 from April 2010 and 4 from January 2010.

Would we therefore not be seeing articles missing from the 2010 issues on DOAJ? I have crosschecked the articles here: http://doaj.org/toc/673edd8db38949e38a1a523f981f0861/18 and http://doaj.org/toc/673edd8db38949e38a1a523f981f0861/19 with http://www.journalsi.org/index.php/si/issue/view/23 and http://www.journalsi.org/index.php/si/issue/view/24 and all articles are present and correct.

emanuil-tolev commented 10 years ago

The ToC regeneration has been delayed slightly due to the load issues with lots of search queries going to the server (which is another story altogether). It can be re-enabled though, as it has been ruled out as the cause. Try looking at the ToCs in an hour or two and see if those articles are still there.

dommitchell commented 10 years ago

The March 2014 articles were uploaded in March so surely the ToCs would have been regenerated by now?

emanuil-tolev commented 10 years ago

Yes, but the uploads that overwrote them happened in April, did they not? Sorry for the confusion, could be unrelated in that case.

richard-jones commented 10 years ago

Ok, I am investigating again, to see if there are any other possibilities.

On 10 April 2014 10:01, Emanuil Tolev notifications@github.com wrote:

> Yes, but the uploads that overwrote them happened in April, did they not? Sorry for the confusion, could be unrelated in that case.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

richard-jones commented 10 years ago

Extensive investigation has taken place on this issue. The key culprit seems to be the non-uniqueness of the publisher record id, and that publishers may re-use them across different publications. This could cause articles with matching ids to erroneously match because they match in any part of the publisher's corpus of literature, not just within a single journal.

Matching by publisher record id has been permanently disabled in the software and in the live service, and a recording of the history of articles has been added so that any similar issues in the future can be effectively debugged.
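For anyone wanting to check an export for this kind of re-use, the following sketch groups `publisherRecordId` values by journal and reports ids that occur in more than one journal. The element names follow the records quoted earlier in this issue; the enclosing `<records>` element, the sample data, and the function name `reused_record_ids` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample data; the <records> wrapper is an assumption.
EXPORT = """<records>
  <record><journalTitle>Journal A</journalTitle><issn>1111-1111</issn>
    <publisherRecordId>329</publisherRecordId><title language="eng">First</title></record>
  <record><journalTitle>Journal B</journalTitle><issn>2222-2222</issn>
    <publisherRecordId>329</publisherRecordId><title language="eng">Second</title></record>
  <record><journalTitle>Journal B</journalTitle><issn>2222-2222</issn>
    <publisherRecordId>330</publisherRecordId><title language="eng">Third</title></record>
</records>"""


def reused_record_ids(xml_text):
    """Return {publisherRecordId: set of ISSNs} for ids seen in two or more journals."""
    journals_by_id = defaultdict(set)
    for record in ET.fromstring(xml_text).iter("record"):
        record_id = record.findtext("publisherRecordId")
        journal = record.findtext("issn") or record.findtext("eissn")
        if record_id and journal:
            journals_by_id[record_id].add(journal)
    return {rid: issns for rid, issns in journals_by_id.items() if len(issns) > 1}


collisions = reused_record_ids(EXPORT)
```

Here `collisions` contains only id 329, which is used by both sample journals; id 330 is unique and is not reported.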

Closing this issue for the moment - I don't think there is any more we can do through investigating the data or the code. The code passes all the relevant checks, and there is too much data and not enough provenance to draw any strong conclusions.

dommitchell commented 10 years ago

Account number: 18750281 Name: Igitur Journals

Ars Disputandi : the Online Journal for Philosophy of Religion ISSN(s): 1566-5399

BMGN : Low Countries Historical Review ISSN(s): 0165-0505, 2211-2898

De Zeventiende Eeuw : Cultuur in de Nederlanden in Interdisciplinair Perspectief ISSN(s): 0921-142X, 2212-7402

European Review of Latin American and Caribbean Studies ISSN(s): 0924-0608, 1879-4750

Studium : Tijdschrift voor Wetenschaps- en Universiteits-Geschiedenis ISSN(s): 1876-9055, 2212-7283

Incontri : Rivista Europea di Studi Italiani ISSN(s): 0169-3379

International Journal for Court Administration ISSN(s): 2156-7964

International Journal of Integrated Care ISSN(s): 1568-4156

International Journal of the Commons ISSN(s): 1875-0281

Journal of Chain-Computerisation ISSN(s): 1879-9523

Journal of Indonesian Social Sciences and Humanities ISSN(s): 1979-8431

Journal of Social Intervention : Theory and Practice ISSN(s): 1876-8830

Liber Quarterly : The Journal of European Research Libraries ISSN(s): 2213-056X, 2213-056X

Neerlandistiek.nl ISSN(s): 1567-6633

New West Indian Guide ISSN(s): 1382-2373

Religion and Gender ISSN(s): 1878-5417

Revue Electronique de Litterature Francaise : RELIEF ISSN(s): 1873-5045

TS·> Tijdschrift voor Tijdschriftstudies ISSN(s): 1386-5870

Utrecht Law Review ISSN(s): 1871-515X

richard-jones commented 10 years ago

Data has been restored from backup