DOAJ / doaj

The Directory of Open Access Journals - website and directory software
Apache License 2.0
55 stars 16 forks source link

Correct the license names in the facet so they match creative commons #242

Closed dommitchell closed 9 years ago

dommitchell commented 10 years ago

http://creativecommons.org/licenses/

There should be no hyphen between CC and what follows. eg CC BY, CC BY-SA etc

Needs to be corrected on the facet, on the Application form and in database(??)

http://doaj.org/application/new#license-container

Steven-Eardley commented 10 years ago

In the database, the entries for license appear as follows:

            "license": [
              {
                "type": "CC by-nc-nd",
                "title": "CC by-nc-nd"
              }

            "license": [
              "CC by"
            ],

so I don't think these need changing. I'll change it on the search and application form, plus the main page footer.

richard-jones commented 10 years ago

Do we need also to run a data migration to fix any values that have already been written?

Steven-Eardley commented 10 years ago

Yes, but to fix issue #241, not this one.

Unless: you don't like the inconsistencies with how we display the licenses - in the forms we use upper case (like creativecommons.org does) but in the index, and therefore the search facets, you see lower case for the hyphenated parts. See above snippet.

richard-jones commented 10 years ago

Hm, good point. I don't have any strong feelings about it - what do you think? If we want to become consistent throughout the whole site, and we have to run a data migration anyway then this might be the time to do it.

Steven-Eardley commented 10 years ago

I think it would be nice to run the migration and make them all as Creative Commons labels them, plus another change to datasets.py. Would certainly help me if you can show me the steps.

Which means I disagree with what I first said; the database does need changing.

richard-jones commented 10 years ago

To do this properly, we need to experiment with elasticsearch's scroll search:

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-scroll.html

This is because what we should do is page through the ES result set, make the changes, and then bulk write the updated articles. Scroll search will help us ensure that changes we make to the data doesn't affect the search results we use to iterate through the data we want to migrate.

There is an .iterate() function on the underlying DAO but it doesn't use scroll search, so is only suitable for read-only iteration. We should consider just replacing that iterate function with scroll search support for this.

Steven-Eardley commented 10 years ago

The form text changes are taken care of; the data migration has yet to be done. More relevant to issue #241 anyway.

Steven-Eardley commented 10 years ago

The script still misses a few journals when I'm testing it, leaving some with the wrong label. It does fix issue #241 though.

Steven-Eardley commented 10 years ago

@dommitchell we've come across a snag with changing the license names in the index (and therefore the facets) - There are some articles orphaned from journals, which means they can't pick up the new license string once it's changed by the migration over journals. I.e. these articles contain ISSNs which don't correspond to any journal in the index.

The end result is that even more duplicate license labels are created! Does this need to be solved separately before proceeding with the license migration?

emanuil-tolev commented 10 years ago

I guess going over all the articles as well will be even harder than going just over the journals, which we're already doing?

Steven-Eardley commented 10 years ago

Yes, particularly as the Article model isn't designed to get/set the licenses, rather take them from the parent journal. There shouldn't be any articles without parent journals, should there?

emanuil-tolev commented 10 years ago

No, no, that's certainly a problem. I was just wondering if there was a way to bruteforce our way through the objects regardless. But articles should take this info from their parent journals, no doubt.

One solution might be to: 1/ run a script (regularly) which deactivates all articles without a parent journal, i.e. sets active to False 2/ the license replacement script runs 3/ the journal-info-in-articles script runs

That way you end up with correct info in all articles who have a parent, and the other ones won't show up in the .. well, only in the public facets :). Admin area would still show the old license values.

Steven-Eardley commented 10 years ago

There is a way we can do so, my original (until I knew better) approach to the migration was to perform a regex substitution for the license string in the serialised bibjson. That would force the change!

On 1 Sep 2014, at 21:03, Emanuil Tolev notifications@github.com wrote:

No, no, that's certainly a problem. I was just wondering if there was a way to bruteforce our way through the objects regardless. But articles should take this info from their parent journals, no doubt.

One solution might be to: 1/ run a script (regularly) which deactivates all articles without a parent journal, i.e. sets active to False 2/ the license replacement script runs 3/ the journal-info-in-articles script runs

That way you end up with correct info in all articles who have a parent, and the other ones won't show up in the .. well, only in the public facets :). Admin area would still show the old license values.

— Reply to this email directly or view it on GitHub.

dommitchell commented 10 years ago

I'd like @richard-jones to comment here. It was my understanding that all orphaned articles had either been removed during the original migration or matched to their parent publisher. There should be NO orphaned articles left in DOAJ. They were a remnant of the previous platform where if a journal was removed, the articles didn't automatically follow. (See email I sent)

This platform has that functionality built into it.

emanuil-tolev commented 10 years ago

Tech note: I need to make esprit a dependency of the DOAJ software. The new code will then work and can be automatically deployed to staging and evtl. live. That's delaying the running of the script a bit, I will do it soon as I can though - not a complex thing.

richard-jones commented 10 years ago

I've transferred esprit over to CottageLabs

richard-jones commented 10 years ago

I commented by email, but not here it seems: I don't think that the orphaned articles is a clear cut problem - it seems likely to be an artefact of the test environment not having a complete data set. I'd like to run the script in the staging environment, and then see if there were any issues there.

emanuil-tolev commented 10 years ago

@richard-jones I should have maybe made clear in the esprit issue that it wasn't urgent since I had a way of doing it anyway :).

So, I've merged all the #242 code and ran it on http://staging.doaj.cottagelabs.com . It's all finished. Currently a job is running to update all the articles with the correct license from their parent journal. I imagine when that's finished (maybe 2-3 more hours) this issue and #241 should be ready for review!

richard-jones commented 10 years ago

I have just been checking into the results on the staging server, and the results are interesting. It has worked with the exception of 2366 articles with CC by-nc licences and 15 articles with CC by-nc-nd licences.

A non-comprehensive spot check of the items suggest that the ISSNs of the journals they claim to be part of do not exist (either in staging, or in live), which would explain the results.

But, I checked for the journals that these articles claim to be from, and they are in the DOAJ, just without the ISSN that the article claims for that journal.

For example, consider the CC by-nc-nd articles: http://staging.doaj.cottagelabs.com/search?source={%22query%22:{%22filtered%22:{%22query%22:{%22match_all%22:{}},%22filter%22:{%22bool%22:{%22must%22:[{%22term%22:{%22index.license.exact%22:%22CC%20by-nc-nd%22}}]}}}}}

One of these is called "Limitations of Knowledge Management and Competitive Advantage", it says the ISSN of the journal is 1176-4120 "Journal of Applied Computing and Information Technology".

A search for 1176-4120 produces no results, but a search for "Journal of Applied Computing and Information Technology" yields a result with ISSN 2230-4398.

So, these articles have become separated from their parent because their parent does not have the ISSN in its data. Since these articles have some journal metadata associated with them, those journals must have had that ISSN at some point in the past.

@dommitchell - do you recognise that journal as being one which may have been updated to remove the ISSN?

So, the articles are not orphaned, but their parents have disowned them! This can only happen if the ISSN of a journal is physically removed from its record, which should never happen, so we need to investigate this (propose discussion at next call).

In the mean time, for this issue, I recommend that we add a script which picks up these articles which have not been updated, and updates their licence metadata directly, so we can close this off and move on.

@Steven-Eardley - could you take a look at adding an additional script which iterates over articles with the old licence names and updates them to the new ones. Sort of like the journal update script, but seed it with a query for CC by-nc and CC by-nc-nd licences articles, then just do the licence name substitution as before.

emanuil-tolev commented 10 years ago

when rolling out, ping @dommitchell a few hours before the roll-out so he can editors to stop editing for 15 min or so

dommitchell commented 10 years ago

I think that this work has caused the CC license images to drop out of search results?

http://staging.doaj.cottagelabs.com/search?source={%22query%22:{%22filtered%22:{%22query%22:{%22match_all%22:{}},%22filter%22:{%22bool%22:{%22must%22:[{%22term%22:{%22_type%22:%22journal%22}}]}}}},%22size%22:100,%22sort%22:[{%22created_date%22:{%22order%22:%22asc%22}}]}

emanuil-tolev commented 10 years ago

Oh, got it now. Ok, will check this shortly.

Sent from my HTC

----- Reply message ----- From: "dommitchell" notifications@github.com To: "DOAJ/doaj" doaj@noreply.github.com Cc: "Emanuil Tolev" emanuil@cottagelabs.com Subject: [doaj] Correct the license names in the facet so they match creative commons (#242) Date: Wed, Sep 10, 2014 14:09 I think that this work has caused the CC license images to drop out of search results?

http://staging.doaj.cottagelabs.com/search?source={%22query%22:{%22filtered%22:{%22query%22:{%22match_all%22:{}},%22filter%22:{%22bool%22:{%22must%22:[{%22term%22:{%22_type%22:%22journal%22}}]}}}},%22size%22:100,%22sort%22:[{%22created_date%22:{%22order%22:%22asc%22}}]}

— Reply to this email directly or view it on GitHub.

richard-jones commented 10 years ago

There's either a problem with the deployment or some very aggressive caching going on.

The problem arises from the javascript which renders the records. In the repository the javascript is totally correct:

https://github.com/DOAJ/doaj/blob/master/portality/static/doaj/js/facetview_results_render_callbacks.js#L138

But when I pull this file off the staging server with the web interface, I still see the old version, where "CC BY" is still "CC by", so the mapping from licence name to image is failing.

richard-jones commented 10 years ago

@emanuil-tolev - assigning this to you just now, to check the deployment. Could be to do with updating the static files? Not sure how, exactly, of course...

emanuil-tolev commented 10 years ago

A yes, of course - remember how we took out Steve's latest work on licenses out of master because the license script wasn't quite ready to run and thus the appl. form broke? That's why this file has "CC by".

I'd probably just revisit this once the license script is ready and has been run, and we switch back to using master as our release branch.

emanuil-tolev commented 10 years ago

@richard-jones not sure who should be assigned to this, wrt script.

emanuil-tolev commented 10 years ago
emanuil-tolev commented 10 years ago

Output from update_licenses on staging:

Completed. Run scripts/journalinfo.py to update the articles with the new license labels, and missed_journals.py for the missing journals. 4248 journals were updated, 21 were left unchanged, 8490 had no licence object, and 0 failed. 3243 suggestions were updated, 12105 were left unchanged, 0 had no licence object, and 0 failed.

emanuil-tolev commented 10 years ago

journalinfo took 4 hours to run on staging, hence the delay. Now done.

I've also ran missed_journals on staging. Output was 6982 lines with this as end writing 521.

So staging should be in just the right state now. "master" branch (with all the work merged in) + all the scripts have run.

@dommitchell - does the application form work as expected on http://staging.doaj.cottagelabs.com ? Data on staging was refreshed yesterday so you will have lots of very recent applications and journals to play with.

@Steven-Eardley @richard-jones you can access 95.85.48.213:9200/doaj/ directly to check if the data looks right, and of course use the web UI.

emanuil-tolev commented 10 years ago

@Steven-Eardley @richard-jones https://staging.doaj.cottagelabs.com/admin/journal/0077c5e06f224446aac192d4f0e7defd shows that there is an inconsistency between datastore and form code. I'm not clear on which is the correct form (all uppercase "CC BY-NC-ND" or lowercase "CC by-nc-nd").

In other news the license icons on search results are working correctly.

dommitchell commented 10 years ago

@emanuil-tolev yep, all looking good.

richard-jones commented 10 years ago

Could there still be an issue with the deployment? Looking at the code for the page, the value attribute of the radio button is "CC by", but in my local dev instance it is "CC BY".

emanuil-tolev commented 10 years ago

Hm. You on master? Since staging is (should be, I'll double check) on master too. That was sort of the way in which we would reintroduce all the licensing work @Steven-Eardley did, just switch back to master where it all lives. But thanks anyway, if your local master branch is CC BY I'll investigate asap. All scripts have ran successfully, this is the last thing to get out of the way.

emanuil-tolev commented 10 years ago

Works now, code must have been wrong version or app not reloaded on staging. Will roll out license stuff which will also enable any queued static_pages merges and so forth to be deployed.

emanuil-tolev commented 9 years ago

Rolled out to live. All looks good but feel free to feedback problems. I will close this issue at some point soon if no more comments are made.

dommitchell commented 9 years ago

Looking good.