Re-Application Initialisation Scripts

richard-jones commented 9 years ago

As a one-off task at the start of the re-application process, we need to create new applications for all relevant journals following the process:

Select all journals which have in_doaj=True and a created_date prior to 19 March 2014
Create a new Application, seeding it with the data from the current journal (see #322 for the process)

For each publisher with 11 or more re-applications:

Generate a CSV file which contains all of the re-applications, with the columns ordered by ISSN (ascending) (see #316 ) and place it into appropriate storage.
Create a record in the spreadsheet type (bulk_reapplication) relating the generated CSV to the publisher who owns it.

This should be implemented as a script that just pulls together the bits which achieve the various parts of the process (#322 #316 etc)

In advance of doing this, check that the date of 19th March 2014 is absolutely the cut off for when journals in the public site started using the new data model.

richard-jones commented 9 years ago

Implementation Notes:

These are the functions to use at the various stages:

Create an application from a journal: portality.models.Journal().make_reapplication() - this will create (and save) a new reapplication, and return the object it creates.
Generate the csv from a list of reapplications: portality.reapplication.make_csv - this takes a path and a list of reapplications (in any order), and creates a csv on disk
Generate a record in the spreadsheet type: portality.models.reapplication.BulkReapplication is the model object that represents the spreadsheets

Probably best to put the script in portality/scripts rather than in portality/migrate, though it will only be run the once.

emanuil-tolev commented 9 years ago

Some pseudo code also in https://github.com/DOAJ/doaj/pull/352#issuecomment-60588101

Nimphal commented 9 years ago

make_reapplication worked fine, however, make_csv did not behave quite so nicely. After having created the reapplications and stored them in a list, I tried making a CSV, but it only created one column with questions, there were no columns with suggestions. I branched from the latest develop.

test_reapps = []
test_reapps.append(models.Suggestion.pull("8e262fdae85e411eaedbbea83a4bfdbd"))
test_reapps.append(models.Suggestion.pull("ba7c9b024ede4528bd65cd9b7a2dba20"))
reapplication.make_csv("tmp2.csv", test_reapps)

These are the suggestions in my local index.

[{"_index":"doaj","_type":"suggestion","_id":"8e262fdae85e411eaedbbea83a4bfdbd","_score":1,"_source":{"index":{"oa_statement_url":"http://oa.statement","publisher":["The Publisher","Society Institution"],"schema_subject":["LCC:Social Sciences","LCC:Economic theory. Demography"],"license":["CC MY"],"classification":["Social Sciences","Economic theory. Demography"],"title":["The Title"],"country":"United States","waiver_policy_url":"http://waiver.policy","issn":["1234-5678","9876-5432"],"language":["FR","EN"],"homepage_url":"http://journal.url","aims_scope_url":"http://aims.scope","schema_code":["LCC:HB1-3840","LCC:H"],"author_instructions_url":"http://author.instructions.com","editorial_board_url":"http://editorial.board","subject":["word","Social Sciences","key","Economic theory. Demography"]},"last_updated":"2014-10-29T12:23:47Z","admin":{"notes":[{"note":"First Note","date":"2014-05-21T14:02:45Z"},{"note":"Second Note","date":"2014-05-22T00:00:00Z"}],"contact":[{"email":"contact@email.com","name":"Contact Name"}],"application_status":"reapplication","owner":"Owner","editor_group":"editorgroup","current_journal":"pinky_princess_journal","editor":"associate"},"ticked":false,"suggestion":{"suggested_on":"2014-10-29T12:23:47Z"},"created_date":"2014-10-29T12:23:47Z","id":"8e262fdae85e411eaedbbea83a4bfdbd","bibjson":{"allows_fulltext_indexing":true,"archiving_policy":{"policy":["LOCKSS","CLOCKSS",["A national library","Trinity"],["Other","A safe place"]],"url":"http://digital.archiving.policy"},"author_publishing_rights":{"url":"http://publishing.rights","publishing_rights":"Occasionally"},"keywords":["word","key"],"apc":{"currency":"GBP","average_price":2},"subject":[{"code":"HB1-3840","term":"Economic theory. Demography","scheme":"LCC"},{"code":"H","term":"Social Sciences","scheme":"LCC"}],"article_statistics":{"url":"http://download.stats","statistics":true},"title":"The Title","publication_time":8,"provider":"Platform Host Aggregator","format":["HTML","XML","Wordperfect"],"submission_charges":{"currency":"USD","average_price":4},"plagiarism_detection":{"detection":true,"url":"http://plagiarism.screening"},"link":[{"url":"http://journal.url","type":"homepage"},{"url":"http://waiver.policy","type":"waiver_policy"},{"url":"http://editorial.board","type":"editorial_board"},{"url":"http://aims.scope","type":"aims_scope"},{"url":"http://author.instructions.com","type":"author_instructions"},{"url":"http://oa.statement","type":"oa_statement"}],"oa_start":{"year":1980},"editorial_review":{"process":"Open peer review","url":"http://review.process"},"author_copyright":{"url":"http://copyright.com","copyright":"Sometimes"},"institution":"Society Institution","deposit_policy":["Sherpa/Romeo","Store it"],"language":["EN","FR"],"license":[{"open_access":true,"embedded":true,"title":"CC MY","url":"http://licence.url","NC":true,"ND":false,"embedded_example_url":"http://licence.embedded","SA":false,"type":"CC MY","BY":true}],"alternative_title":"Alternative Title","country":"US","publisher":"The Publisher","persistent_identifier_scheme":["DOI","ARK","PURL"],"identifier":[{"type":"pissn","id":"1234-5678"},{"type":"eissn","id":"9876-5432"}]}}},{"_index":"doaj","_type":"suggestion","_id":"ba7c9b024ede4528bd65cd9b7a2dba20","_score":1,"_source":{"index":{"oa_statement_url":"http://oa.statement","publisher":["The Publisher","Society Institution"],"schema_subject":["LCC:Social Sciences","LCC:Economic theory. Demography"],"license":["CC MY"],"classification":["Social Sciences","Economic theory. Demography"],"title":["The Title"],"country":"United States","waiver_policy_url":"http://waiver.policy","issn":["1234-5678","9876-5432"],"language":["FR","EN"],"homepage_url":"http://journal.url","aims_scope_url":"http://aims.scope","schema_code":["LCC:HB1-3840","LCC:H"],"author_instructions_url":"http://author.instructions.com","editorial_board_url":"http://editorial.board","subject":["word","Social Sciences","key","Economic theory. Demography"]},"last_updated":"2014-10-29T12:23:47Z","admin":{"notes":[{"note":"First Note","date":"2014-05-21T14:02:45Z"},{"note":"Second Note","date":"2014-05-22T00:00:00Z"}],"contact":[{"email":"contact@email.com","name":"Contact Name"}],"application_status":"reapplication","owner":"Owner","editor_group":"editorgroup","current_journal":"abcdefghijk_journal","editor":"associate"},"ticked":false,"suggestion":{"suggested_on":"2014-10-29T12:23:47Z"},"created_date":"2014-10-29T12:23:47Z","id":"ba7c9b024ede4528bd65cd9b7a2dba20","bibjson":{"allows_fulltext_indexing":true,"archiving_policy":{"policy":["LOCKSS","CLOCKSS",["A national library","Trinity"],["Other","A safe place"]],"url":"http://digital.archiving.policy"},"author_publishing_rights":{"url":"http://publishing.rights","publishing_rights":"Occasionally"},"keywords":["word","key"],"apc":{"currency":"GBP","average_price":2},"subject":[{"code":"HB1-3840","term":"Economic theory. Demography","scheme":"LCC"},{"code":"H","term":"Social Sciences","scheme":"LCC"}],"article_statistics":{"url":"http://download.stats","statistics":true},"title":"The Title","publication_time":8,"provider":"Platform Host Aggregator","format":["HTML","XML","Wordperfect"],"submission_charges":{"currency":"USD","average_price":4},"plagiarism_detection":{"detection":true,"url":"http://plagiarism.screening"},"link":[{"url":"http://journal.url","type":"homepage"},{"url":"http://waiver.policy","type":"waiver_policy"},{"url":"http://editorial.board","type":"editorial_board"},{"url":"http://aims.scope","type":"aims_scope"},{"url":"http://author.instructions.com","type":"author_instructions"},{"url":"http://oa.statement","type":"oa_statement"}],"oa_start":{"year":1980},"editorial_review":{"process":"Open peer review","url":"http://review.process"},"author_copyright":{"url":"http://copyright.com","copyright":"Sometimes"},"institution":"Society Institution","deposit_policy":["Sherpa/Romeo","Store it"],"language":["EN","FR"],"license":[{"open_access":true,"embedded":true,"title":"CC MY","url":"http://licence.url","NC":true,"ND":false,"embedded_example_url":"http://licence.embedded","SA":false,"type":"CC MY","BY":true}],"alternative_title":"Alternative Title","country":"US","publisher":"The Publisher","persistent_identifier_scheme":["DOI","ARK","PURL"],"identifier":[{"type":"pissn","id":"1234-5678"},{"type":"eissn","id":"9876-5432"}]}}}]

richard-jones commented 9 years ago

Could you have a go at creating a test which fails in this way, and then we can debug it?

On 29 October 2014 12:47, Nevelina Aleksandrova notifications@github.com wrote:

make_reapplication worked fine, however, make_csv did not behave quite so nicely. After having created the reapplications and stored them in a list, I tried making a CSV, but it only created one column with questions, there were no columns with suggestions. I branched from the latest develop.

test_reapps = []test_reapps.append(models.Suggestion.pull("8e262fdae85e411eaedbbea83a4bfdbd"))test_reapps.append(models.Suggestion.pull("ba7c9b024ede4528bd65cd9b7a2dba20"))reapplication.make_csv("tmp2.csv", test_reapps)

These are the suggestions in my local index.

[{"_index":"doaj","_type":"suggestion","_id":"8e262fdae85e411eaedbbea83a4bfdbd","_score":1,"_source":{"index":{"oa_statement_url":"http://oa.statement","publisher":["The Publisher","Society Institution"],"schem a_subject":["LCC:Social Sciences","LCC:Economic theory. Demography"],"license":["CC MY"],"classification":["Social Sciences","Economic theory. Demography"],"title":["The Title"],"country":"United States","waiver_policy_url":"http://waiver.policy","issn":["1234-5678","9876-5432"],"language":["FR","EN"],"homepage_url":"http://journal.url","aims_scope_url":"http://aims.scope","schema_code":["LCC:HB1-3840","LCC:H"],"author_instructions_url":"http://author.instructions.com","editorial_board_url":"http://editorial.board","subject " :["word","Social Sciences","key","Economic theory. Demography"]},"last_updated":"2014-10-29T12:23:47Z","admin":{"notes":[{"note":"First Note","date":"2014-05-21T14:02:45Z"},{"note":"Second Note","date":"2014-05-22T00:00:00Z"}],"contact":[{"email":"contact@email.com","name":"Contact Name"}],"application_status":"reapplication","owner":"Owner","editor_group":"editorgroup","current_journal":"pinky_princess_journal","editor":"associate"},"ticked":fal se,"suggestion":{"suggested_on":"2014-10-29T12:23:47Z"},"created_date":"2014-10-29T12:23:47Z","id":"8e262fdae85e411eaedbbea83a4bfdbd","bibjson":{"allows_fulltext_indexing":true,"archiving_policy":{"policy":["LOCKSS","CLOCKSS",["A national library","Trinity"],["Other","A safe place"]],"url":"http://digital.archiving.policy"},"author_publishing_rights":{"url":"http://publishing.rights","publishing_rights":"Occasionally"},"keywords":["word","key"],"apc":{"currency":"GBP","average_ price":2},"subject":[{"code":"HB1-3840","term":"Economic theory. Demography","scheme":"LCC"},{"code":"H","term":"Social Sciences","scheme":"LCC"}],"article_statistics":{"url":"http://download.stats" < span class="p">,"statistics":true},"title":"The Title","publication_time":8,"provider":"Platform Host Aggregator","format":["HTML","XML","Wordperfect"],"submission_charges":{"currency":"USD","average_price":4},"plagiarism_detection":{"detection":true,"url":"http://plagiarism.screening"},"link":[{"url":"http://journal.url","type":"homepage"},{"url":"http://waiver.policy","type":"waiver_policy"},{"url":"http://editorial.board","type":"editorial_board"},{"url":"http://aims.scope","type":"aims_scope"},{"url":"http://author.instructions.com","type":"author_instructions"},{"url":"http://oa.statement","type":"oa_statement"}],"oa_start":{"year":1980 },"editorial_review":{"process":"Open peer review","url":"http://review.process"},"author_copyright":{"url":"http://copyright.com","copyright":"Sometimes"},"institution":"Society Institution","deposit_policy":["Sherpa/Romeo","Store it"],"language":["EN","FR"],"license":[{"open_access":true,"embedded":true,"title":"CC MY","url":"http://licence.url","NC":true,"ND":false,"embedded_example_url":"http://licence.embedded", "SA":false,"type":"CC MY","BY":true}],"alternative_title":"Alternative Title","country":"US","publisher":"The Publisher","persistent_identifier_scheme":["DOI","ARK","PURL"],"identifier":[{" type":"pissn","id":"1234-5678"},{"type":"eissn","id":"9876-5432"}]}}},{"_index":"doaj","_type":"suggestion","_id":"ba7c9b024ede4528bd65cd9b7a2dba20","_score":1,"_source":{"index":{"oa_statement_url":"http://oa.statement","publisher":["The Publisher","Society Institution"],"schema_subject":["LCC:Social Sciences","LCC:Economic theory. Demography"],"license":["CC MY"],"classification":["Social Sciences","Economic theory. Demography"],"title":["The Title"],"country":"United States","waiver_policy_url":"http://waiver.policy","issn":["1234-5678","9876-5432"],"language":["FR","EN"],"homepage_url":"http://journal.url","aims_scope_url":"http://aims.scope","schema_code":["LCC:HB1-3840","LC C:H"],"author_instructions_url":"http://author.instructions.com","editorial_board_url":"http://editorial.board","subject":["word","Social Sciences","key","Economic theory. Demography"]},"last_updated":"2014-10-29T12:23:47Z","admin":{"notes":[{"note":"First Note","date":"2014-05-21T14:02:45Z"},{"note":"Second Note","date":"2014-05-22T00:00:00Z"}],"contact":[{"email":"contact@email.com","name":"Contact Name"}],"application_status":"reapplication","owner":"Owner","editor_group":"e ditorgro up","current_journal":"abcdefghijk_journal","editor":"associate"},"ticked":false,"suggestion":{"suggested_on":"2014-10-29T12:23:47Z"},"created_date":"2014-10-29T12:23:47Z","id":"ba7c9b024ede4528bd65cd9b7a2dba20","bibjson":{"allows_fulltext_indexing":true,"archiving_policy":{"policy":["LOCKSS","CLOCKSS",["A national library","Trinity"],["Other","A safe place"]],"url":"http://digital.archiving.policy"},"author_publishing_rights":{"url":"http://publishing.rights","publishing_rights":"Occasionally"},"keywords":["word","key"],"apc":{"currency":"GBP","average_price":2},"subject":[{"code":"HB1-3840","term":"Economic theory. Demography","scheme":"LCC"},{"code":"H","term":"Social Sciences","scheme":"LCC"}],"article_statistics":{"url":"http://download.stats","statistics":true},"title":"The Title","publication_time":8,"provider":"Platform Host Aggregator","format":["HTML","XML " ,"Wordperfect"],"submission_charges":{"currency":"USD","average_price":4},"plagiarism_detection":{"detection":true,"url":"http://plagiarism.screening"},"link":[{"url":"http://journal.url","type":"homepage"},{"url":"http://waiver.policy","type":"waiver_policy"},{"url":"http://editorial.board","type":"editorial_board"},{"url":"http://aims.scope","type":"aims_scope"},{"url":"http://author.instructions.com","type":"author_instructions"},{"url " :"http://oa.statement","type":"oa_statement"}],"oa_start":{"year":1980},"editorial_review":{"process":"Open peer review","url":"http://review.process"},"author_copyright":{"url":"http://copyright.com","copyright":"Sometimes"},"institution":"Society Institution","deposit_policy":["Sherpa/Romeo","Store it"],"language":["EN","FR"],"license":[{"open_access":true,"embedded":true,"title":"CC MY","url":"http://licence.url" , "NC":true,"ND":false,"embedded_example_url":"http://licence.embedded","SA":false,"type":"CC MY","BY":true}],"alternative_title":"Alternative Title","country":"US","publisher":"The P ublisher","persistent_identifier_scheme":["DOI","ARK","PURL"],"identifier":[{"type":"pissn","id":"1234-5678"},{"type":"eissn","id":"9876-5432"}]}}}]

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-60917682.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

Nimphal commented 9 years ago

Will do that, but in the mean time, I am a bit confused what I should be counting.

For each publisher with 11 or more re-applications

From what I can see, the publisher is a list, which means there can be more than one. Is that what the script should be counting by? Or did this relate to the value in the owner field?

emanuil-tolev commented 9 years ago

From what I can see, the publisher is a list, which means there can be more than one. Is that what the script should be counting by? Or did this relate to the value in the owner field?

It's the owner field - each publisher (as in, real world, business which publishes things) has an account with DOAJ. That's what that's referring to - accounts which have 11+ reapps attached to them.

Nimphal commented 9 years ago

I created CSV tests that used directly the output I got from make_reapplication(). The problem is the unicode strings. make_reapplication returns an object with u in front of every string and make_csv() does not like that. Here is the drone build with said tests https://drone.io/github.com/DOAJ/doaj/127

richard-jones commented 9 years ago

ok, great, thanks - I was wondering about the unicode handling. I'll try to look into it later today.

On 30 October 2014 12:07, Nevelina Aleksandrova notifications@github.com wrote:

I created CSV tests that used directly the output I got from make_reapplication(). The problem is the unicode strings. make_reapplication returns an object with u in front of every string and make_csv() does not like that. Here is the drone build with said tests https://drone.io/github.com/DOAJ/doaj/127

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-61081574.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

richard-jones commented 9 years ago

Ok, I've checked out feature/reapp_csv_unicode and I'm looking at it this afternoon

richard-jones commented 9 years ago

@Nimphal - I've now pushed a fix for this to feature/reapp_csv_unicode. There were some explicit checks for the type of an argument to the csv wrapper which caught str but not unicode.

I've also added the use of the codecs module to handle file writing, since sometimes the standard python file writer barfs over utf-8.

Can you keep an eye out while you are developing for any examples of UTF-8 characters, and whether they get written correctly? codecs.open may require additional arguments (e.g. to be explicitly told "utf-8"), but we can put them in when we know if they're necessary.

emanuil-tolev commented 9 years ago

@Nimphal about unicode examples, might be good to just amend the fixtures (all of them actually, not just reapplications) so as to let all the other tests have a go at unicode. If it breaks loads of tests then leave it for now, but if they all still pass it will be nice to have unicode built into the fixtures that you're modifying anyway.

Nimphal commented 9 years ago

Okay, I will keep an eye for unicode and will fiddle with the fixtures. One more question, where on disk should the CSV files be saved?

emanuil-tolev commented 9 years ago

where on disk should the CSV files be saved?

Ah, good question. I think we were talking about another subdirectory of /cache @richard-jones ? We don't have to put them under /static, we can serve them using send_file and I'm quite convinced by your previous arguments in favour of that over e.g. nginx.

richard-jones commented 9 years ago

I think that the files should go in a directory of their own, probably parallel to cache.

There is already a directory called /upload where uploads go, so perhaps a directory called /reapp_csvs. What I would say, though, is that whatever directory we use, it should actually be a configuration option, so that we can change it in one place, and have that take effect anywhere that cares about it.

e.g. app.config.get("BULK_REAPP_PATH") - see https://github.com/DOAJ/doaj/pull/352#issuecomment-60588101

On 30 October 2014 15:26, Emanuil Tolev notifications@github.com wrote:

where on disk should the CSV files be saved?

Ah, good question. I think we were talking about another subdirectory of /cache @richard-jones https://github.com/richard-jones ? We don't have to put them under /static, we can serve them using send_file and I'm quite convinced by your previous arguments in favour of that over e.g. nginx.

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-61111471.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

Nimphal commented 9 years ago

Unicode seems to work fine now. Some things about the make_csv function. If the pissn is the same for two (or more) documents, it overwrites the column with the last one. It does not care whether the eissn is different and looks only at pissn. Another thing is, it does not overwrite files. So if an update to a publisher's csv is needed, the file needs to be deleted and recreated (I believe this has something to do with the way csv-s are read in). Should I put such an exists check in my script as an easy fix for this?

emanuil-tolev commented 9 years ago

Should I put such an exists check in my script as an easy fix for this?

I'd go with yes, do that. If it succeeds, delete the file that's already there and let the script make a new one. Can't have outdated columns in files, that'd be disastrous (e.g. if we ever had to rerun the script because of X problem).

richard-jones commented 9 years ago

If the ISSN is the same for two documents, then something is amiss. ISSNs ought to be unique to the journal - are you seeing a lot of these?

make_csv's behaviour is to extract one csv from the document - the pissn if it exists, otherwise the eissn, and use that as the column header. Column headers are unique, so you can only ever have one column with the issn.

On 31 October 2014 11:45, Nevelina Aleksandrova notifications@github.com wrote:

Unicode seems to work fine now. Some things about the make_csv function. If the pissn is the same for two (or more) documents, it overwrites the column with the last one. It does not care whether the eissn is different and looks only at pissn. Another thing is, it does not overwrite files. So if an update to a publisher's csv is needed, the file needs to be deleted and recreated (I believe this has something to do with the way csv-s are read in). Should I put such an exists check in my script as an easy fix for this?

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-61249435.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

emanuil-tolev commented 9 years ago

Alternatively we could make the CSV wrapper overwrite files by default but I know @Steven-Eardley had some trouble with write-only mode. Still we could add the exists check to the wrapper and turn on this behaviour by default. (All other parts of the system and scripts overwrite files with fresh info by default.)

So write the same code for a check + attempt to delete, but just put it straight in the CSV wrapper, not the script you're writing. Does that sound reasonable?

emanuil-tolev commented 9 years ago

are you seeing a lot of these?

I should probably note I asked Nevelina to ascertain the behaviour since her test data had 2 reapps with the same ISSNs. I don't know whether the real data has this or not - it's not necessarily a problem we should tackle, just a behaviour to know about.

richard-jones commented 9 years ago

Yes, that sounds good. It would be good to have a save with force overwrite mode or something equivalent.

On 31 October 2014 11:58, Emanuil Tolev notifications@github.com wrote:

Alternatively we could make the CSV wrapper overwrite files by default but I know @Steven-Eardley https://github.com/Steven-Eardley had some trouble with write-only mode. Still we could add the exists check to the wrapper and turn on this behaviour by default. (All other parts of the system and scripts overwrite files with fresh info by default.)

So write the same code for a check + attempt to delete, but just put it straight in the CSV wrapper, not the script you're writing. Does that sound reasonable?

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-61250461.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

emanuil-tolev commented 9 years ago

So write the same code for a check + attempt to delete, but just put it straight in the CSV wrapper, not the script you're writing. Does that sound reasonable?

I realised what problem @Steven-Eardley had with this - how is the CSV wrapper to know whether you intend to write or read to the file in question? We can't delete a file we're about to read from :). We have to delete at the level where the knowledge of what we need (write/read) is. So doing it in the script is actually fine.

Steven-Eardley commented 9 years ago

That sounds right!

On 31/10/14 14:43, Emanuil Tolev wrote:

So write the same code for a check + attempt to delete, but just
put it straight in the CSV wrapper, not the script you're writing.
Does that sound reasonable?
I realised what problem @Steven-Eardley https://github.com/Steven-Eardley had with this - how is the CSV wrapper to know whether you intend to write or read to the file in question? We can't delete a file we're about to read from :). We have to delete at the level where the knowledge of what we need (write/read) is. So doing it in the script is actually fine.

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-61270177.

Nimphal commented 9 years ago

The script is now complete, drone is chewing it over. One thing to note - every time it runs it will create new reapplications for the same journals.

richard-jones commented 9 years ago

I mentioned this briefly over on the pull request comment thread, but adding here for completeness - probably it is worth having the re-application script remove all the old re-applications when it is run, just for ease of testing.

Though actually this is more complex than just delete by query on Suggestions with status "reapplication" as reapplications are linked to their Journals, so you'd also need to update each journal record. It might be interesting to have a Journal.remove_reapplication function which deals with this for you.

Will leave it up to you though, as to whether this would make testing and implementing easier.

Nimphal commented 9 years ago

The latest version of the script deals with unicode (which was quite an issue, actually, as the Python CSV library does not deal with unicode by itself, so @emanuil-tolev and I ended up patching it). The script also removes all reapplications before creating new ones. There is no need to do anything to the journal's reference to a reapplication, since make_reapplication already updates the relevant field.

self.set_current_application(reapp.id)

emanuil-tolev commented 9 years ago

Ah excellent - so is this ready to be run on the test server's dataset?

emanuil-tolev commented 9 years ago

Also cc @richard-jones @Steven-Eardley - new classes have been added to the portality.clcsv module. @Nimphal is right, the Python csv module does not deal with unicode strings. At all, I was pretty disappointed. You have to .encode them to a byte stream. So I copied over some UnicodeReader/Writer generators from the official docs and modified them till we had a working Unicode CSV reading and writing, @Nimphal wrote some proper unicode tests for it.

In case anybody is interested, the portality.scripts.journalcsv script which makes the DOAJ csv at doaj.org/csv handles this by going over each cell in each row and .encode-ing it. So this approach with generators that modify data on the fly is much much better - you can just throw unicode objects at them and they will write a correct CSV, and they are (now) unit tested.

Nimphal commented 9 years ago

Yes, it is ready to be run on test.

richard-jones commented 9 years ago

A couple of things came up in testing today for the initialisation script:

I decided that it probably belongs in the migrate directory. I've done this in my own branch for the moment: https://github.com/DOAJ/doaj/tree/feature/richardreview
The code which generates the email list of rejected applications counts all rejected applications, not just those created in #306 . I've started working on a bunch of minor enhancements in my review branch, and have annotated that with where the code is and where it probably should be:

https://github.com/DOAJ/doaj/blob/feature/richardreview/portality/migrate/reapplication/reapp_init.py#L78

https://github.com/DOAJ/doaj/blob/feature/richardreview/portality/migrate/reapplication/reject_old.py#L41

We need to do some data massaging for some of the email fields - could you make it so that the contact email field is inspected for comma separated lists of emails, emails separated with a "/" and emails separated with " or " (I've seen all of them crop up at some time - there might be others too but it's difficult to guess them). When a field like that is detected, write multiple lines to the email output csv, with a duplicated account field and different email on each line.
I've been trying to work out how to test this in a real environment and I've got quite close but not exact. For example, in my test dataset I was expecting reapplications for 6092 accounts, but got 6137. And I was expecting 50 csvs, but I got 56. I suspect these are to do with the ES queries that I used to predict the results - do you have any queries that you used in testing that you can share which will help us verify that everything has worked as expected in the test environment?

emanuil-tolev commented 9 years ago

The code which generates the email list of rejected applications counts all rejected applications, not just those created in #306 .

I thought that all rejected apps are gonna be deleted by the other script #306 . Then it will reject all pending apps. So then #309 (this issue) gets run, and it makes the right email lists.

richard-jones commented 9 years ago

Nope. See #306.

New rejections are created first (script 1).

All rejections are then removed later (script 2).

Nimphal commented 9 years ago

Does just the contact email field have these multiple values in it? So one name, but various emails for it?

2014-11-12 17:28 GMT+00:00 Richard Jones notifications@github.com:

Nope. See #306 https://github.com/DOAJ/doaj/issues/306.

New rejections are created first (script 1).

All rejections are then removed later (script 2).

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-62757117.

Nimphal commented 9 years ago

If I understood correctly, you suggest moving the rejected email csv code to the reject_old script. Is that correct?

2014-11-17 10:19 GMT+00:00 Nevelina Aleksandrova nevelina@cottagelabs.com:

Does just the contact email field have these multiple values in it? So one name, but various emails for it?

2014-11-12 17:28 GMT+00:00 Richard Jones notifications@github.com:

Nope. See #306 https://github.com/DOAJ/doaj/issues/306.

New rejections are created first (script 1).

All rejections are then removed later (script 2).

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-62757117.

richard-jones commented 9 years ago

Yes, I think that's right. The reject_old script will pull give you a hook into the applications which are being rejected /right now/, so you can just skim the contact details off them as they go past.

The contact email field is the one which sometimes has multiple email addresses in it - people have abused it over the years to cram additional email addresses in, so you see stuff like "person@email.com, otherperson@email.com", and "person@email.com/otherperson@email.com" and even "person@email.com or otherperson@email.com"!

On 17 November 2014 11:03, Nevelina Aleksandrova notifications@github.com wrote:

If I understood correctly, you suggest moving the rejected email csv code to the reject_old script. Is that correct?

2014-11-17 10:19 GMT+00:00 Nevelina Aleksandrova nevelina@cottagelabs.com:

Does just the contact email field have these multiple values in it? So one name, but various emails for it?

2014-11-12 17:28 GMT+00:00 Richard Jones notifications@github.com:

Nope. See #306 https://github.com/DOAJ/doaj/issues/306.

New rejections are created first (script 1).

All rejections are then removed later (script 2).

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-62757117.

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-63289123.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

Nimphal commented 9 years ago

Gotcha! I just saw you asked about my queries. My local database is fairly barren, I created a few documents with varied properties and then various .all methods gave me the numbers I needed. I'm afraid I don't really have queries to share.

2014-11-17 11:07 GMT+00:00 Richard Jones notifications@github.com:

Yes, I think that's right. The reject_old script will pull give you a hook into the applications which are being rejected /right now/, so you can just skim the contact details off them as they go past.

The contact email field is the one which sometimes has multiple email addresses in it - people have abused it over the years to cram additional email addresses in, so you see stuff like "person@email.com, otherperson@email.com", and "person@email.com/otherperson@email.com" and even "person@email.com or otherperson@email.com"!

On 17 November 2014 11:03, Nevelina Aleksandrova notifications@github.com

wrote:

If I understood correctly, you suggest moving the rejected email csv code to the reject_old script. Is that correct?

2014-11-17 10:19 GMT+00:00 Nevelina Aleksandrova < nevelina@cottagelabs.com>:

Does just the contact email field have these multiple values in it? So one name, but various emails for it?

2014-11-12 17:28 GMT+00:00 Richard Jones notifications@github.com:

Nope. See #306 https://github.com/DOAJ/doaj/issues/306.

New rejections are created first (script 1).

All rejections are then removed later (script 2).

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-62757117.

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-63289123.

Richard Jones,

Founder, Cottage Labs t: @richard_d_jones, @cottagelabs w: http://cottagelabs.com

— Reply to this email directly or view it on GitHub https://github.com/DOAJ/doaj/issues/309#issuecomment-63289624.

Nimphal commented 9 years ago

I have pushed my code to the branch feature/email_scripts_update . The rejected emails code has been moved and it can take multiple emails separated by "," , "or" , "/" and create multiple entries in the spreadsheet.

richard-jones commented 9 years ago

Great, thanks. Looking good. Could you also include the email address separation code into the reapp_init script?

richard-jones commented 9 years ago

Then when you're ready, just make a PR

emanuil-tolev commented 9 years ago

PR #394 has been updated, this is the new change I believe e905ba97ecdb49b5ed3471dc01a6478bcd95f520

richard-jones commented 9 years ago

@dommitchell - in order to check this do the following:

check that the reapplications have started correctly for a variety of publishers
check that the email lists contain the correct information

dommitchell commented 9 years ago

All good!

DOAJ / doaj

Re-Application Initialisation Scripts #309