More evidence from Hindawi: "Yes, uploading files is working fine, thank you. But I noticed that some of the re-uploaded files appear duplicated on your website! I think there is a bug that occurs when parsing the metadata XML on your systems."
https://doaj.org/toc/1563-5147/2014 - 101473 (3 times), 101626 (3 times), 101808 (3 times)
https://doaj.org/toc/1687-4129/2014 - 102621 (3 times), 103418 (3 times)
More from Scielo, uploaded on 14th August:
https://doaj.org/toc/1982-4327 - vols 22-25
https://doaj.org/toc/0870-6352
https://doaj.org/toc/1982-4327
I'd like to get this fixed asap, thanks.
Some more articles uploaded yesterday, 17th August: https://doaj.org/toc/1678-9946
I think we have found out why duplication is occurring:
Refreshing the XML upload page causes articles to be duplicated. "after uploading any record to your side it appears as "pending" in our admin page; if we refresh "F5" this page while there are pending records, this will cause the records to be duplicated. The repetition depends on the refresh count! If I refreshed the page 2 times the record will appear 3 times (1 originally submitted + 2 times due to F5 clicks). "
So they're not getting asked "do you want to submit again", they're just refreshing the status page? Obviously if they're ignoring the warning the browser gives upon form resubmission, they're ... going to resubmit the same file. There is code to prevent the processing of duplicated records, though, so even if they literally come back and upload the same file 10 minutes later, both files will be processed but only the first should take effect.
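(Roughly, that protection looks like this - a sketch with hypothetical helper names, not the actual ingest code:)

```python
# Rough shape of the existing protection (hypothetical helper names, not the
# actual ingest code): before saving an incoming article, look for an existing
# record that matches it and overwrite that record rather than creating a new
# one, so re-uploading the same file should not add anything.

def ingest(incoming, find_duplicate, save):
    existing = find_duplicate(incoming)   # match on title/DOI/URL/ISSNs
    if existing is not None:
        incoming["id"] = existing["id"]   # re-use the existing id: overwrite, not duplicate
    save(incoming)
```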
We need to do the following things:
1. Add a redirect on form submission (see the sketch below)
2. Remove "refresh" requests from ingestarticles.py
3. Introduce esprit's blocking save feature on the final article submitted in a file
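For (1), the idea is the standard Post/Redirect/Get pattern. A minimal sketch, assuming a Flask-style view (the route name and helpers are made up for illustration, not the actual DOAJ code):

```python
# Minimal Post/Redirect/Get sketch (route and helpers are illustrative, not the
# real DOAJ view): answer a successful POST with a redirect to the GET view, so
# refreshing the page repeats a harmless GET instead of re-posting the file.

from flask import Flask, request, redirect, url_for

app = Flask(__name__)

def queue_upload(uploaded_file):
    # stand-in for the real ingest step (save the file and mark it "pending")
    pass

@app.route("/publisher/uploadfile", methods=["GET", "POST"])
def upload_file():
    if request.method == "POST":
        queue_upload(request.files.get("file"))
        return redirect(url_for("upload_file"))   # the PRG redirect
    return "status page listing pending uploads"  # placeholder for the real template
```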
A spot of trouble with this: I can't get my local instance to process things properly for some reason. Not forgotten about it :) - still working on it.
@emanuil-tolev can I get an update on this? More publishers getting in touch now about duplications.
Some publishers have loaded articles on a large scale. Can CL run something that will delete all the duplicates automatically?
Should be able to put the hotfix up for tech review today and roll it out shortly after (so def. this week).
> Can CL run something that will delete all the duplicates automatically?
Yeah, we could - what do you think defines a duplicate article? If 2 records have the exact same title, DOI and ISSNs recorded, is that enough for us? Or shall we go even broader and say exact same title and DOI?
> If 2 records have the exact same title, DOI and ISSNs recorded, is that enough for us? Or shall we go even broader and say exact same title and DOI?
Actually, we have to have the URL in there, since matching is done on URL too. I would say:
- Same title
- DOI (if present)
- Full text URL
- ISSNs
Date created in the system would be an absolute failsafe ie date added to DOAJ.
> Date created in the system would be an absolute failsafe ie date added to DOAJ.
Yeah, but that's down to the second. I'm not sure I'd want to risk missing duplicates just because the system created, say, 3000 articles in the span of 3 seconds - there is the possibility that an article and its duplicate would be created in different seconds. We could allow a difference of up to 5 seconds (any more than that and the duplicate would have been prevented by the code anyway).
I think if that's ok I'll go for "an article is the same as another article if these fields are exactly the same across both records":
- Same title
- DOI (if present)
- Full text URL
- ISSNs
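To make the rule concrete, a minimal sketch of the comparison (field names are assumptions; illustrative only, not production code):

```python
# Illustrative sketch of the proposed duplicate test (field names are
# assumptions, not the real DOAJ article model).

def _norm(value):
    return (value or "").strip().lower()

def is_duplicate(a, b):
    same_title = _norm(a.get("title")) == _norm(b.get("title"))
    same_url = _norm(a.get("fulltext_url")) == _norm(b.get("fulltext_url"))
    same_issns = set(a.get("issns", [])) == set(b.get("issns", []))
    doi_a, doi_b = _norm(a.get("doi")), _norm(b.get("doi"))
    same_doi = doi_a == doi_b if (doi_a and doi_b) else True  # DOI only counts if both present
    return same_title and same_url and same_issns and same_doi
```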
Works for me. Thank you!
Sorry I'd forgotten we already have a match algorithm, we'll use that.
I'll sort out the form and deduplication script to use the existing code, @richard-jones doing the ingestarticles script
There's a PR for this, reviewing/rolling out now
OK, fix has been rolled out and we think this should put a stop to the duplicates appearing. I'll do the dedupe now then.
(do tell if any new ones appear or if people who've had trouble stop having trouble, etc.)
Has the dedupe not been done yet?
https://doaj.org/toc/1687-4129/2014 https://doaj.org/toc/1687-0425/2004/71
Muddling through it now. Lots of double-checking so I don't delete the entire article index etc. :)
Sorry, still didn't quite manage to run this today, but it is almost there. I had to change our current duplicate detection algorithm a bit - not in the sense of how we detect duplicates, but so that I get all the possible duplicates. Currently it returns at most one duplicate, so I'd have to run it an unknown number of times to get rid of all of them. It was just written with a different use case in mind.
That's not hard, but then there's no way I'm running changed code of that complexity without a test, and that's where the complexity lieth. However, I am almost through writing new tests, so expect to finish this next Tue.
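(The shape of the change, roughly - a generic Elasticsearch paging sketch with a placeholder index name and query, not the actual portality code:)

```python
# Generic sketch of the change (index name and query are placeholders, not the
# real portality code): instead of returning only the best match, page through
# every hit for the duplicate query so a single pass yields all duplicates.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # connection details assumed

def get_all_duplicates(query, index="doaj-article", page_size=100):
    results, offset = [], 0
    while True:
        res = es.search(index=index, body=query, from_=offset, size=page_size)
        hits = res["hits"]["hits"]
        if not hits:
            return results
        results.extend(hit["_source"] for hit in hits)
        offset += page_size
```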
How's this going?
@emanuil-tolev what's the latest here please? I really need to get these duplicates deleted ASAP
Sorry for the lack of update here Dom. This... is a bit of a mess. You can read the details at #915. I'll quote a comment from the code:
> this deletes the original article AND the duplicates when an article with duplicates is found
Obviously we want to keep the original.
This is basically a piece of technical debt - it was written originally for a different purpose and we're trying to make an essentially new feature (delete all duplicates of X) out of it. We have almost succeeded, although the code can definitely use cleaning up and streamlining. For now I am going to try to push for "succeeded and sure this isn't going to delete more than wanted" asap, but it is a difficult thing to achieve.
I'll pull in Steve for a bit of brainstorming tomorrow; hopefully we can pick up my trail and have this finished off in a few hours, and I'll provide an update again.
@emanuil-tolev OK, thanks for the update. Please make this your absolute priority. The duplication is now showing up in 3rd party databases that take metadata from us and we are getting complaints. (Not that there is any right to complain since they get our metadata for free but it damages our reputation.) I know you are working your hardest to get this done and it is appreciated!
Yeah, the only bright side is that no matter what duplication we get later, if it ever happens, cleanup will just be a matter of rerunning what we're writing now.
FYI after about 12.5% of the index in about 3 hours:
Number of articles deduplicated: 2121
Number of duplicates deleted: 10107
It's relatively slow since we're waiting a second after each delete (1. so the script is nice and quiet and doesn't load the index much, since the index is already working hard looking for each duplicate, and 2. so we don't accidentally go over the same article again after the current batch of 5000 is exhausted).
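(Roughly what the loop does, and where the two figures come from - a sketch with assumed helper names, not the actual script:)

```python
import time

def deduplicate_index(iter_articles, find_duplicates, delete_article, batch_size=5000):
    """Sketch of the cleanup loop (helper names are assumptions, not the real
    script): walk the index batch by batch, delete every duplicate found for
    each article, and pause after each delete to keep the load on the index down."""
    articles_deduplicated = 0   # articles for which at least one duplicate was found
    duplicates_deleted = 0      # records actually removed
    for article in iter_articles(batch_size):
        duplicates = find_duplicates(article)
        if not duplicates:
            continue
        articles_deduplicated += 1
        for dupe in duplicates:
            delete_article(dupe)   # keep `article`, remove the copy
            duplicates_deleted += 1
            time.sleep(1)          # be gentle on the index between deletes
    return articles_deduplicated, duplicates_deleted
```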
So this isn't done yet obviously, but it might be worth checking a few of the examples above @dommitchell . Like https://doaj.org/toc/1982-4327 - vols 22-25 seems to have been processed, so it might be worth knowing if it's looking good. This https://doaj.org/toc/1678-9946 on the other hand has a clear duplicate at the top (at least by title, we'll see if the script catches it as it's stricter).
(Note: article count before deduplicating: 2,106,987 articles)
Yup, https://doaj.org/toc/1982-4327/24 looking good. Will keep an eye on https://doaj.org/toc/1678-9946, https://doaj.org/toc/1687-0425/2004/71 and https://doaj.org/toc/1687-4129/2014
New progress update (note the % and estimates are very rough, the only "real" thing is number deduplicated and number of duplicates deleted):
2.34% done of all articles
Number of articles deduplicated: 5645
Number of duplicates deleted: 25161
I'm afraid I was a little optimistic in my last update with the 12.5% - that was 12.5% of the first 1/16th of the index... now we are at 37.5% of the first 1/16th of the index.
So we have processed about 50k articles, and about 5.5k of them have duplicates - about 11%. This seems like a very high percentage @dommitchell. Having some second thoughts about the accuracy of this, although I went far enough with tests and both Steve and I tested it out manually (on a much, much smaller set of records). I didn't run it on the test server as I didn't expect it'd find any duplicates there.
Do you have any rough idea of how many duplicates we're talking about? I can stop the process if the above figures seem too high. It'll be a bit of a pain to restart, but not impossible at all (we have logging on exactly what is being deduplicated and what is being deleted).
I have no idea how many we are talking about here and it does seem quite high. However, there may well be duplicates in the db already that were not generated by this 'bug'. I also know that both Hindawi and Scielo have been uploading content to us and both reported duplication. They are two of our biggest publishers so that amount isn't completely implausible.
I am inclined to let it run. Any deduping can only be a good thing, especially if it removes dupes that were in already.
To explain the numbers above a bit more:
> Number of articles deduplicated: 5645
This is within the 2.34% of all articles. So we've gone over the first 2.34% (about 49k) and out of them we found duplicates for 5645.
> Number of duplicates deleted: 25161
This could be from anywhere within the index, not just the first 2.34%. So the articles deduplicated go in a neat sequential fashion, but their duplicates could come from anywhere in the remaining articles.
> I am inclined to let it run. Any deduping can only be a good thing, especially if it removes dupes that were in already.
OK, I think that's what I feel at the moment as well. Also, I suppose the more we go through, the fewer duplicates will be left. The second half of the index could have a LOT fewer than the first half.
I suppose I just wasn't prepared for that many deletes, but I think we've got a good plan for the eventualities. The job is running on the server (the same one the actual DOAJ web app is running on), so as long as it doesn't run out of resources it'll keep going. And if it does, no biggie (we'll get an alert) and I'll have to enhance the script a bit to deal with the unexpected amount of data.
@emanuil-tolev I see the article total is now at 2,060,616 Articles (on homepage). I wonder if we need to explain to the community what is going on?
What's the status this morning?
Hey :). Sorry for the delay - I'm at an event about the DOAJ datastore, which will prove to be very important in time, I think.
Anyway, looking good I think. About 10% done. A lot less deletes!
Number of articles deduplicated: 22597
Number of duplicates deleted: 89018
We've done about 18.75% of the index:
Number of articles deduplicated: 46,001
Number of duplicates deleted: 156,449
Wow! Are there really that many?
What's the difference between article deduped and dupes deleted? I don't understand the nuance.
Does the process run over articles for journals not in DOAJ as well?
This is going to take a week at the current speed... I'm inclined to propose we stop it and take a good look at the log of results (like take all the IDs, get the articles and get their titles, DOIs and so on). I suppose ideally we'd have run this on a staging server, but I honestly did not expect that many!
> Does the process run over articles for journals not in DOAJ as well?
Yes, it does.
> What's the difference between article deduped and dupes deleted?
Articles deduplicated is "the articles for which at least one duplicate was found" if you will. Duplicates deleted is the actual number of article records deleted, because some other "original" record was found. The "original" is kept obviously. I say "original" in quotes because since the articles are considered to be the same, there isn't one original, we just consider the first one we chance upon (of a set of duplicates) to be the original.
> I'm inclined to propose we stop it and take a good look at the log of results (like take all the IDs, get the articles and get their titles, DOIs and so on).
Alright, I've done this now. If you'd like to put a message out:
DOAJ is working on cleaning up article-level metadata as we remain committed to maintaining a high-quality directory. We have recently developed tools to assist with the automated pruning of spurious records. As a result you may see the number of articles in DOAJ go up (new articles continue to be accepted) and down (as spurious records are deleted). We already have processes and checks in place to ensure that no new spurious records will enter the catalogue.
I'm quite busy today but I'll aim to bring up the daily 6am backup from the day I started the script into the Test server. Then I'll paste the script log and I'll explain what it means. Afterwards any of us can start spot-checking (and I will) the records via the admin search on https://testdoaj.cottagelabs.com to try to understand if we made any unnecessary deletes, or if we really really had that many duplicates, or just what's going on (maybe the metadata for some articles was not otherwise a duplicate, but their DOIs were wrong? we'll see).
@emanuil-tolev did you get the staging server up?
Almost @dommitchell , aiming for this afternoon. A fair bit of configuration that had slipped my mind for a new environment unfortunately.
Finally done with the restore here - I had to restore the dataset to the test server in the end instead of a new one (time constraints with properly configuring a new environment), and even that brought the test one down! I finally figured out why - it's not surprising - essentially, restoring 3 servers' worth of data to 1 doesn't go down well with Elasticsearch. It seemed to work - no errors in the logs, and I could access the data manually myself - so in the end I had to run DOAJ on the test server "manually" and watch what it did. That finally showed me that Elasticsearch was rejecting searches because it expected at least 2 servers to contribute to the search results.
I trimmed the data down to 1 copy with one easy command and we're back in business. It'd have been nice if the software had logged the actual problem though.
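(For anyone who hits the same thing: reducing an index to a single copy of the data generally means setting the replica count to zero. A sketch with the Python client, connection details assumed; the exact call may differ:)

```python
# Sketch (connection details assumed): drop the replica count to zero so a
# single-node restore holds one complete copy of the data and Elasticsearch
# stops waiting for shards that a second server would normally serve.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes the default localhost:9200
es.indices.put_settings(body={"index": {"number_of_replicas": 0}})
```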
Anyway, let's get on with spot-checking. Use your live account/password to log into https://testdoaj.cottagelabs.com (feel free to change them afterwards). Download https://www.dropbox.com/s/dokkdi866brclwz/dedupe_2015-11-02_081501.log?dl=0 or if that's too big to open in your favourite text editor (watch out, ~7MB text), then open up this instead: https://gist.github.com/emanuil-tolev/f177691f21e490b5b6d1 .
The format is simple:
<article for which duplicates were found>: <duplicate record 1>,
<duplicate record 2>,
...
<duplicate record N>
<another article for which duplicates were found>: <duplicate record 1>,
<duplicate record 2>,
...
<duplicate record N>
So groups of duplicate records are separated by newlines.
Concrete example, taking:
0000178c89214dc8b82df1a25c0c478e: 5e45f97e7af54bfbae3586815225643f,
a351f1f4f865469ea5cd787bd96ad83a,
0000178c89214dc8b82df1a25c0c478e is the article which would have been left on production. 5e45f97e7af54bfbae3586815225643f and a351f1f4f865469ea5cd787bd96ad83a would have been deleted in production.
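If anyone prefers to slice the log up programmatically before spot-checking, here is a small parser for the format above (a sketch that assumes exactly the layout shown: one "original: first duplicate," line per group, one duplicate ID per following line, blank lines between groups):

```python
def parse_dedupe_log(text):
    """Parse the dedupe log format shown above into {original_id: [duplicate_ids]}.
    Assumes the layout described: 'original: first_duplicate,' then one duplicate
    id per line, with blank lines separating groups."""
    groups, current = {}, None
    for raw in text.splitlines():
        line = raw.strip().rstrip(",").strip()
        if not line:
            current = None
            continue
        if ":" in line:
            original, first = [part.strip() for part in line.split(":", 1)]
            groups[original] = [first] if first else []
            current = original
        elif current is not None:
            groups[current].append(line)
    return groups

# e.g. parse_dedupe_log(open("dedupe_2015-11-02_081501.log").read())
```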
What we need to do now is some spot-checking: take a group from the log, look the IDs up in the admin search on the test server, and check whether the records really are duplicates of each other (same article, same journal).
Finally, to avoid going over the same articles, I recommend you write in this issue when you're about to start spot-checking, stating the prefix of the article IDs you're about to check. The log is ordered by article ID, so if you say "I'm spot-checking all IDs starting with 0" then the next person can try checking "all IDs starting with 1".
(By the way, that article I used as an example, 0000178c89214dc8b82df1a25c0c478e, really does seem to be triplicated. Brace yourself - some groups have a shocking number of what the script thinks are duplicates, like 7 to 10 duplicate records!)
OK, I am spot-checking all IDs starting with 4. Will report back here.
Finished spot-checking the 4s. My general observation is that we have a bit of a mixed bag of reasons for duplication. There must be a pattern here as to when duplication occurs: many duplicates were created in April and August 2015. Also, there are more articles in the database than appear in search. There has to be some reason that the database is making invisible copies of articles. I actually think you will have enough information here to save you from doing your own spot-checks.
This article appears 9 times in your CSV file: 406159dcb45841c69245a9ed00c3f9d3: 9428fe6d14314f9985fb6e8af5682761, a551b5d8542b45f4ba2f6b491a84ac89, ad1d94f679d64825926bff787e9fea4d, fa2ff0962b8c434eb5b82aabda5bbb65, 9428fe6d14314f9985fb6e8af5682761, a551b5d8542b45f4ba2f6b491a84ac89, ad1d94f679d64825926bff787e9fea4d, fa2ff0962b8c434eb5b82aabda5bbb65,
But only twice in the database which would imply that multiple copies were made but only one or two show up in search results. It's the same story for this one: 4060067e554941d9a3687fb021aac65f: 513b4e71c366434683f8cd3736dd155a, 7adbe4279e3e4472a0589e73ad1df1cd, 8da810be94564e96979fe05df282c76c, 513b4e71c366434683f8cd3736dd155a, 7adbe4279e3e4472a0589e73ad1df1cd, 8da810be94564e96979fe05df282c76c,
This article appears twice but only once on the ToC and in search: 4059bf500c364e90af9bf1ca90066ad6: e13ab23e5e7a46a5a3f99e09834b6f49
This, at least, would account for the very high delete rate.
Note also that, in the first example, the two versions appear on the toc: https://doaj.org/toc/2177-3491/60/1 and in article search: 406159dcb45841c69245a9ed00c3f9d3 and 3f6b8c11b90a4c37ae503c03af817292. The latter does NOT appear in the CSV listing so matching hasn't worked 100%. I believe that this is because one has a DOI and one doesn't. I have found several examples of this.
4061421b16324daeb09765f57190e0f5: 59f0359d6fd747fcb37b500467a99a4d - in this example, the titles are different and 59f0359d6fd747fcb37b500467a99a4d has a DOI and a 2nd ISSN. BUT THEY ARE the same article. Again, probably the DOI issue.
406123fce40149a6a09d5d2b832a24ac: 6db67df044a84a26ac94ebc599583d74 - in this example, the articles have different URLs so this is a known reason for duplication and NOT the bug.
Most of the examples I looked at occurred at some point toward the end of August which is before the bug was fixed. However, some examples occur with articles dated 5th Jan 2012 which would imply they were already in the data when it was migrated over: 406088ae781e409394c57c9cd522b4ed: 5c7a962812984ec08a1e4bce934c512e, 5c7a962812984ec08a1e4bce934c512e
The only difference between the 2 articles in this example is that one has a second ISSN: 40607d9df319441a9037e495e8af9df1: c7c8b2ec0b9c4ce8afd273b34b844f73. This would imply that the matching on the full text URL doesn't always work and it has been that way since launch.
In this example, loading took place between March-April 2015 and all 3 are COMPLETELY identical: 405cc33a64f44765a58407c902805de4: ef1e597e08814fa3a4c04fe88a82f7b4, ef1e597e08814fa3a4c04fe88a82f7b4
These articles are NOT the same at all: 4078d3a6c3c54242a87757dfab1c6f59: 42267ab6df5b4a648d17e4c3b62b70b1, 44092ef2db3e4b9aac00de5c0c84fc49, 4942a8564c25447aa1e1e07422aa6a36, 49c6f18af76448eebf8f31ef9d05d173, 49e8eb1c6f144a38ad7da1870ecc63d0, 4a7c6aac67634c9f861c2a789bdc8775, This is the only instance like this that I found over 266 lines. (There are 835 lines of 4s in total.)
> Also, there are more articles in the database than appear in search. There has to be some reason that the database is making invisible copies of articles.
We do go over all records, incl. articles not in DOAJ, so for those a TOC won't be shown. They still exist in the index and they should be findable via admin search.
> We do go over all records, incl. articles not in DOAJ, so for those a TOC won't be shown. They still exist in the index and they should be findable via admin search.
Err, but these are instances of the same article that are in DOAJ. 9 appear in your csv, only 2 appear on the toc. Same article, same journal.
A database has got in touch and said that some articles are missing. Can you check if this one was one that was removed?
"Intelligence system of supply chain management of logistic company based on the discrete event, agent and system dynamic simulation models" Vestnik Astrahanskogo Gosudarstvennogo Tehničeskogo Universiteta. Seriâ: Upravlenie, Vyčislitelʹnaâ Tehnika i Informatika, Iss 2, Pp 143-149 (2012) https://doaj.org/toc/2224-9761
Hmmmm. This isn't appearing on the Test DOAJ (I searched for the title, "Intelligence system of supply chain management of logistic company based on the discrete event, agent and system dynamic simulation models", in admin search), which has a dataset from before we ran the delete/deduplicate script. In general, you can check on Test.
@richard-jones @emanuil-tolev #1033 made some related changes to article overwriting and it's possible that until it was fixed, new duplicates were introduced into the system. I'd like a new audit done a la #915
I've discovered that the original script used in #915 was a bit too eager to detect duplicates, e.g. from a snapshot of results:
Starting 2017-05-15T15:12:38.908718 with snapshotting in 5 seconds, Ctrl+C to exit.
0066dd4375ef418fb5ac26132f95661a: 06aa386fe7724feea0463772d2103d2b,
000bebd20d2e4cf7abe40b36763c3bcc: 04b04c1930bd46fc84b88228a4cf5e12,
00130649bbc34f569afd494932eed67a: 0575cc421ee94622b0966b2289cf3498,
07995732053d44328a910c2211e46cf0,
0386771cc9fa4f779b33df641c8fcb11: 0032290ebcd549aaad60c4fd4fdfd738,
0041d73baafe411199eaf3bd19c7b0a7,
006592bf7ce54bcc97beaff366b4637b,
006ecbff77264b3eac8375ba2b8c8b36,
0105a021f7da42009a074c5c0f51bd51,
017beaa5f2e842928e62a81bdd5fa0b7,
01b0b1c9649747d1a5667f3735d997e3,
01d3331f429d4a3c97fee30e08dae682,
01f44182baa7422cb2f549d98f1a05c9,
020d15ef82bc453ca00d078409d2989c,
The first line is this search - the DOIs are both bogus so it's not a real duplicate.
The second line is this - again, the DOI is the same, and it's from the same publication.
Third is here - the full DOIs are different, but have the same beginning, as they are from the same publication.
The final group all have `undefined` as their DOI - a sample is in this search.
The relevant code is in `portality.article.XWalk.get_duplicate()`.
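For illustration, the kind of guard that would avoid these false positives (a sketch only - the DOI pattern, junk values and field names are assumptions, not the actual XWalk code): only use the DOI as a match key when it looks like a real, article-specific DOI.

```python
import re

# Sketch of a guard against over-eager DOI matching (pattern, junk list and
# field names are assumptions, not the actual portality.article.XWalk code).

DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
JUNK_DOIS = {"", "undefined", "n/a", "none"}

def doi_usable_for_matching(doi):
    doi = (doi or "").strip().lower()
    return doi not in JUNK_DOIS and bool(DOI_PATTERN.match(doi))

def duplicate_match_keys(article):
    """Return only the fields a duplicate lookup can safely match on."""
    keys = {"fulltext_url": article.get("fulltext_url")}
    if doi_usable_for_matching(article.get("doi")):
        keys["doi"] = article["doi"].strip()
    return keys
```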
https://doaj.org/toc/2166-6482/29/4
https://doaj.org/toc/2166-6482/29/3
https://doaj.org/toc/2166-6482/29/2
https://doaj.org/toc/2166-6482/29/1
In #711 @richard-jones says: "there's no doubt that the matching algorithm is working correctly - I was able to do a mock match for one of these and found the other." (https://github.com/DOAJ/doaj/issues/711#issuecomment-127292406) And yet we can see that there are circumstances in which it does not work correctly.
In this particular example, the original XML uploads were affected by the recent upload stoppage so perhaps something there is causing this. Thoughts?