AtlasOfLivingAustralia / biocollect

Biocollect front end application
https://biocollect.ala.org.au/

BioCollect test shows duplicate scistarter project #559

Closed sat01a closed 8 years ago

javier-molina commented 8 years ago

It was interesting to see duplicate entries in mongo for SciStarter projects. The only cause I can think of is that SciStarter actually returned duplicate records in the response to the https://scistarter.com/finder?format=json&q= call.

The other scenario where this can happen is when the async reindexing is interrupted after the SciStarter projects have been deleted, for example by a sudden crash of ecodata.

Rerunning the SciStarter import or reindexing fixes the issue, though.

No code changes required for this.

On a related note, of the original 73 projects, 38 are ingested, 32 are marked as coming from ALA, and 3 are not returned by the SciStarter Finder API call.

The original project ids are:

1569,1480,1400,1378,1368,1318,1313,1303,1245,1205,1146,997,988,987,931,920,917,874,870,869,864,854,850,849,842,828,797,795,791,764,704,689,687,681,647,645,644,643,621,615,614,600,587,582,567,564,554,472,471,446,431,423,416,413,411,403,388,371,345,338,336,334,288,280,168,136,53,44,42,33,32,27,26

Projects listed as coming from ALA:

    1:  The Great Sunflower Project 44
    2:  EpiCollect 53
    3:  Mushroom Observer 136
    4:  OdonataCentral 288
    5:  International Sea Turtle Observation Registry (iSTOR) 388
    6:  The Great Backyard Bird Count 42
    7:  Seagrass-Watch 280
    8:  RNA World 334
    9:  AnimalsandEarth 336
   10:  WildObs 416
   11:  Wildlife Sightings - Citizen Science 423
   12:  Phylo 446
   13:  MySwan 471
   14:  Journey North 564
   15:  Pollinators.info Bumble Bee Photo Group 567
   16:  Citizen Sort 689
   17:  Save the Tasmanian Devil 791
   18:  Marine Debris Tracker 795
   19:  CyberTracker 797
   20:  Comparing the Behaviors of Wild and Captive Native Songbirds 869
   21:  Independent Generation of Research 917
   22:  Smithsonian Transcription Center 1318
   23:  Indigo V Expeditions 1378
   24:  Roadkill Survey for Road Bikers 621
   25:  The Biodiversity Group 644
   26:  OMEGA-LOCATE 704
   27:  Citizen Science 920
   28:  Spot A Ladybug 988
   29:  Where is the Elaphrus Beetle? 997
   30:  nQuire-it 1146
   31:  Scaling up marsh science 1368
   32:  Global Whale Tracking with Happywhale 1480

These projects are not returned by the Finder API but can be retrieved directly: [1569, 582, 371]
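
The breakdown can be cross-checked with plain set arithmetic. This is a minimal sketch that uses only the IDs listed in this comment; the variable names are illustrative and nothing here calls SciStarter or ecodata.

```python
# IDs of the original 73 SciStarter projects, as listed in this comment.
original_ids = {
    1569, 1480, 1400, 1378, 1368, 1318, 1313, 1303, 1245, 1205, 1146, 997,
    988, 987, 931, 920, 917, 874, 870, 869, 864, 854, 850, 849, 842, 828,
    797, 795, 791, 764, 704, 689, 687, 681, 647, 645, 644, 643, 621, 615,
    614, 600, 587, 582, 567, 564, 554, 472, 471, 446, 431, 423, 416, 413,
    411, 403, 388, 371, 345, 338, 336, 334, 288, 280, 168, 136, 53, 44, 42,
    33, 32, 27, 26,
}

# IDs of the projects listed as coming from ALA.
ala_ids = {
    44, 53, 136, 288, 388, 42, 280, 334, 336, 416, 423, 446, 471, 564, 567,
    689, 791, 795, 797, 869, 917, 1318, 1378, 621, 644, 704, 920, 988, 997,
    1146, 1368, 1480,
}

# IDs that the Finder API does not return but that can be retrieved directly.
not_in_finder = {1569, 582, 371}

# Both subsets should be drawn from the original list and should not overlap.
assert ala_ids <= original_ids
assert not_in_finder <= original_ids
assert ala_ids.isdisjoint(not_in_finder)

# Whatever remains is the set of projects that were actually ingested.
ingested = original_ids - ala_ids - not_in_finder

print(len(original_ids), len(ala_ids), len(not_in_finder), len(ingested))
```

The remaining set has 38 IDs, which matches the ingested count reported above.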

@pbrenton you will probably want to update the list of projects coming from SciStarter.

temi commented 8 years ago

@javier-molina I have a feeling this happened because the project was deleted from mongo but was not actually deleted from Elasticsearch. That would explain why the same project appeared twice. As a solution, what do you think of deleting all SciStarter projects from the homepage index before creating the SciStarter projects? That way we can be sure all of them are deleted.
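
The delete-then-recreate idea can be illustrated with an in-memory stand-in for the index. This is only a sketch under stated assumptions: the list of dicts, the `source` field, and the function name are hypothetical and do not reflect the real ecodata or Elasticsearch code.

```python
def rebuild_scistarter_entries(index_docs, fresh_projects, is_scistarter):
    """Return a new document list with all SciStarter entries replaced.

    index_docs     -- current documents in the (hypothetical) homepage index
    fresh_projects -- SciStarter documents produced by the latest import
    is_scistarter  -- predicate that recognises a SciStarter document
    """
    # Step 1: drop every existing SciStarter document, stale or not.
    kept = [doc for doc in index_docs if not is_scistarter(doc)]
    # Step 2: add the freshly imported SciStarter documents.
    return kept + list(fresh_projects)


# Usage: the index holds a leftover copy of project 44 from an earlier import.
index_docs = [
    {"id": "ala-1", "source": "ala"},
    {"id": "44", "source": "scistarter"},
    {"id": "44", "source": "scistarter"},  # stale duplicate
]
fresh = [{"id": "44", "source": "scistarter"}]

rebuilt = rebuild_scistarter_entries(
    index_docs, fresh, lambda doc: doc["source"] == "scistarter"
)
```

Because every SciStarter document is removed before the new ones are added, a leftover entry from an interrupted run cannot survive next to its replacement.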

javier-molina commented 8 years ago

After finishing this analysis I came across the other scenario I had suspected. The SciStarter call that searches for all projects sometimes returns duplicates, so I think we need to make sure we process each ID only once.
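
Processing each ID only once could look roughly like the following. This is a minimal sketch and not the change that was actually made in ecodata; it assumes each record is a dict with an `id` key, and the function name is hypothetical.

```python
def dedupe_by_id(records, key="id"):
    """Keep the first record seen for each ID, preserving the input order."""
    seen = set()
    unique = []
    for record in records:
        record_id = record[key]
        if record_id in seen:
            # The same project was returned more than once; skip the repeat.
            continue
        seen.add(record_id)
        unique.append(record)
    return unique


# Usage: a search response in which project 44 appears twice.
response = [
    {"id": 44, "name": "The Great Sunflower Project"},
    {"id": 53, "name": "EpiCollect"},
    {"id": 44, "name": "The Great Sunflower Project"},
]
projects = dedupe_by_id(response)
```

Keeping the first occurrence is an arbitrary choice here; if repeated records could differ, the import would need a rule for which copy to keep.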

The other scenario, where mongo and ES are out of sync, is more of an environment issue and can easily be solved by a full reindex. I wouldn't try to code anything for this scenario.

I will move this ticket back to In Progress to address the first scenario.

javier-molina commented 8 years ago

https://upsource.ala.org.au/ecodata/review/ECODATA-CR-58
https://github.com/AtlasOfLivingAustralia/ecodata/pull/286