componentjs / crawler.js

Registry that crawls GitHub users, used by http://component.github.io

Raw dataset GET http://component-crawler.herokuapp.com/.json contains some duplicate entries #8

Open xmojmr opened 9 years ago

xmojmr commented 9 years ago

For instance, the component with "repo":"timaschew/component-windows" is reported twice. This can be reproduced on http://component.github.io by searching for "component-windows".

After removing duplicate entries, the component count of the dataset restored in https://github.com/componentjs/crawler.js/issues/7#issuecomment-91169042 drops from 2813 to 2782.

It is not a serious problem.
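The cleanup described above can be sketched as a small helper that drops duplicate entries, keeping the first occurrence of each "repo" value. This is a hypothetical illustration, assuming the dataset is an array of component objects that each carry a "repo" string, as in the "timaschew/component-windows" example:

```javascript
// Sketch: deduplicate crawler components by their "repo" field,
// keeping the first occurrence of each repo.
function dedupeByRepo(components) {
  var seen = Object.create(null);
  return components.filter(function (component) {
    if (seen[component.repo]) return false;
    seen[component.repo] = true;
    return true;
  });
}
```

Applied to an array with two "timaschew/component-windows" entries, this keeps only the first.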

xmojmr commented 9 years ago

@timaschew according to my local test (as suggested in https://github.com/componentjs/crawler.js/issues/7#issuecomment-91330051), there are still 29 duplicates:

trevorgerhardt/view, dandv/get-urls, kelonye/component-credit-card-type, anthonyshort/calendar, anthonyshort/stitch-breakpoints, anthonyshort/date-suffix, camshaft/bootstrap-alerts, camshaft/hire, camshaft/range-fn, segmentio/model-csrf, segmentio/aphrodite-plans, segmentio/aphrodite-tabs, segmentio/highlight-yaml, wryk/save-as, matthewmueller/fullscreen, matthewmueller/aemitter, matthewmueller/mat.io, matthewmueller/device, matthewmueller/helix, wooorm/retext-ast, wooorm/gunning-fog, wooorm/spache-formula, wooorm/coleman-liau, wooorm/dale-chall-formula, wooorm/datalist-interface, frankwallis/gulp-component-resolver, bredele/array-without, bredele/array-compact, bredele/leg

Is this the result of some (still) hidden problem, or will running the user refresh clean things up?

timaschew commented 9 years ago

I don't know, but here is another strange issue: your JSON file is 1.3 MB, while Jonathan's file is 3 MB :stuck_out_tongue_winking_eye: ?

xmojmr commented 9 years ago

@timaschew the size difference may be explained by the fact that, besides some tag normalization during caching, my JSON also throws away some weight not needed for the search tool. Namely:

// properties dropped from the top-level crawler object
$redundantCrawlerProperties = array(
    'users'
);
// properties dropped from each component
$redundantComponentProperties = array(
    'scripts',
    'main',
    'repository',
    'contributors',
    'author',
    'name',
    'styles',
    'development',
    'dependencies',
    'files'
);
// properties dropped from each component's GitHub repository metadata
$redundantGithubProperties = array(
    'id',
    'name',
    'full_name',
    'description',
    'homepage',
    'url',
    'size',
    'language',
    'has_issues',
    'has_downloads',
    'has_wiki',
    'watchers',
    'default_branch',
    'master_branch',
    'score'
);
// properties dropped from the repository's owner object
$redundantOwnerProperties = array(
    'login',
    'id',
    'url',
    'site_admin'
);

The number of components and all "key" attributes should exactly match what is in the crawler's JSON.
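The actual stripping happens in my PHP caching code, but the idea can be sketched with a hypothetical helper (shown in JavaScript, the crawler's language; the function name and usage are illustrative, not from the real tool):

```javascript
// Hypothetical sketch: remove a list of redundant property names
// from an object in place, mirroring the PHP arrays above.
function stripProperties(obj, redundantProperties) {
  redundantProperties.forEach(function (name) {
    delete obj[name];
  });
  return obj;
}

// e.g. stripProperties(component, ['scripts', 'main', 'repository']);
```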

I wanted to try your new development readme instructions so that I could see/fix something with a debugger, but I could not get past the first step. S3 somehow does not like the country I'm from, and maybe also our ZIP code format(?)

My ability to review this application's code without seeing it up and running is rather limited, sorry.

Maybe the duplicates are the result of the service temporarily running out of control until its complete shutdown; in that case they don't indicate any serious flaw and this ticket can be closed. Maybe, I don't know :confused:

timaschew commented 9 years ago

> the size difference may be explained by the fact that in my json besides some tag normalization during caching I'm also throwing away some weight not needed for the search tool.

ah, okay

> I wanted to try your new development readme instructions so that I could see/fix something with the use of debugger, but I could not get past the 1st step. S3 somehow does not like the country where I'm from and maybe also our zip number format(?)

Oh, knox uses US Standard as the default region. If your bucket is in another region, you need to specify it. I've updated the readme.
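A minimal sketch of that configuration, assuming knox's `createClient` options (the bucket name, region, and environment variable names here are placeholders, not values from this project):

```javascript
// Sketch: knox defaults to the US Standard S3 region; buckets created
// elsewhere need an explicit `region` option.
var knox = require('knox');

var client = knox.createClient({
  key: process.env.S3_KEY,        // placeholder credential variables
  secret: process.env.S3_SECRET,
  bucket: 'my-crawler-bucket',    // placeholder bucket name
  region: 'eu-west-1'             // set this if the bucket is not in US Standard
});
```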