fluby / neonetods

Data and associated tools for the NEON Existing Terrestrial Organism Data Survey

Sources missing from sources table #2

Closed ethanwhite closed 2 months ago

ethanwhite commented 11 years ago

The current sources table only has 73 sources, which is many fewer than the number of sources contained in the species lists.

From Ben: "The sources table only contains sources that were (A) referenced in the species lists and (B) recognized as valid Mendeley URLs and successfully looked up via a Mendeley API call. Sources are only looked up once, the first time they're encountered, and then the results are cached for the next time that source is seen. I made this decision to optimize processing time, but it would be straightforward to write a script that looked up all of the sources and loaded them into the sources table."
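A minimal sketch of the "look up once, then cache" behaviour Ben describes, written as a standalone pass over all sources. The function name `load_sources` and the injected `fetch` callable are hypothetical; the actual Mendeley API call is not shown in this thread, so it is stubbed out here.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def load_sources(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[dict]],
    cache: Optional[Dict[str, Optional[dict]]] = None,
) -> Tuple[Dict[str, dict], List[str]]:
    """Look up each distinct URL at most once and split results into found/failed.

    `fetch` is a stand-in for the real lookup (assumption): it returns a
    record dict on success and None on failure.
    """
    cache = {} if cache is None else cache
    found: Dict[str, dict] = {}
    failed: List[str] = []
    for url in urls:
        if url not in cache:
            # First time this URL is seen: do the (expensive) lookup.
            cache[url] = fetch(url)
        record = cache[url]
        if record is None:
            if url not in failed:
                failed.append(url)
        else:
            found[url] = record
    return found, failed
```

Passing every source URL from the species lists through a function like this would populate the sources table in one pass and leave the failures in a single list for follow-up.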

ethanwhite commented 11 years ago

Looks like the major source of issues was already addressed in fixing #3, but there are probably a few sources that still aren't making it in.

ethanwhite commented 11 years ago

Here is a list of source urls that are currently failing to import properly:

http://www.mendeley.com/research/vertebrate-fauna-ichauway-baker-county-ga/
http://www.mendeley.com/download/shared/2058663/5069088332/365555136935f1a4359acba8df8ed6b62d867fdc/dl.html
http://www.mendeley.com/download/shared/2058663/5072967812/7460a93d356b55a52ff964f05426339f1cc06e01/dl.html
http://people.virginia.edu/~dec5z/maps.html
http://www.mendeley.com/c/5076417672/g/2058663/euliss-1996-ecological-studies-at-the-woodworth-study-area-terrestrial-bird-communities-on-the-woodworth-stud
http://www.mendeley.com/download/shared/2058663/5069153072/b56cdadf0de3daf1d26a726e50381aaf134a85f4/dl.html
http://www.mendeley.com/c/5007365232/g/2058663/north-sterling-state-park-2012-north-sterling-state-park-birders-complete-checklist/
http://blandy.virginia.edu/arboretum/woody-plants-db
http://www.mendeley.com/c/5076425512/g/2058663/meyer-1985-classification-of-native-vegetation-at-the-woodworth-station-north-dakota-duplicated-copy-for-dcfs/
http://www.mendeley.com/c/5076345952/g/2058663/higgins-1992-waterfowl-production-on-the-woodworth-station-in-south-central-north-dakota--1965-1981-dup
http://www.mendeley.com/c/5077793952/g/2058663/drew-2012-the-vascular-flora-of-ichauway--baker-county--georgia--a-remnant-longleaf-pine--wiregrass-ecosystem/
http://www.mendeley.com/download/personal/4963921/4811524751/f1f0fcf77097f0f64ecde44fc1e355865c5b1ffd/dl.html
http://ecosystems.mbl.edu/PIE/data/LTE/data/LTE-MD-VEGQUADS.csv
http://www.mendeley.com/c/5014915452/g/2058663/beckett-1982-forest-vegegation-and-vascular-flora-of-reek-brake-research-natural-area-alabama/
http://www.mendeley.com/c/5007859052/g/2058663/meyer-1985-classification-of-native-vegetation-at-the-woodworth-station-north-dakota/
http://www.mendeley.com/c/5001200882/g/2058663/meyer-1985-classification-of-native-vegetation-at-the-woodworth-station-north-dakota/
http://www.mendeley.com/c/5001221992/g/2058663/meyer-1996-upland-vegetation-at-the-woodworth-study-area/
http://www.mendeley.com/c/5007902582/g/2058663/euliss-1996-ecological-studies-at-the-woodworth-study-area-effects-of-water-level-changes-on-prairie-pothole-vegetation-structure-and-diversity-in-the-woodworth-study-area--north-dakota/
http://www.mendeley.com/c/5018302512/g/2058663/shears-1999-central-arizona--phoenix-lter-deb-9714833-land-use-change-and-ecological-processes-in-an-urban-ecosystem-of-the-sonoran-desert-annual-progress-report-1999-2000/
http://www.mendeley.com/c/5009858662/g/2058663/hanson-1989-coleoptera-species-inhabiting-prairie-wetlands-of-the-cottonwood-lake-area-stutsman-county-north-dak
http://www.mendeley.com/c/5050303512/g/2058663/rice-2010-niche-preference-of-a-coprophagous-scarab-beetle--coleoptera--scarabaeidae--for-summer-mo
http://www.mendeley.com/c/5017520222/g/2058663/cavey-2004-survey-report-on-the-leaf-beetles-of-cove-point-lng-property-and-avicinity-calvert-county-maryland/
http://vectormap.nhm.ku.edu/vectormap/
http://www.mendeley.com/research//c/4981987782/g/2058663/forest-no-title//
http://www.mendeley.com/c/5018231082/g/2058663/conservation-2000-outdoor-alabama-volumes-72-73/
http://www.mendeley.com/c/5001218602/g/2058663/genet-2001-the-lizard-community-of-a-subtropical-dry-forest--guanica-forest--puerto-rico/
http://www.mendeley.com/download/shared/2058663/5069081292/aa40e99d20bcf1280f7493296e0bd622803c5c92/dl.html
http://www.nps.gov/romo/naturescience/amphibians_reptiles.htm

If someone has a chance to look at these and figure out what is special about them or why they are failing, that would be useful. If they have been rewritten by Mendeley, then if we can run down the current url and put the two side by side in a csv file, that would make it easier to fix things.
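For the side-by-side csv, something along these lines might do. The column names `failed_url` and `current_url` and the function name are only suggestions, and the example pair uses placeholder URLs rather than real corrections.

```python
import csv
import io
from typing import Iterable, Optional, TextIO, Tuple


def write_url_corrections(
    pairs: Iterable[Tuple[str, Optional[str]]], out: TextIO
) -> None:
    """Write (failed_url, current_url) pairs as a two-column CSV.

    A current_url of None means the replacement has not been found yet;
    it is written as an empty cell so the row can be filled in later.
    """
    writer = csv.writer(out)
    writer.writerow(["failed_url", "current_url"])
    for failed_url, current_url in pairs:
        writer.writerow([failed_url, current_url or ""])


# Usage with an in-memory buffer; a real run would pass an open file instead.
buf = io.StringIO()
write_url_corrections(
    [
        ("http://example.org/old", "http://example.org/new"),  # placeholder pair
        ("http://example.org/unknown", None),  # not yet resolved
    ],
    buf,
)
```

Keeping unresolved rows in the file with an empty second column makes it easy to see how many failures are still open.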

fluby commented 11 years ago

Some of the sources are the original urls rather than the Mendeley url. Those will be easy to fix.

ethanwhite commented 11 years ago

The sources with the standard:

http://www.mendeley.com/c/*

structure all seem to be resolving fine in the browser.

Other Mendeley urls may represent cases where the wrong url was grabbed?
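To see how the failing urls split across these shapes, a rough bucketing by host and first path segment could look like the sketch below. The category labels are made up for this illustration; they are not names used by Mendeley or by the import code.

```python
from urllib.parse import urlparse


def classify_source_url(url: str) -> str:
    """Bucket a source URL by host and first non-empty path segment.

    The returned labels are hypothetical names chosen for this sketch.
    """
    parts = urlparse(url)
    if parts.netloc != "www.mendeley.com":
        return "non-mendeley"
    # Drop empty segments so doubled slashes ("//") do not matter.
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        return "mendeley-other"
    if segments[0] == "c":
        return "mendeley-c"
    if segments[0] == "download":
        return "mendeley-download"
    if segments[0] == "research":
        return "mendeley-research"
    return "mendeley-other"
```

Counting the labels over the failed list (for example with `collections.Counter`) would show how many failures are original non-Mendeley urls, how many are download links, and how many have the standard `/c/` structure.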

ethanwhite commented 11 years ago

According to a little command line magic:

cat sp_list_* | cut -d : -f 2 --only-delimited | sort | uniq | wc

there should be 346 (347 - 1; due to a single example of "race: lilianae") unique sources from across all of the different species lists. 317 are currently being imported successfully. 28 are ending up in failed_sources. That is 345 in total, one short of the expected 346. So, the combined number is in the right ballpark. We may be missing one or I may be missing a special case in the command line check.
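To track down the one-off difference (346 expected versus 317 + 28 = 345), the three sets could be compared directly instead of by count. This is only a sketch: the function and key names are hypothetical, and it assumes the expected, imported, and failed sources are each available as a collection of url strings.

```python
from typing import Dict, Iterable, List


def reconcile_sources(
    expected: Iterable[str], imported: Iterable[str], failed: Iterable[str]
) -> Dict[str, List[str]]:
    """Compare expected sources against the imported and failed sets."""
    expected_set, imported_set, failed_set = set(expected), set(imported), set(failed)
    return {
        # Expected, but neither imported nor recorded as failed.
        "unaccounted": sorted(expected_set - imported_set - failed_set),
        # Imported or failed, but never seen in the species lists.
        "unexpected": sorted((imported_set | failed_set) - expected_set),
        # Recorded as both imported and failed.
        "in_both": sorted(imported_set & failed_set),
    }
```

If the counts above are right, `unaccounted` should hold a single entry, which would say whether a source is really being dropped or the command line check is counting a special case.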

fluby commented 11 years ago

I have identified the correct urls for 7 of these references. There are 15+ that show up in the failed sources but follow this format: http://www.mendeley.com/c/*. Am I missing something?

We will work through the failed sources, and then do finds and replaces on them.
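The find and replace step could be sketched as below. It assumes urls in the species lists are delimited by whitespace, which has not been checked against the actual files, and the function name is hypothetical. Matching whole urls matters here because several failed urls are truncated and could otherwise match the start of a longer url.

```python
import re
from typing import Dict


def apply_url_corrections(text: str, corrections: Dict[str, str]) -> str:
    """Replace each failed URL in `text` with its corrected URL.

    Only whole, whitespace-delimited occurrences are replaced (assumption
    about the file format), so a short URL never rewrites part of a longer one.
    """
    for old, new in corrections.items():
        pattern = re.compile(r"(?<!\S)" + re.escape(old) + r"(?!\S)")
        # Use a function replacement so `new` is inserted literally.
        text = pattern.sub(lambda _match, new=new: new, text)
    return text
```

The `corrections` mapping could be read straight from the two-column list of failed and corrected urls once it is exported to csv.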

fluby commented 11 years ago

I put the list of failed sources and the corrections thus far into the data folder (failed_sources_corrected_urls.xlsx). NEON peeps will get on this at our earliest convenience.

ethanwhite commented 11 years ago

I don't understand why the http://www.mendeley.com/c/* files are failing either, especially since they resolve just fine in the browser. We'll have to fire up a debugger and watch to see what is going on.