EOL / tramea

A lightweight server for denormalized EOL data

Harvesting Queue / Status #186

Open JRice opened 8 years ago

JRice commented 8 years ago

Pending Resources

  1. GBIF Germany, DwC-A, was supposed to be small (test collection), but turned out to be 53,000 entries, TB; failed port should be diagnosed first
  2. Mineralogy, DwC-A, ~250,000 taxa (579,210 entries), TB
  3. GBIF Brazil, DwC-A, small; failed port should be diagnosed first
  4. GBIF Netherlands, DwC-A, 177,000 TB records; failed port should be diagnosed first
  5. GBIF Sweden, DwC-A, 122,000 TB records; failed port should be diagnosed first
  6. GBIF France, DwC-A, 236,000 TB records; failed port should be diagnosed first
  7. GBIF UK, DwC-A, 428,000 TB records; failed port should be diagnosed first
  8. flickr, connector, ~30,000 taxa, ~200,000 images; connector has run, ready for harvest
  9. WoRMS, connector, 500,000 taxa; harvesting
  10. CalPhotos, xml, 23,000 taxa, 174,000 images
  11. PaleoDB, DwC-A, TB, 264,000 taxa
  12. iDigBio type data, DwC-A, ~1,000,000 TB records. Very similar to the GBIF type records. The full resource for this will probably be a couple hundred thousand taxa
  13. Discoverlife maps, connector, ~600,000 maps; connector has run, set to force harvest
  14. NHM London type data, DwC-A, 300k TB records, but 2.4M rows in the measurementOrFact file; counting rows for metadata
  15. German Wikipedia, DwC-A, large, good sample of weird text objects with images embedded, etc.; hold off, we might do this via Wikidata instead
  16. NMNH primate measurements, DwC-A, TB, small; file expired, we need to regenerate from spreadsheet
  17. Flora of Zimbabwe, "currently fine, 10657 taxa, 17932 images, 13996 articles, xml resource file". The names have a problem: some names are just "x" (the times symbol, actually).
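
For the Flora of Zimbabwe name problem in item 17, a quick check along these lines might help flag the affected names before harvest. This is only a sketch: the helper `is_bare_hybrid_marker` and the sample names are hypothetical, not part of the harvester, and it assumes the bad names contain nothing except the times symbol (U+00D7) or a plain letter x.

```python
# Hypothetical helper, not part of the harvester: flag names that consist
# only of a hybrid marker (the times symbol U+00D7, or a plain "x"/"X").
HYBRID_MARKERS = {"\u00d7", "x", "X"}


def is_bare_hybrid_marker(name: str) -> bool:
    """Return True when the name is nothing but a times symbol (or x)."""
    return name.strip() in HYBRID_MARKERS


# Made-up sample names, for illustration only.
names = ["Aloe arborescens", "\u00d7", " x ", "Mentha \u00d7 piperita"]

# Keep only the names that are a bare marker; real hybrid names
# (marker plus an actual name) are left alone.
bad = [n for n in names if is_bare_hybrid_marker(n)]
```

Names that merely contain the marker, such as a genuine hybrid name, are not flagged; only names with nothing else in them are.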

JRice commented 8 years ago

Completed Resources

A new comment for completed resources (whether or not they have gone through top_images, since that takes so long):

  1. Anne's: 245 entries.
  2. Bees
  3. Beetles
  4. Butterflies
  5. Feller: 2067 entries.
  6. Youtube, "currently in good shape, 242 taxa, 330 videos, connector"
  7. Moorea BioCode, "currently fine, 4519 taxa, 21047 images, xml resource file" <-- Actually 5390 taxa, here, but that's fine. Took about 32 hours.
  8. zookeys xml, several thousand taxa, media
  9. phytokeys xml, several hundred taxa, media
  10. mycokeys xml, small, media
  11. JHR xml, small, media
  12. IJM xml, small, media
  13. SB xml, small, media
  14. BDJ DwC-A, very small, media
  15. DEZ xml, small, media
  16. NL xml, small, media
  17. ZSE xml, small, media
  18. BHL photostream, xml, 11072 taxa, media
  19. Mobot data, spreadsheet, previously harvested but never published, Validated, TB, 700 taxa; Sarah is checking it
  20. Bioimages, xml, 1153 taxa, images
  21. Barton, Finkel et al, 2013, DwC-A, small, TB
  22. Barton, Pershing et al, 2013, DwC-A, small, TB
  23. Vines of Puerto Rico and Virgin Islands spreadsheet, previously harvested but never published, Processed, ~600 taxa, text and TB
  24. Olenina et al, 2006, DwC-A, small
  25. AmphibiaWeb, "currently fine, 2154 taxa, 7623 articles, xml resource file"
  26. Prokaryotes DwC-A, small, TB
  27. Mexican amphibians DwC-A, small, TB
  28. Bioimages, xml, 1153 taxa, images (running again to make sure everything gets indexed)
  29. Washington Bird Phenology, DwC-A, small, TB
  30. Coral Skeletons, DwC-A, small, TB
  31. Egg Characteristics and zomgthisisalongname (again)
  32. Odonata, DwC-A, 5800 taxa, TB
  33. Bioluminescent, DwC-A, small, TB (diff, eg: removed Pleurobrachia and Hormiphora)
  34. DC Birds, DwC-A, small, text objects (diff, eg: updated attribution in http://eol.org/data_objects/33118216, Shapiro -> Unknown)
  35. Edwards, DwC-A, TB, small
  36. Chen and Moles, DwC-A, TB, small, measurements + associations
  37. vimeo, connector, xml, small, media objects. Some objects missing, possible tag problems or connector timing issue
  38. scleractinia lifestyle, DwC-A, TB, small
  39. DC Birds, DwC-A, small, text objects (diff, eg: working IUCN links)
  40. Youtube, connector, small (one object missing, but others newly added, so this is unlikely to be a harvest problem)
  41. NEW Mammal diets, DwC-A, 5,000 taxa, 26,000 TB records
  42. Arctic amphibians and reptiles, DwC-A, TB, small
  43. Arctic Birds, DwC-A, TB, small
  44. Arctic Freshwater Fishes, DwC-A, TB, small
  45. Arctic lichens, DwC-A, TB, 2200 taxa
  46. Arctic lichen ecology, DwC-A, TB, small
  47. Arctic liverworts, DwC-A, TB, small
  48. Arctic mammals, DwC-A, TB, small
  49. Arctic marine fishes, DwC-A, TB, small
  50. Arctic protists, DwC-A, TB, small
  51. Arctic Register of Marine Species, DwC-A, TB, 4200 taxa
  52. Arctic vascular plants, DwC-A, TB, 2000 taxa
  53. Arctic algae, DwC-A, TB, 1900 taxa
  54. WWF, the big file, DwC-A, 30,000 taxa, 455,000 TB records
  55. Gymnodiales, DwC-A, TB, small
  56. Arctic Alaskan Arthropods, DwC-A, taxa only, small
  57. Alaskan Arthropods, DwC-A, TB, 7000 taxa
  58. Biodiversity of Tamborine Mountain, connector, small
  59. CalPhotos, xml, 23,000 taxa, 174,000 images; updating this with partner fixes would be good
  60. Gymnodiales, revised, DwC-A, TB, small
  61. Life History Characteristics of Placental Non-Volant Mammals, DwC-A, 1400 taxa (2171 entries), TB
  62. Carnivore Dinosaurs, DwC-A, small (new), TB
  63. Parrot Fish, DwC-A, small (new), TB
  64. Odonata Other Measurements, DwC-A, 1600 taxa (new), TB
  65. test file GBIF Brazil, DwC-A, small
  66. Moorea biocode, DwC-A, 4500 taxa, 21,000 images
  67. Macroecological database of mammalian body mass, DwC-A, 4800 taxa, TB
    1. Reptile Mass, DwC-A, 6200 taxa, TB
  68. NMNH Birds with broken audio removed, xml, 3600 taxa, images and text
  69. Mikesell phenological data, DwC-A, small (new), TB
  70. Social systems of mammals, DwC-A, small, TB
  71. Dinosaur Data, DwC-A, small, TB
  72. Egg Characteristics and Breeding Season for Woods Hole Species, DwC-A, small, TB
  73. Life history data of lizards, DwC-A, small, TB
  74. Male tenure length, DwC-A, small, TB
  75. Coral skeletons, DwC-A, small, TB (updated)
  76. DC flowers, DwC-A, small, TB
  77. Amphibiaweb, connector, 2,000 taxa
  78. BHL photostream, xml, ~11,000 taxa, images; ready to update with new and improved synonym behaviour
  79. Avian Mass Data, DwC-A, 9400 taxa, TB
  80. Youtube, connector, small (updated machine tags)
  81. inaturalist, DwC-A, ~44,000 taxa, ~1M images; processing
  82. Tai EOL, XML, 900 taxa, 8000 items.
  83. Amphibiaweb, connector, 2,000 taxa, now up to date
  84. NMNH Birds, connector, 3650 taxa, media & text
  85. Benedetti, DwC-A, TB, small; TB records not showing yet
  86. Mexican amphibians, DwC-A, small, TB; genus dependent, checking for better merges
  87. Pterosaur Data, DwC-A, small, TB
  88. Toxic, DwC-A, small, TB
  89. Bird incubation, DwC-A, small, TB
  90. Life History Characteristics of Placental Non-Volant Mammals, DwC-A, 1400 taxa (2171 entries), TB
  91. Eastern US old fields plant traits, DwC-A, small, TB
  92. Carnivore Dinosaurs, DwC-A, small, TB
  93. Parrot Fish, DwC-A, small, TB
  94. Dinosaur Data, DwC-A, small, TB
  95. Reptile Mass, DwC-A, 6200 taxa, TB
  96. Social systems of mammals, DwC-A, small, TB
  97. Macroecological database of mammalian body mass, DwC-A, 4800 taxa, TB
  98. Mikesell phenological data, DwC-A, small, TB
  99. IUCN structured data, connector, TB, 70,000 taxa
  100. IUCN Red List, connector, 70,000 taxa, 400,000 articles