Closed GoogleCodeExporter closed 9 years ago
mmm I agree we must provide a way to identify where the records come from,
however I think it is a bad idea to create a columns only for GBIF.
Could it be something more general?
Original comment by htobon
on 25 Oct 2012 at 1:25
Original comment by alexgcv@gmail.com
on 25 Oct 2012 at 7:07
I don't understand why do we need to have a column to define what comes from
GBIF?
Original comment by la.guane...@gmail.com
on 25 Oct 2012 at 8:27
the reason for denoting GBIF is because its hard to work with the field
provider_institute_id with a million names from GBIF in it; so we made a way to
identify GBIF records. I hope this is fine with you; it is very helpful for
cleaning and organizing. the provider_institute_id field is critical for
helping identify the errors we are seeing in data_public_access and correcting
them.
Original comment by colin.kh...@gmail.com
on 26 Oct 2012 at 3:24
We have denoted GBIF in "data_provider_id". All provider names that were in
that field have been moved to "institute_name". We also are placing a "1" for
all GBIF records in a new field called "from_GBIF"
Original comment by colin.kh...@gmail.com
on 26 Oct 2012 at 6:10
maybe only data_provider_id with "GBIF" is better than other column. what do
you think?
Original comment by alexgcv@gmail.com
on 26 Oct 2012 at 3:50
We can always delete the from_GBIF field later if we dont need it and it is so
offensive to some members of our team ;) . for now lets put it in, as it is
helpful to this error cleaning process we are doing
Original comment by colin.kh...@gmail.com
on 26 Oct 2012 at 4:00
delete duplicated records (37 and 38 count each) from GBIF data (Alex is doing)
Original comment by colin.kh...@gmail.com
on 31 Oct 2012 at 11:18
Todavia tenemos un cantidad de duplicaciones entre datos de GBIF y otros
providers (Bioversity, y USDA_NPGS_GRIN). Creo que se puede eliminarlos con
este:
when institute_id is same/equal between GBIF and Bioversity or USDA_NPGS_GRIN,
delete registos de GBIF
con este paso, vamos a hacer un delete de por lo menos 17,000 registros en
raw_occurrences
(recuerda de hacer lo mismo para los datos GBIF_non priorities).
(un recuerdo en general que datos de GBIF no son de la calidad que son datos
que recibimos directo de providers. por eso, cuando hay duplicados, eliminamos
los de GBIF).
Original comment by colin.kh...@gmail.com
on 2 Nov 2012 at 5:03
delete duplicated records in GBIF that are US genera. to do this, delete all
records with filename = raw_CK_GBIF_CWRUS_ff.xlsx (16370 records).
Original comment by colin.kh...@gmail.com
on 7 Nov 2012 at 4:38
done,
+---------------------------------------------------------+----------+
| filename | count(*) |
+---------------------------------------------------------+----------+
| NULL | 1065176 |
| raw_ABCIC_ff | 592 |
| raw_ATGGC_ff.xlsx | 41860 |
| raw_atlas_guatemala_ff.xlsx | 2593 |
| raw_atlas_paraguay_ff.xlsx | 2287 |
| raw_Avagyan_Armenia_Try_ff.xlsx | 11 |
| raw_AVH_all_ff.xlsx | 104839 |
| raw_AVRDC_ff.xlsx | 52296 |
| raw_BD_digitalization_externos_recopilacion.xlsx | 2605 |
| raw_BD_Digitization_all_ff.xlsx | 10784 |
| raw_BD_Digitization_KK_ff.xlsx | 115 |
| raw_BD_Digization_SC_ff.xlsx | 800 |
| raw_BhamMaxted_ViciaPhD_ff.xlsx | 407 |
| raw_BhamMaxted_Vicieae_CicereaeSovietData_ff.xls | 4131 |
| raw_BHAM_AEGRO_2009_OriginalData_Avena_ff.xls | 65534 |
| raw_BHAM_AEGRO_Beta_Data_ORIGINALDATA_ff.xlsx | 45163 |
| raw_BHAM_African_Vigna_Database_ff.xlsx | 9212 |
| raw_BHAM_asivigna_ff.xlsx | 1243 |
| raw_Bham_AvenaVersion_3_ff.xlsx | 18696 |
| raw_BHAM_CYPRUS_2012_ff.xlsx | 16 |
| raw_BHAM_DataCollectionTemplate_G_MU_ff.xlsx | 232 |
| raw_BHAM_Dias_Lupinus_Portugal_ff.xlsx | 810 |
| raw_BHAM_Gumedze_Rhus_Sclerocarya_MSc_Data_ff.xlsx | 480 |
| raw_BHAM_Harokapakis_Prunus_MSc_Data_ff.xlsx | 594 |
| raw_BHAM_Hawkes_Solanum_ff.xlsx | 8902 |
| raw_Bham_Maxted_Pisum_ff.xls | 10342 |
| raw_BHAM_Medicago_ff.xlsx | 8121 |
| raw_BHAM_Mitchell_Brassica_PhD_data_ff.xlsx | 7778 |
| raw_BHAM_Mt_Mulanje_Malawi_ff.xlsx | 595 |
| raw_BHAM_Mumtaz_Pisum_PhD_Data_Original_RNG_ff.xlsx | 116 |
| raw_BHAM_NM_Lathyrus_Data_ff.xlsx | 222 |
| raw_BHAM_NM_Trifolium_Data_ff.xlsx | 2038 |
| raw_BHAM_NM_Vicia_Data_ff.xlsx | 882 |
| raw_BHAM_Nurdin_Psophocarpus_Data_ff.xls | 275 |
| raw_Bham_PatsiouAvena_ff.xlsx | 23889 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_ABYSSINICUM_ff.xlsx | 49 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_ASIATICUM_ff.xlsx | 62 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_ELATIUS_ff.xlsx | 142 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_FULVUM_ff.xlsx | 138 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_PUMILIO_ff.xlsx | 34 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_TRANSCAUCAS_ff.xlsx | 8 |
| raw_BHAM_Shehadeh_Lathyrus_PhD_Data_ff.xlsx | 1130 |
| raw_BHAM_Teosinte_Passport_Data_ff.xlsx | 172 |
| raw_BHAM_Trifolium_CLOVER_ff.xlsx | 2393 |
| raw_BHAM_UK_CWR_Database_ff.xlsx | 1450 |
| raw_Bham_WhitehouseSecale_ff.xls | 94 |
| raw_BHAM_Zea_Mays_Data_V2_Diploperennis_ff.xlsx | 7 |
| raw_BHAM_Zea_Mays_Data_V2_huehuetenange_ff.xlsx | 5 |
| raw_BHAM_Zea_Mays_Data_V2_luxurians_ff.xlsx | 11 |
| raw_BHAM_Zea_Mays_Data_V2_mexicana_ff.xlsx | 72 |
| raw_BHAM_Zea_Mays_Data_V2_parviglumis_ff.xlsx | 90 |
| raw_BHAM_Zea_Mays_Data_V2_perenniss_ff.xlsx | 4 |
| raw_Bioversity_EUR_SING_1_b_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SING_1_c_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SING_1_d_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SING_1_e_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SING_1_f_ff.xlsx | 87363 |
| raw_Bioversity_EUR_SING_1_g_ff.xlsx | 54602 |
| raw_Bioversity_EUR_SING_1_h_ff.xlsx | 32864 |
| raw_Bioversity_EUR_SING_1_i_ff.xlsx | 13650 |
| raw_Bioversity_EUR_SIN_2_b_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SIN_2_c_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SIN_2_d_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SIN_2_e_ff.xlsx | 109204 |
| raw_Bioversity_EUR_SIN_2_f_ff.xlsx | 87363 |
| raw_Bioversity_EUR_SIN_2_g_ff.xlsx | 54602 |
| raw_Bioversity_EUR_SIN_2_h_ff.xlsx | 31201 |
| raw_Bioversity_EUR_SIN_2_i_ff.xlsx | 13650 |
| raw_Bioversity_Musamissions_ff.xlsx | 231 |
| raw_BM_ff.xlsx | 2235 |
| raw_Brehm_ff.xlsx | 463 |
| raw_BRLU_ff.xls | 262 |
| raw_CAS_ff.xlsx | 11098 |
| raw_CIAT_arachis_ff.xlsx | 2287 |
| raw_CIAT_Hijmans_wildpotato_ff.xlsx | 9822 |
| raw_CIAT_Phaseolusdb_2012_ff.xls | 8282 |
| raw_CIAT_potatoescleaned_ff.xlsx | 9821 |
| raw_CIAT_rice_ff | 5489 |
| raw_CIAT_Vigna_ff | 7733 |
| raw_CIAT_wild_manihot_collections_v2_ff.xlsx | 4855 |
| raw_CIP_Solanum_acaule_ff.xlsx | 336 |
| raw_CK_USDA_NPGS_GRIN_USCWR.xlsx | 19569 |
| raw_CONABIO_Teocintle599_ff.xlsx | 599 |
| raw_CONABIO_Tripsacum527_ff.xlsx | 527 |
| raw_CPNWH_CWR_ff.xlsx | 117605 |
| raw_CPNWH_USCWRExtras_ff.xlsx | 16063 |
| raw_CRIA_ff.xls | 8723 |
| raw_DAFF_Daniels_Musa_ff.xlsx | 24 |
| raw_E_ff.xlsx | 13766 |
| raw_FSU_ff.xlsx | 4611 |
| raw_F_FieldMuseum_ff.xlsx | 13272 |
| raw_GUA_ff.xlsx | 2933 |
| raw_Herbathome_Daucus_ff.xlsx | 87 |
| raw_HerbatHome_other_ff.xlsx | 5645 |
| raw_HUH_ff.xlsx | 10749 |
| raw_HUH_USCWRextras_ff.xlsx | 7988 |
| raw_ICCI_TelAvivUni_ff.xlsx | 20108 |
| raw_IITA_GPG2_Oryzawild_ff.xlsx | 5780 |
| raw_ILRI_herbarium_ff.xlsx | 16383 |
| raw_IRRI_GPG2_Oryzawild_ff.xlsx | 2691 |
| raw_IRRI_newdata_ff.xlsx | 1453 |
| raw_JABOT_all_ff.xlsx | 18749 |
| raw_KewHerbcat_alldata_ff.xlsx | 19110 |
| raw_KewMSB_ff.xlsx | 3799 |
| raw_KewMSB_NamibiaAccessions_new_ff | 29 |
| raw_KewSRLI_ff.xlsx | 1418 |
| raw_Kew_RSAdata_ff.xlsx | 102 |
| raw_KHD_DenverBG_ff.xlsx | 2139 |
| raw_K_Piggin_Cicer_ff.xlsx | 184 |
| raw_K_Piggin_Lens_Vicia_ff.xlsx | 121 |
| raw_LAC_biosafety_all_ff.xlsx | 3222 |
| raw_LE_natalia_ff.xlsx | 289 |
| raw_LE_Snezhana_ff.xlsx | 610 |
| raw_LISC_ff.xlsx | 499 |
| raw_LISU_ff.xlsx | 51 |
| raw_LISU_genebank_ff.xlsx | 250 |
| raw_LISU_RDE_ff.xlsx | 1150 |
| raw_LSLI_ff.xlsx | 4311 |
| raw_MANCH_ff.xlsx | 8716 |
| raw_MA_ff.xlsx | 40851 |
| raw_MHA_ff.xlsx | 839 |
| raw_MW_Mikhail_1038_ff.xlsx | 1035 |
| raw_MW_Natalia1_ff.xlsx | 160 |
| raw_NHN_all_ff.xlsx | 68030 |
| raw_NM_Vicieae_and_Cicereae_Soviet_Data_ff.xlsx | 4131 |
| raw_NY_ff.xlsx | 65502 |
| raw_NY_USCWRextras.ff.xls | 1369 |
| raw_PBI_eggplant_new_ff.xlsx | 6148 |
| raw_PBI_Solanum_tomato_ff.xlsx | 2721 |
| raw_PGRC_Avena_ff.xls | 24829 |
| raw_PH_ff.xlsx | 4292 |
| raw_PH_USCWRExtras_ff.xlsx | 424 |
| raw_PoT_ff.xlsx | 10241 |
| raw_P_ff.xlsx | 13681 |
| raw_Spooner_NandCA_ff.xls | 3478 |
| raw_Spooner_N_SA_ff.xls | 2061 |
| raw_Spooner_PetotaScone_ff.xls | 7584 |
| raw_Spooner_Solanumetuberosum_ff.xlsx | 44 |
| raw_UBC_Helianthus_1_ff.xlsx | 71 |
| raw_UBC_Helianthus_2_ff.xlsx | 435 |
| raw_UCR_ff.xlsx | 21047 |
| raw_UC_Jepson_ff.xlsx | 125410 |
| raw_UniCur_IPK_2010_24052011c1_ff.xlsx | 3941 |
| raw_UniCur_IPK_Kilian2007a_ff.xlsx | 387 |
| raw_UniCur_IPK_Kilian_HordSNP_ff.xlsx | 564 |
| raw_UNL_Neves_Eleusine_ff.xlsx | 33 |
| raw_USDA_NPGS_GRIN_extras_ff.xlsx | 4585 |
| raw_USDA_NPGS_GRIN_FF.xlsx | 374177 |
| raw_USDA_NPGS_GRIN_Hijmans_ff.xlsx | 85142 |
| raw_USDA_NPGS_SOS_ff.xlsx | 268 |
| raw_USDA_UFL_Kretschmer_Vigna_ff.xlsx | 380 |
| raw_USherbarium_ff.xlsx | 25986 |
| raw_UW_Tewks_Capsicum_ff.xlsx | 45 |
| raw_Wang2012_soja_ff.xlsx | 91 |
| raw_WBDC_Feb11_ff.xls | 313 |
| raw_WVU_all_ff.xlsx | 6348 |
+---------------------------------------------------------+----------+
Original comment by alexgcv@gmail.com
on 7 Nov 2012 at 4:50
Original comment by colin.kh...@gmail.com
on 9 Nov 2012 at 6:20
Original issue reported on code.google.com by
colin.kh...@gmail.com
on 25 Oct 2012 at 4:41