CIAT-DAPA / cwr_gap-analysis-cwr

Automatically exported from code.google.com/p/gap-analysis-cwr
1 stars 0 forks source link

Denote GBIF in raw_occurrences data; delete duplicates in GBIF data with other datasets #105

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
We need to find a way to record in raw_occurrences which records come from 
GBIF.  In order to do this, I suggest an additional column in raw_occurrences, 
called "is_GBIF".   For all records from GBIF (in ALL the data, including the 
priority and non_priority data), please put a "1" in this new field.  

For all records that come from GBIF and are "null" or "-" or blank in 
"provider_institute_id", please put "GBIF" in this field.

Original issue reported on code.google.com by colin.kh...@gmail.com on 25 Oct 2012 at 4:41

GoogleCodeExporter commented 9 years ago
mmm I agree we must provide a way to identify where the records come from, 
however I think it is a bad idea to create a columns only for GBIF.

Could it be something more general?

Original comment by htobon on 25 Oct 2012 at 1:25

GoogleCodeExporter commented 9 years ago

Original comment by alexgcv@gmail.com on 25 Oct 2012 at 7:07

GoogleCodeExporter commented 9 years ago
I don't understand why do we need to have a column to define what comes from 
GBIF?

Original comment by la.guane...@gmail.com on 25 Oct 2012 at 8:27

GoogleCodeExporter commented 9 years ago
the reason for denoting GBIF is because its hard to work with the field 
provider_institute_id with a million names from GBIF in it; so we made a way to 
identify GBIF records.  I hope this is fine with you; it is very helpful for 
cleaning and organizing.  the provider_institute_id field is critical for 
helping identify the errors we are seeing in data_public_access and correcting 
them.  

Original comment by colin.kh...@gmail.com on 26 Oct 2012 at 3:24

GoogleCodeExporter commented 9 years ago
We have denoted GBIF in "data_provider_id".  All provider names that were in 
that field have been moved to "institute_name".  We also are placing a "1" for 
all GBIF records in a new field called "from_GBIF"

Original comment by colin.kh...@gmail.com on 26 Oct 2012 at 6:10

GoogleCodeExporter commented 9 years ago
maybe only data_provider_id with "GBIF" is better than other column. what do 
you think?

Original comment by alexgcv@gmail.com on 26 Oct 2012 at 3:50

GoogleCodeExporter commented 9 years ago
We can always delete the from_GBIF field later if we dont need it and it is so 
offensive to some members of our team  ;) .  for now lets put it in, as it is 
helpful to this error cleaning process we are doing

Original comment by colin.kh...@gmail.com on 26 Oct 2012 at 4:00

GoogleCodeExporter commented 9 years ago
delete duplicated records (37 and 38 count each) from GBIF data (Alex is doing)

Original comment by colin.kh...@gmail.com on 31 Oct 2012 at 11:18

GoogleCodeExporter commented 9 years ago
Todavia tenemos un cantidad de duplicaciones entre datos de GBIF y otros 
providers (Bioversity, y USDA_NPGS_GRIN).  Creo que se puede eliminarlos con 
este: 
when institute_id is same/equal between GBIF and Bioversity or USDA_NPGS_GRIN, 
delete registos de GBIF 
con este paso, vamos a hacer un delete de por lo menos 17,000 registros en 
raw_occurrences
(recuerda de hacer lo mismo para los datos GBIF_non priorities).
(un recuerdo en general que datos de GBIF no son de la calidad que son datos 
que recibimos directo de providers.  por eso, cuando hay duplicados, eliminamos 
los de GBIF).

Original comment by colin.kh...@gmail.com on 2 Nov 2012 at 5:03

GoogleCodeExporter commented 9 years ago
delete duplicated records in GBIF that are US genera.  to do this, delete all 
records with filename = raw_CK_GBIF_CWRUS_ff.xlsx (16370 records).

Original comment by colin.kh...@gmail.com on 7 Nov 2012 at 4:38

GoogleCodeExporter commented 9 years ago
done,

+---------------------------------------------------------+----------+
| filename                                                | count(*) |
+---------------------------------------------------------+----------+
| NULL                                                    |  1065176 |
| raw_ABCIC_ff                                            |      592 |
| raw_ATGGC_ff.xlsx                                       |    41860 |
| raw_atlas_guatemala_ff.xlsx                             |     2593 |
| raw_atlas_paraguay_ff.xlsx                              |     2287 |
| raw_Avagyan_Armenia_Try_ff.xlsx                         |       11 |
| raw_AVH_all_ff.xlsx                                     |   104839 |
| raw_AVRDC_ff.xlsx                                       |    52296 |
| raw_BD_digitalization_externos_recopilacion.xlsx        |     2605 |
| raw_BD_Digitization_all_ff.xlsx                         |    10784 |
| raw_BD_Digitization_KK_ff.xlsx                          |      115 |
| raw_BD_Digization_SC_ff.xlsx                            |      800 |
| raw_BhamMaxted_ViciaPhD_ff.xlsx                         |      407 |
| raw_BhamMaxted_Vicieae_CicereaeSovietData_ff.xls        |     4131 |
| raw_BHAM_AEGRO_2009_OriginalData_Avena_ff.xls           |    65534 |
| raw_BHAM_AEGRO_Beta_Data_ORIGINALDATA_ff.xlsx           |    45163 |
| raw_BHAM_African_Vigna_Database_ff.xlsx                 |     9212 |
| raw_BHAM_asivigna_ff.xlsx                               |     1243 |
| raw_Bham_AvenaVersion_3_ff.xlsx                         |    18696 |
| raw_BHAM_CYPRUS_2012_ff.xlsx                            |       16 |
| raw_BHAM_DataCollectionTemplate_G_MU_ff.xlsx            |      232 |
| raw_BHAM_Dias_Lupinus_Portugal_ff.xlsx                  |      810 |
| raw_BHAM_Gumedze_Rhus_Sclerocarya_MSc_Data_ff.xlsx      |      480 |
| raw_BHAM_Harokapakis_Prunus_MSc_Data_ff.xlsx            |      594 |
| raw_BHAM_Hawkes_Solanum_ff.xlsx                         |     8902 |
| raw_Bham_Maxted_Pisum_ff.xls                            |    10342 |
| raw_BHAM_Medicago_ff.xlsx                               |     8121 |
| raw_BHAM_Mitchell_Brassica_PhD_data_ff.xlsx             |     7778 |
| raw_BHAM_Mt_Mulanje_Malawi_ff.xlsx                      |      595 |
| raw_BHAM_Mumtaz_Pisum_PhD_Data_Original_RNG_ff.xlsx     |      116 |
| raw_BHAM_NM_Lathyrus_Data_ff.xlsx                       |      222 |
| raw_BHAM_NM_Trifolium_Data_ff.xlsx                      |     2038 |
| raw_BHAM_NM_Vicia_Data_ff.xlsx                          |      882 |
| raw_BHAM_Nurdin_Psophocarpus_Data_ff.xls                |      275 |
| raw_Bham_PatsiouAvena_ff.xlsx                           |    23889 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_ABYSSINICUM_ff.xlsx |       49 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_ASIATICUM_ff.xlsx   |       62 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_ELATIUS_ff.xlsx     |      142 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_FULVUM_ff.xlsx      |      138 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_PUMILIO_ff.xlsx     |       34 |
| raw_BHAM_Pisum_COMBINE_ALL_GENEBANK_TRANSCAUCAS_ff.xlsx |        8 |
| raw_BHAM_Shehadeh_Lathyrus_PhD_Data_ff.xlsx             |     1130 |
| raw_BHAM_Teosinte_Passport_Data_ff.xlsx                 |      172 |
| raw_BHAM_Trifolium_CLOVER_ff.xlsx                       |     2393 |
| raw_BHAM_UK_CWR_Database_ff.xlsx                        |     1450 |
| raw_Bham_WhitehouseSecale_ff.xls                        |       94 |
| raw_BHAM_Zea_Mays_Data_V2_Diploperennis_ff.xlsx         |        7 |
| raw_BHAM_Zea_Mays_Data_V2_huehuetenange_ff.xlsx         |        5 |
| raw_BHAM_Zea_Mays_Data_V2_luxurians_ff.xlsx             |       11 |
| raw_BHAM_Zea_Mays_Data_V2_mexicana_ff.xlsx              |       72 |
| raw_BHAM_Zea_Mays_Data_V2_parviglumis_ff.xlsx           |       90 |
| raw_BHAM_Zea_Mays_Data_V2_perenniss_ff.xlsx             |        4 |
| raw_Bioversity_EUR_SING_1_b_ff.xlsx                     |   109204 |
| raw_Bioversity_EUR_SING_1_c_ff.xlsx                     |   109204 |
| raw_Bioversity_EUR_SING_1_d_ff.xlsx                     |   109204 |
| raw_Bioversity_EUR_SING_1_e_ff.xlsx                     |   109204 |
| raw_Bioversity_EUR_SING_1_f_ff.xlsx                     |    87363 |
| raw_Bioversity_EUR_SING_1_g_ff.xlsx                     |    54602 |
| raw_Bioversity_EUR_SING_1_h_ff.xlsx                     |    32864 |
| raw_Bioversity_EUR_SING_1_i_ff.xlsx                     |    13650 |
| raw_Bioversity_EUR_SIN_2_b_ff.xlsx                      |   109204 |
| raw_Bioversity_EUR_SIN_2_c_ff.xlsx                      |   109204 |
| raw_Bioversity_EUR_SIN_2_d_ff.xlsx                      |   109204 |
| raw_Bioversity_EUR_SIN_2_e_ff.xlsx                      |   109204 |
| raw_Bioversity_EUR_SIN_2_f_ff.xlsx                      |    87363 |
| raw_Bioversity_EUR_SIN_2_g_ff.xlsx                      |    54602 |
| raw_Bioversity_EUR_SIN_2_h_ff.xlsx                      |    31201 |
| raw_Bioversity_EUR_SIN_2_i_ff.xlsx                      |    13650 |
| raw_Bioversity_Musamissions_ff.xlsx                     |      231 |
| raw_BM_ff.xlsx                                          |     2235 |
| raw_Brehm_ff.xlsx                                       |      463 |
| raw_BRLU_ff.xls                                         |      262 |
| raw_CAS_ff.xlsx                                         |    11098 |
| raw_CIAT_arachis_ff.xlsx                                |     2287 |
| raw_CIAT_Hijmans_wildpotato_ff.xlsx                     |     9822 |
| raw_CIAT_Phaseolusdb_2012_ff.xls                        |     8282 |
| raw_CIAT_potatoescleaned_ff.xlsx                        |     9821 |
| raw_CIAT_rice_ff                                        |     5489 |
| raw_CIAT_Vigna_ff                                       |     7733 |
| raw_CIAT_wild_manihot_collections_v2_ff.xlsx            |     4855 |
| raw_CIP_Solanum_acaule_ff.xlsx                          |      336 |
| raw_CK_USDA_NPGS_GRIN_USCWR.xlsx                        |    19569 |
| raw_CONABIO_Teocintle599_ff.xlsx                        |      599 |
| raw_CONABIO_Tripsacum527_ff.xlsx                        |      527 |
| raw_CPNWH_CWR_ff.xlsx                                   |   117605 |
| raw_CPNWH_USCWRExtras_ff.xlsx                           |    16063 |
| raw_CRIA_ff.xls                                         |     8723 |
| raw_DAFF_Daniels_Musa_ff.xlsx                           |       24 |
| raw_E_ff.xlsx                                           |    13766 |
| raw_FSU_ff.xlsx                                         |     4611 |
| raw_F_FieldMuseum_ff.xlsx                               |    13272 |
| raw_GUA_ff.xlsx                                         |     2933 |
| raw_Herbathome_Daucus_ff.xlsx                           |       87 |
| raw_HerbatHome_other_ff.xlsx                            |     5645 |
| raw_HUH_ff.xlsx                                         |    10749 |
| raw_HUH_USCWRextras_ff.xlsx                             |     7988 |
| raw_ICCI_TelAvivUni_ff.xlsx                             |    20108 |
| raw_IITA_GPG2_Oryzawild_ff.xlsx                         |     5780 |
| raw_ILRI_herbarium_ff.xlsx                              |    16383 |
| raw_IRRI_GPG2_Oryzawild_ff.xlsx                         |     2691 |
| raw_IRRI_newdata_ff.xlsx                                |     1453 |
| raw_JABOT_all_ff.xlsx                                   |    18749 |
| raw_KewHerbcat_alldata_ff.xlsx                          |    19110 |
| raw_KewMSB_ff.xlsx                                      |     3799 |
| raw_KewMSB_NamibiaAccessions_new_ff                     |       29 |
| raw_KewSRLI_ff.xlsx                                     |     1418 |
| raw_Kew_RSAdata_ff.xlsx                                 |      102 |
| raw_KHD_DenverBG_ff.xlsx                                |     2139 |
| raw_K_Piggin_Cicer_ff.xlsx                              |      184 |
| raw_K_Piggin_Lens_Vicia_ff.xlsx                         |      121 |
| raw_LAC_biosafety_all_ff.xlsx                           |     3222 |
| raw_LE_natalia_ff.xlsx                                  |      289 |
| raw_LE_Snezhana_ff.xlsx                                 |      610 |
| raw_LISC_ff.xlsx                                        |      499 |
| raw_LISU_ff.xlsx                                        |       51 |
| raw_LISU_genebank_ff.xlsx                               |      250 |
| raw_LISU_RDE_ff.xlsx                                    |     1150 |
| raw_LSLI_ff.xlsx                                        |     4311 |
| raw_MANCH_ff.xlsx                                       |     8716 |
| raw_MA_ff.xlsx                                          |    40851 |
| raw_MHA_ff.xlsx                                         |      839 |
| raw_MW_Mikhail_1038_ff.xlsx                             |     1035 |
| raw_MW_Natalia1_ff.xlsx                                 |      160 |
| raw_NHN_all_ff.xlsx                                     |    68030 |
| raw_NM_Vicieae_and_Cicereae_Soviet_Data_ff.xlsx         |     4131 |
| raw_NY_ff.xlsx                                          |    65502 |
| raw_NY_USCWRextras.ff.xls                               |     1369 |
| raw_PBI_eggplant_new_ff.xlsx                            |     6148 |
| raw_PBI_Solanum_tomato_ff.xlsx                          |     2721 |
| raw_PGRC_Avena_ff.xls                                   |    24829 |
| raw_PH_ff.xlsx                                          |     4292 |
| raw_PH_USCWRExtras_ff.xlsx                              |      424 |
| raw_PoT_ff.xlsx                                         |    10241 |
| raw_P_ff.xlsx                                           |    13681 |
| raw_Spooner_NandCA_ff.xls                               |     3478 |
| raw_Spooner_N_SA_ff.xls                                 |     2061 |
| raw_Spooner_PetotaScone_ff.xls                          |     7584 |
| raw_Spooner_Solanumetuberosum_ff.xlsx                   |       44 |
| raw_UBC_Helianthus_1_ff.xlsx                            |       71 |
| raw_UBC_Helianthus_2_ff.xlsx                            |      435 |
| raw_UCR_ff.xlsx                                         |    21047 |
| raw_UC_Jepson_ff.xlsx                                   |   125410 |
| raw_UniCur_IPK_2010_24052011c1_ff.xlsx                  |     3941 |
| raw_UniCur_IPK_Kilian2007a_ff.xlsx                      |      387 |
| raw_UniCur_IPK_Kilian_HordSNP_ff.xlsx                   |      564 |
| raw_UNL_Neves_Eleusine_ff.xlsx                          |       33 |
| raw_USDA_NPGS_GRIN_extras_ff.xlsx                       |     4585 |
| raw_USDA_NPGS_GRIN_FF.xlsx                              |   374177 |
| raw_USDA_NPGS_GRIN_Hijmans_ff.xlsx                      |    85142 |
| raw_USDA_NPGS_SOS_ff.xlsx                               |      268 |
| raw_USDA_UFL_Kretschmer_Vigna_ff.xlsx                   |      380 |
| raw_USherbarium_ff.xlsx                                 |    25986 |
| raw_UW_Tewks_Capsicum_ff.xlsx                           |       45 |
| raw_Wang2012_soja_ff.xlsx                               |       91 |
| raw_WBDC_Feb11_ff.xls                                   |      313 |
| raw_WVU_all_ff.xlsx                                     |     6348 |
+---------------------------------------------------------+----------+

Original comment by alexgcv@gmail.com on 7 Nov 2012 at 4:50

GoogleCodeExporter commented 9 years ago

Original comment by colin.kh...@gmail.com on 9 Nov 2012 at 6:20