INCATools / biosample-analysis

analysis of biosamples in INSDC

pandas describe() output from parquet file #14

Closed realmarcin closed 4 years ago

realmarcin commented 4 years ago

The parquet file loaded on my laptop in under 30 minutes, and the describe() call on the resulting data frame (for all columns) took a few minutes. I am linking the describe() result in this ticket. It's a TSV file, but GitHub didn't like ...

harmonized-table.parquet_describe.txt
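
For reference, a minimal sketch of how a describe() summary like the attached one could be produced and saved as TSV. The function names and the `include="all"` argument are assumptions for illustration, not necessarily what was run for this ticket; `pd.read_parquet` also needs a parquet engine such as pyarrow installed.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return describe() statistics for every column of df.

    For a frame whose columns are all strings/objects this yields the
    four rows seen in the attachment: count, unique, top, freq.
    """
    return df.describe(include="all")


def summary_to_tsv(summary: pd.DataFrame) -> str:
    """Serialize the summary as tab-separated text (index in first column)."""
    return summary.to_csv(sep="\t")


def describe_parquet(parquet_path: str, out_path: str) -> None:
    """Hypothetical helper: load a parquet file and write its summary to out_path."""
    df = pd.read_parquet(parquet_path)  # requires pyarrow or fastparquet
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(summary_to_tsv(summarize(df)))
```

Calling `describe_parquet("harmonized-table.parquet", "harmonized-table.parquet_describe.txt")` would mirror the file names in this ticket, assuming the parquet file sits in the working directory.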

cmungall commented 4 years ago

And here it is as a Markdown table! Thanks!!

accession urogenit_disord gap_consent_short_name phenotype dose tot_nitro tot_depth_water_col nitrate samp_store_dur perturbation samp_mat_process ethnicity child_of extrachrom_elements heavy_metals_meth urogenit_tract_disor agrochem_addition tot_mass cell_line dermatology_disord submitted_subject_id nitro host_body_habitat air_temp slope_gradient gravidity dev_stage disease host_health_state watering_regm forma gap_sample_id host_tot_mass stud_book_number douche molecular_data_type host_disease sediment_type diss_carb_dioxide gaseous_environment gravity silicate annual_season_temp tot_phosphate season_environment alkyl_diethers gap_subject_id tot_part_carb salinity resp_part_matter diether_lipids samp_sort_meth gap_accession host_family_relationship sex affection_status biochem_oxygen_dem host_pulse subclone cultivar super_population_code liver_disord env_broad_scale host_phenotype pathogenicity chlorophyll humidity same_as phosplipid_fatt_acid samp_collect_device surf_temp clone_lib culture_collection passage_history estimated_size source_material_id menopause component_organism derived_from host_substrate pre_treatment air_temp_regm org_carb gestation_state water_content_soil weight_loss_3_month menarche water_temp_regm diss_hydrogen identified_by diss_inorg_carb env_medium part_org_carb time_since_last_wash infra_specific_name host_growth_cond mean_frict_vel ploidy birth_control wastewater_type breeding_history tot_org_carb dna_source indoor_space breed texture treatment body_mass_index flooding bishomohopanol depth urine_collect_meth dew_point growth_hormone_regm host_sex nitrite project_name foetal_health_stat carbapenemase medic_hist_perform potassium sexual_act subject_is_affected diss_oxygen ventilation_rate propagation subsrc_note tidal_stage extreme_event host_diet water_content_soil_meth standing_water_regm isolate paragraph host_disease_outcome study_design env_package size_frac photon_flux suspend_part_matter geo_loc_name strain building_setting 
abs_air_humidity host_hiv_stat health_state alkalinity serogroup ammonium chem_mutagen tiss_cult_growth_med soluble_inorg_mat birth_location occupant_dens_samp store_cond package_name n_alkanes sulfide drug_usage bacteria_carb_prod isolate_name_alias life_stage fluor gap_consent_code pathotype analyte_type organism_count tissue body_habitat sulfate host_infra_specific_name chloride biospecimen_repository_sample_id soil_type body_product fungicide_regm tissue_lib host_subject_id samp_salinity down_par study_complt_stat cur_land_use status conduc karyotype twin_sibling previous_land_use_meth altitude salt_regm turbidity bio_material aminopept_act pulmonary_disord filter_type pet_farm_animal amniotic_fluid_color virus_enrich_appr occupation soluble_react_phosp diss_org_nitro tertiary_treatment density status_date infra_specific_rank diss_org_carb redox_potential package description tot_carb breeding_method biomass pesticide_regm super_population_description host_description climate_environment entrez_label salinity_meth taxonomy_id antibiotic_regm growth_protocol histological_type non_mineral_nutr_regm plant_body_site al_sat oxygen mechanical_damage lab_host narms_isolate_number calcium pregnancy gynecologic_disord edta_inhibitor_tested ph mineral_nutr_regm smoker nose_throat_disord anamorph phosphate isol_growth_condt sewage_type substructure_type host_last_meal kidney_disord humidity_regm ref_biomaterial subspecf_gen_lin attribute texture_meth is_tumor efficiency_percent gastrointest_disord family_id surf_moisture_ph variety fire light_type pulse org_particles num_replicons disease_stage ventilation_type cell_type emulsions host_genotype host_color sample_name mean_peak_frict_vel serovar typ_occupant_dens compound primary_prod subgroup haplotype heat_cool_type time_last_toothbrush orgmod_note hiv_stat host_blood_press_syst pool_dna_extracts surf_material rainfall_regm chem_oxygen_dem reactor_type dry_mass microbial_biomass diet rel_to_oxygen ph_meth light_intensity 
part_org_nitro morphology soil_type_meth maternal_health_stat forma_specialis glucosidase_act host_body_mass_index tot_org_c_meth tot_n_meth db_id bac_prod travel_out_six_month crop_rotation biotic_relationship hrt herbicide_regm source_name sub_species study_name misc_param host_dry_mass encoded_traits cur_vegetation diss_inorg_nitro phaeopigments tillage sodium plant_product population samp_store_temp substrain local_class fao_class isolation_source elev previous_land_use wet_mass pollutants heavy_metals taxonomy_name bromide occup_samp microbial_biomass_meth host_blood_press_diast water_current submitter_handle last_meal age height_or_length wind_direction host_body_temp env_local_scale study_disease link_class_info org_nitro surf_air_cont primary_treatment sludge_retent_time entrez_target specimen_voucher blood_blood_disord horizon ihmc_medication_code host_occupation suspend_solids space_typ_state carb_dioxide methane diss_inorg_phosp substrate image_file indoor_surf cur_vegetation_meth inorg_particles indust_eff_percent investigation_type cell_subtype tot_diss_nitro drainage_class surf_humidity host_infra_specific_rank repository local_class_meth type_status lat_lon biomaterial_provider host_age org_matter gaseous_substances chem_administration population_description omics_observ_id wind_speed diet_last_six_month soluble_org_mat host_life_stage profile_position secondary_treatment al_sat_meth beta_lactamase_family host pathovar host_height host_body_product link_addit_analys host_wet_mass model carb_nitro_ratio subtype sample_type risk_group pressure experimental_factor horizon_meth ecotype growth_med surf_moisture samp_size hysterectomy oxy_stat_samp host_length clone biospecimen_repository authority nose_mouth_teeth_throat_disord entrez_value source_uvig death_date outbreak particle_class reference_material teleomorph special_diet link_climate_info dominant_hand collected_by title slope_aspect ph_regm collection_date host_disease_stage biosample_id 
annual_season_precpt tot_inorg_nitro biovar host_shape magnesium barometric_press sieving temp radiation_regm atmospheric_data genotype host_tissue_sampled fertilizer_regm samp_store_loc submitted_sample_id petroleum_hydrocarb extreme_salinity serotype build_occup_type rel_air_humidity race mating_type bac_resp porosity birth_date tot_phosp water_content samp_vol_we_dna_ext host_taxid label type_strain family_relationship trophic_level
count 14300583 0 4079113 185928 13196 16138 3162 16333 12891 20305 120834 98473 205 3418 491 485 8292 8541 425115 690 5077584 6350 66169 9449 1843 3519 957862 443174 135730 3772 491 5078118 10908 10080 0 2023581 666921 284 10148 505 707 5310 9550 306 2344 400 5078118 602 28475 10 1393 422 5078148 7431 6307028 7 278 2889 303 521025 2583 2290 1545145 11606 3233 8659 286 106 1548 172673 1531 73 63448 32111 100682 197228 432 3114 16749 2496 1316 3345 5307 4411 9593 1161 0 1110 1557 30567 6119 1490664 985 20 93 2807 391 102502 905 3223 14197 11112 20 10498 268720 7291 457847 60583 952 348 366794 933 451 694 337681 6103 642457 132 561 2671 6492 764 2144313 10281 71 95652 12446 788 2441 41869 5757 783 1719895 2172583 18296 4303035 216737 1401 138 751 4335007 2258713 10159 8938 4583 51487 4364 1343 13542 813 716 116 28488 8944 25959 14300584 410 2083 2390 507 169736 25321 1339 4093905 10558 4490071 17414 5490126 79382 12743 6702 9045 5078118 15895 79882 1191 3103 405118 3783 253 607 10820 14300584 9073 9939 586 573 112875 694 1359 11664 540 2194 9415 2098 1180 728 32 245 1817 273 6191 14300584 68 9806 3131 14300584 947897 2457 13744 1726 700 2583 27103 904 4957727 1303 14300584 1013 39913 1532046 694 8231 867 15652 934 29179 2909 8361 597 305 562 75716 808 10115 418 59 15285 253072 2464 159 7197 421 1323 170659 4682 0 2494 2347296 283 11655 40971 159 19988 1820 9438 1702 84 155002 56423 9145 876200 85 36746 5468 2853705 391 279570 9146 22832 229 25293 370 9165 489 1847 2955 1120 3680 708 1267 538 3832 1329 3159 55015 66442 16339 1430 655 1786 2343 418 1100 348 52869 5803 5751 14300584 441 797 4548 18620 0 930 1562938 352367 5079064 29085 1935 1040 10190 741 1104 943 6833 6966 97692 56872 20797 2655 2938 2549710 331264 4438 1 141 913 14300584 1734 9152 2917 1120 1225 5078148 606 1584572 18528 103 6496 1478832 1128467 538 1847 159 362 1193 4990218 78951 135 4256 1719 4161 673 9135 9008 6579 790 3676 1176 1649 784 177 184 464910 40882 5530 2268 159 5390 9156 1452 6265 2665127 
634126 242965 2666 192 23678 2621 1497 700 127 208 32403 912 752 432 562 2484912 2329 33960 98288 740 692 14300584 1934 23154 818835 6022 3002 15577 414 399295 3250 159 145914 28 8634 3146 5670 5078148 582 951 4990218 2400 11173 945 281 323 144 2369 674 28570 1019184 14285655 1160 1007 3821052 9847 14300584 10044 129 1885 2529 7214 1485 6293 123062 1335 386 677255 148339 2019 17083 5078126 439 1124 38361 10155 9291 85699 25974 150 999 33672 4292 2689 29968 204128 49937 1584 26878 5102
unique 14300583 0 616 9008 384 5710 500 2811 571 686 6050 1278 81 42 9 13 502 1632 39353 44 2478412 616 469 233 299 81 25801 6972 612 215 23 3996836 2030 3747 0 41 5084 37 439 9 5 2029 696 109 18 58 2797370 137 5035 7 7 59 1772 831 473 3 95 333 39 119605 5 34 17586 600 110 2279 91 106 65 4286 344 73 27149 2003 5758 115411 5 127 2206 88 63 251 480 27 3557 5 0 22 149 3025 1359 22520 439 1 14 34 8 565 16 94 955 4287 20 13 11907 780 76580 9411 21 5 12022 4 69 3 50 815 19112 3 5 16 910 41 12 2890 2 342 2552 11 43 2304 184 9 696359 648874 207 25 91 28 3 474 77091 956064 8 93 18 1103 785 88 1972 86 7 24 1655 89 439 129 15 267 118 72 169053 126 607 29 255 178 1780 50059 84 1494 156 1285 3610956 325 105 35 131 166001 165 4 5 113 2 3841 1729 16 9 5251 4 387 2862 192 28 15 18 1055 7 1 107 315 17 979 9157158 3 2162 1089 129 217986 692 293 126 7 5 3234 189 234875 11 163377 110 2046 1837 4 398 26 927 9 2172 2906 3021 12 12 5 3168 12 61 4 23 1800 5908 59 1 180 6 81 9044 249 0 49 6 68 163 9798 2 5996 44 8 39 8 394 1903 18 22758 9 2346 203 2563839 8 5743 76 2223 95 298 136 10 5 1080 17 7 34 23 9 150 129 811 1781 1073 52 670 52 306 167 23 5 47 5 9562 147 49 14300584 284 26 118 24 0 9 140514 1994 1847 8562 92 193 527 314 181 17 1083 36 5156 231 2808 52 56 166120 11397 122 1 13 63 163356 116 17 350 7 8 1395 10 30201 2395 42 233 19670 404 7 119 2 27 116 9 49108 11 13 13 60 245 8 87 481 337 173 1176 6 25 14 10 379 4116 681 10 2 129 14 17 10 171753 32666 8852 277 24 844 30 383 135 3 77 320 62 27 7 28 35539 201 1549 281 23 4 237 717 1090 30036 25 580 1001 5 23730 247 2 6743 3 39 638 1124 1395 163 29 245215 10 1408 184 65 209 33 23 12 11 26170 5176976 272 141 39983 595 14300584 809 11 99 509 2024 185 78 11981 41 40 63432 2682 44 142 3610964 44 349 2265 19 230 261 167 29 109 4446 875 1278 1226 5283 37670 14 1046 51
top SAMEA4759321 GRU Unknown Ustekinumab 45 mg not collected 20 Missing: Not provided missing not applicable missing Caucasian SAME1596757 not applicable not collected NO nothing added Missing: Restricted access HEK293T HE RA5 missing UBERON:feces 23.9 0-5% not gravid adult normal not provided every other day 100 ml to pot tray and 200 ml to pot spontanea 24148 not collected not applicable SNP Genotypes (Array) missing lithogenous Missing: Not provided not applicable not applicable missing 14.1 Not applicable summer not applicable 2817899 2 Missing: Not provided not applicable missing not applicable phs000178 not collected female affected Not applicable not collected subclone 9 not applicable AFR not collected human-associated habitat not collected human and animal missing 30-70% SAMN03291754 missing swab not collected HBCStromBahamasMic011105 missing missing missing not applicable 0 Anopheles gambiae s.s. Human rumen content DMSO maintained 20 degrees Celsius missing pregnant not collected n not applicable missing missing Missing: Not provided feces not applicable 12h DH55 not applicable not applicable missing Combined oral birth control pill effluent not applicable not collected http://www.dsmz.de/catalogues/details/culture/DSM-17241 not applicable not applicable not collected Control Not provided not collected not applicable 0 catheter not collected not applicable female missing Human gut MAGs not collected no not collected Missing: Not provided 2-3 times a week Yes missing not applicable missing Metagenome-assembled genome binned from sequencing reads available in SRR6049666 low Control herbivore gravimetric not applicable not applicable Keywords: GSC:MIxS;MIMARKS:5.0 missing Case-Control host-associated .02 micron not collected not collected USA C57BL/6 urban missing Positive healthy missing B missing not applicable not applicable #N/B not applicable 1person/(4.62.92.4 meters) -80C Generic not applicable missing none not applicable missing adult not collected 
1 missing DNA not collected Blood UBERON:feces Missing: Not provided not applicable Missing: Not provided 37 pine_forest UBERON:feces not applicable no not provided not collected -99 concluded vegetable crops live Missing: Not provided Normal not_collected not collected 0 not applicable not applicable Faeces not applicable no COPD not applicable dog not collected filtration child not collected missing Not applicable 4 x 106 2020-05-22T16:22:25 C57BL/6J Missing: Not provided not collected Generic.1.0 This sample was collected during the Tara Pacific expedition (2016-2018). Missing: Not Provided not applicable missing not applicable African missing not collected PRJNA230403 Electrolytic conductivity 9606 not applicable not applicable Blood not applicable root not collected Missing: Not provided not applicable missing 18NY01GT09 Missing: Not provided Nonpregnant previous pre-term birth no Missing: Not provided not applicable FALSE NO Fusarium oxysporum Missing: Not provided missing wastewater treatment plant not applicable not collected NO not collected missing not applicable Feel method No #N/B none 1 not collected distichon No electric light not provided #N/B missing not applicable HVAC induced pluripotent stem cell #N/B C57BL/6J black source Sample 2 not applicable Enteritidis normal DMSO not collected enterica 1 MYCN copy per haploid genome forced air system 4 hours Specimen was also spiked with blood containing Cytauxzoon felis FALSE not applicable no unknown 0.0mm 1250 sequencing batch reactor High Fiber Diet (CMF) not collected omnivore aerobe Missing: Not provided Darkness not collected muscle invasive bladder cancer Gray healthy tritici not applicable not provided LOI dry combustion gas chromatography SRA:ERS1678201, BioSample:SAMEA104013363 not collected NO RR commensal not applicable Brain enterica The Cancer Genome Atlas (TCGA) not collected not collected not applicable vegetable crops not collected 0 not collected Missing: Not provided maize NewZealand 
cattle -80 MG1655 3% humus, 22 % clay, 38 % silt Cambisol missing 193 not applicable not collected PM2.5 not collected Homo sapiens missing 1 SIR not applicable missing Framingham_SHARe banana not applicable Missing: Restricted access not collected not collected missing Neoplasms not collected missing not collected Not applicable 10 bioproject missing not collected A horizon not collected Student 15-20 g/L typical occupied 477 not applicable not collected wood NEVE/586196_D10_037.png floor not collected Not applicable Not applicable metagenome not applicable Missing: Not reported moderately well not collected not applicable American Type Culture Collection (ATCC) Swiss Federal Research Stations FAL RAC FAW, 1996. Swiss Reference Methods of the Federal Agricultural Research Stations. Soil and Substrate Research for the Fertilizer Advisory Service (Arable, Fodder, Fruit Farming, Viniculture and Horticulture), vol. 1. Swiss Federal Research Stations. non-type strain missing ATCC not provided missing H2,O2,CO2 not applicable Luhya in Webuye, Kenya not applicable 0 not collected #N/B adult not collected activated sludge not collected CTX-M Homo sapiens oryzae not provided stool not collected not collected Generic Missing: Not Provided missing metagenomic assembly Biosafety Level 2 missing gene, marine metagenome, uncultured microorganism, uncultured organism, ecotype, biological replicate, technical replicate, interspecies interaction between organisms not collected not applicable not applicable not collected .1,g missing aerobic not collected not collected Framingham_SHARe (Clemens 1865) not applicable 230403 other not applicable missing 100% silt, 0% sand inclusivity SAFE Mycosphaerella graminicola not collected not collected I am right handed CDC Sample from Homo sapiens not collected not applicable missing missing 8722381 1136.0 not collected Orientalis not applicable Missing: Not provided not collected 10g soil seived at 2mm not collected not applicable not 
collected wild type UBERON:feces not applicable freezer 37 not applicable not collected missing health care 41 Caucasian a not collected 0.3 not applicable 15.1 µg/L not applicable 0.25 g 9606 missing Yes proband heterotroph
freq 1 1300152 20296 2907 428 418 2257 978 1547 5343 8487 32 2245 300 281 432 2258 39352 108 8454 747 15182 2955 336 582 203452 100445 33603 780 344 576 1939 1770 1088323 226920 63 2242 205 689 777 1149 81 894 184 8499 126 3377 4 757 161 110105 921 3025393 3 81 780 58 115319 687 546 116964 1813 782 791 40 1 757 9910 202 1 10585 5085 61076 4699 278 1826 1837 632 96 780 722 3127 300 980 493 759 3186 1887 146111 184 20 36 363 185 55688 273 429 3032 301 1 5287 88020 1163 11156 2781 299 185 61393 498 151 689 170976 760 92143 121 195 667 1443 225 1149767 767 38 64348 312 398 436 1122 1696 492 279789 230504 3604 1928795 49254 253 105 109 491018 174305 5965 3366 2587 10023 777 339 760 689 492 40 2468 2951 1675 10186430 184 757 624 185 116 11504 105 3104963 4005 4040668 4917 1132113 46870 3021 909 3050 161 1632 47608 689 1520 21298 1661 132 291 1490 13011887 232 712 291 300 34336 492 169 1634 185 381 3189 559 122 330 32 105 721 81 1414 103752 40 1198 270 10186430 56998 276 2968 759 492 687 11014 461 129933 332 6819707 689 3497 499503 492 1613 300 3082 468 2680 3 1483 310 153 260 6565 492 2123 291 8 2388 119284 299 159 1042 291 681 103643 1683 912 2111902 40 2492 705 151 3963 324 5170 572 40 102905 6453 3171 29478 40 1846 1349 545 185 31782 2980 1161 105 15737 58 3937 236 389 1998 402 1250 151 545 48 802 264 300 4981 29205 4906 508 105 340 708 281 416 184 14582 944 738 1 105 281 576 7228 689 16877 173557 110105 1730 1069 145 552 105 246 298 1367 4826 7567 14435 5372 324 442 281602 19645 540 1 53 300 6819707 757 3130 1146 402 702 715094 192 126545 2258 36 1187 110561 168347 300 757 151 81 142 4968773 8458 121 2010 668 1534 110 4983 2951 170 105 370 1 816 300 81 81 226185 4279 1234 930 151 973 4514 504 4094 620664 38513 13050 721 81 2363 115 590 122 121 40 13212 298 145 300 191 1036785 559 16722 18134 300 681 10185822 276 3908 54054 3513 757 3917 300 107273 691 151 18678 26 1999 1077 494 715094 101 621 129948 1672 1718 82 69 98 79 658 300 18683 182255 447386 300 492 336484 
3450 1 1146 105 305 489 1486 77 1146 5247 492 106 69573 14700 689 3754 161 184 299 4807 3678 2954 38509 4796 105 146 2621 140 169 3927 92901 293 1053 4343 1057
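
With several hundred attributes, a wide table like the one above is hard to read. One possible alternative, sketched here as an assumption rather than the method used in this thread, is to transpose the describe() result so each attribute becomes one Markdown row. The function name `describe_to_markdown` is hypothetical, and the table is built by hand to avoid pandas' optional `tabulate` dependency.

```python
import pandas as pd


def describe_to_markdown(summary: pd.DataFrame) -> str:
    """Render a describe() result as a Markdown table, one row per column."""
    table = summary.T  # attributes become rows; count/unique/top/freq become columns
    header = ["column", *(str(c) for c in table.columns)]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for name, row in table.iterrows():
        cells = [str(name)]
        cells += ["" if pd.isna(value) else str(value) for value in row]
        # Escape pipes and flatten newlines so free-text 'top' values
        # cannot break the table layout.
        cells = [c.replace("|", "\\|").replace("\n", " ") for c in cells]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

The input is expected to be the data frame returned by `DataFrame.describe()`, or the TSV attachment read back with `pd.read_csv(path, sep="\t", index_col=0)`.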
wdduncan commented 4 years ago

This has been done.